Every active life and pension policy carries an anniversary date. On that date, the contract demands work be done: actuarial valuation, bonus or surplus allocation, lapse evaluation if a premium is overdue, regulatory reporting flags, sometimes a tax recalculation, sometimes a notification to the policyholder, almost always a journal entry. None of this is optional. The contract and the regulator both expect it to happen on that day.
Multiply this by a few million in-force policies, spread more or less uniformly across 365 calendar days, and you have a daily batch that touches tens of thousands of contracts and produces legally binding output. After running this pipeline for years, I can say with confidence: the calculation logic is rarely where things break. The scheduling layer is.
The Naive Model and Why It Fails
The first version of any anniversary engine looks like this:
- Take today's date.
- Find every active policy where MOD(today - inception_date, 365) = 0 or equivalent.
- Run the anniversary procedure for each.
- Commit.
It works in UAT. It survives the first year in production. Then it starts failing, and the failures look nothing like bugs in the actuarial code. They look like philosophical disagreements about what a date is.
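To make the drift concrete, here is a minimal Python sketch of the modulo rule. The function names and the sample inception date are illustrative assumptions, not taken from any real engine. It shows the rule agreeing with the calendar in the first year and then sliding backwards by one day for every leap day it crosses.

```python
from datetime import date, timedelta


def naive_is_anniversary(today: date, inception: date) -> bool:
    """The naive rule: fire whenever the elapsed days are a multiple of 365."""
    return (today - inception).days % 365 == 0


def naive_fire_date(inception: date, n: int) -> date:
    """The date on which the naive rule fires for the n-th time."""
    return inception + timedelta(days=365 * n)


inception = date(2017, 6, 15)  # illustrative inception date

# Year one contains no leap day, so the rule and the calendar agree.
print(naive_fire_date(inception, 1))  # 2018-06-15

# By year three the run has crossed 29 February 2020 and fires a day early.
print(naive_fire_date(inception, 3))  # 2020-06-14

# By year eight it has crossed two leap days and fires two days early.
print(naive_fire_date(inception, 8))  # 2025-06-13

# On the real third anniversary the rule does not fire at all.
print(naive_is_anniversary(date(2020, 6, 15), inception))  # False
```

Nothing in this sketch raises an error, which is part of the problem: the job runs cleanly and simply processes the policy on the wrong day.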
What "Today" Actually Means for a Contract
In batch insurance processing, "today" is at least four different things, and the scheduler has to pick one and defend it:
- Calendar today — the wall clock when the job starts.
- Accounting today — the open posting period, which may still be yesterday if month-end close has not run.
- Effective today — the date the policy treats as its anniversary, which is a legal construct, not a timestamp.
- Recovery today — the date a rerun is pretending to be, after a failed run from two days ago.
If your scheduler assumes these are the same, you will eventually post a bonus allocation into a closed period, or run a lapse check on a policy whose grace period ends tomorrow under the contract but today according to the system clock. Both of these are reportable incidents.
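One way to stop these dates from collapsing into each other is to carry them as separate, named fields. The sketch below is a hypothetical Python structure; `RunContext`, its field names, and `grace_expired` are assumptions for illustration. The point is only that each piece of logic has to say which "today" it means.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RunContext:
    calendar_date: date                   # wall-clock date when the job started
    accounting_date: date                 # a date inside the open posting period
    effective_date: date                  # the contractual date being processed
    recovery_date: Optional[date] = None  # set only when replaying a failed run

    def as_of(self) -> date:
        """The date contract logic should reason about."""
        return self.recovery_date or self.effective_date

    def posting_date(self) -> date:
        """Postings follow the open accounting period, not the wall clock."""
        return self.accounting_date


def grace_expired(ctx: RunContext, grace_end: date) -> bool:
    """Lapse checks compare the contract's as-of date, never calendar_date."""
    return ctx.as_of() > grace_end


# A run for 1 June that starts after midnight, before month-end close has run.
ctx = RunContext(
    calendar_date=date(2024, 6, 2),
    accounting_date=date(2024, 5, 31),
    effective_date=date(2024, 6, 1),
)
print(grace_expired(ctx, grace_end=date(2024, 6, 1)))  # False
print(ctx.posting_date())                              # 2024-05-31
```

Had the lapse check used `calendar_date`, the same policy would have been lapsed while still inside its grace period.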
February 29 and Other Calendar Hostility
A policy issued on February 29, 2016 has an anniversary on... when, exactly? Different products in the same portfolio answer this differently. Some treat February 28 as the anniversary in non-leap years. Some use March 1. Some push to the next business day. Some have product terms written before anyone thought about it, and the answer is whatever legal decides this morning.
The scheduler must know this per product, per jurisdiction, and ideally per policy generation, because terms change and old policies keep their original rules. A single global rule for leap years is a future bug.
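A small Python sketch of a per-product leap-day rule follows. The product codes, the `Feb29Rule` enum, and the lookup table are hypothetical. A real rule table would probably also be keyed by jurisdiction and policy generation, and the business-day variant is left out here. The lookup deliberately has no default, so an unmapped product fails loudly instead of inheriting a global rule.

```python
import calendar
from datetime import date
from enum import Enum


class Feb29Rule(Enum):
    FEB_28 = "feb_28"  # anniversary falls on 28 February in non-leap years
    MAR_1 = "mar_1"    # anniversary falls on 1 March in non-leap years


# Hypothetical product codes; real terms come from the product definitions.
FEB29_RULES = {
    "TERM_2010": Feb29Rule.FEB_28,
    "PENSION_2015": Feb29Rule.MAR_1,
}


def anniversary_in_year(inception: date, year: int, product_code: str) -> date:
    """Return the anniversary date for the given year under the product's rule."""
    is_leap_day = (inception.month, inception.day) == (2, 29)
    if not is_leap_day or calendar.isleap(year):
        return date(year, inception.month, inception.day)
    rule = FEB29_RULES[product_code]  # KeyError on purpose: no global default
    if rule is Feb29Rule.FEB_28:
        return date(year, 2, 28)
    return date(year, 3, 1)


issued = date(2016, 2, 29)
print(anniversary_in_year(issued, 2023, "TERM_2010"))     # 2023-02-28
print(anniversary_in_year(issued, 2023, "PENSION_2015"))  # 2023-03-01
print(anniversary_in_year(issued, 2024, "TERM_2010"))     # 2024-02-29
```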
The same applies to:
- Month-end anniversaries (a policy issued January 31 — what is the anniversary in February?).
- Weekend and holiday handling, where business-day shifts differ by product line.
- Daylight saving transitions, which matter more than people think when your cutoff is midnight and your servers are in a different time zone than your regulator.
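The first two items can be sketched in a few lines of Python. Both functions below show one possible rule each (clamp to the last day of the month, roll forward to the next business day) and are assumptions for illustration. Other products may clamp differently or roll backwards, which is exactly why the rule belongs in product configuration. Time-zone handling is not covered here.

```python
import calendar
from datetime import date, timedelta
from typing import FrozenSet


def monthly_anniversary(inception: date, year: int, month: int) -> date:
    """One possible rule: clamp the inception day to the month's last day."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(inception.day, last_day))


def roll_forward(d: date, holidays: FrozenSet[date]) -> date:
    """One possible rule: move to the next day that is not a weekend or holiday."""
    while d.weekday() >= 5 or d in holidays:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


issued = date(2023, 1, 31)
print(monthly_anniversary(issued, 2023, 2))  # 2023-02-28
print(monthly_anniversary(issued, 2024, 2))  # 2024-02-29

# 15 June 2024 is a Saturday; the holiday on the Monday is hypothetical.
print(roll_forward(date(2024, 6, 15), frozenset()))                     # 2024-06-17
print(roll_forward(date(2024, 6, 15), frozenset({date(2024, 6, 17)})))  # 2024-06-18
```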
The Sequencing Problem Nobody Documents
Anniversary processing is not one job. It is a chain: valuation must run before bonus allocation, bonus allocation before tax, tax before the general ledger (GL) posting, and the GL posting before customer letter generation. Each step has its own failure mode and its own idempotency contract.
The interesting failures happen when:
- Valuation succeeds for 47,000 policies, fails for 12, and the bonus job starts anyway because someone wired it to a time trigger instead of a completion signal.
- A policy is endorsed mid-run. Its state at step 1 no longer matches its state at step 4.
- A reinsurance treaty cession runs on the same calendar day and competes for locks on the same policy rows.
The fix is not better error handling. The fix is treating the anniversary run as a transactional unit per policy, not per step. Each policy progresses through the entire chain or none of it, and the orchestrator tracks state per contract, not per job.
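A minimal Python sketch of that idea, under stated assumptions: every step only stages its output, and a single `commit` callback writes all staged outputs for a policy once the whole chain has succeeded. The step functions, `PolicyRunState`, and the in-memory `committed` list are hypothetical stand-ins. In a real system the commit itself would have to be atomic, and the sketch does not address endorsements arriving mid-run.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, List, Optional, Tuple

# A step takes (policy_id, effective_date) and returns a staged output record.
Step = Tuple[str, Callable[[str, date], dict]]


@dataclass
class PolicyRunState:
    policy_id: str
    effective_date: date
    completed_steps: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None
    committed: bool = False


def run_policy(policy_id: str, effective_date: date, steps: List[Step],
               commit: Callable[[str, date, List[dict]], None]) -> PolicyRunState:
    """Run the whole chain for one policy; commit everything or nothing."""
    state = PolicyRunState(policy_id, effective_date)
    staged: List[dict] = []
    for name, step in steps:
        try:
            staged.append(step(policy_id, effective_date))
        except Exception:
            state.failed_step = name  # staged outputs are dropped, nothing is written
            return state
        state.completed_steps.append(name)
    commit(policy_id, effective_date, staged)
    state.committed = True
    return state


# Hypothetical steps standing in for the real chain.
def valuation(policy_id: str, effective_date: date) -> dict:
    return {"step": "valuation", "policy": policy_id}


def bonus(policy_id: str, effective_date: date) -> dict:
    if policy_id == "P2":
        raise ValueError("missing bonus rate")  # simulated failure for one policy
    return {"step": "bonus", "policy": policy_id}


def gl_posting(policy_id: str, effective_date: date) -> dict:
    return {"step": "gl_posting", "policy": policy_id}


STEPS: List[Step] = [("valuation", valuation), ("bonus", bonus), ("gl_posting", gl_posting)]
committed: List[tuple] = []


def commit(policy_id: str, effective_date: date, staged: List[dict]) -> None:
    committed.append((policy_id, effective_date, [s["step"] for s in staged]))


states = [run_policy(p, date(2024, 6, 14), STEPS, commit) for p in ("P1", "P2", "P3")]
for s in states:
    print(s.policy_id, s.committed, s.failed_step)
```

Policy P2 fails at the bonus step, so none of its outputs are committed, while P1 and P3 complete normally. The orchestrator is left with a per-policy record of exactly where P2 stopped.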
Recovery Is Where the Architecture Shows
The day the batch fails — and it will — is the day you learn whether your scheduler was designed or assembled.
Questions that need answers before the incident, not during it:
- If yesterday's run failed halfway, does today's run pick up yesterday's leftover policies, or does a separate recovery job handle them? Who decides?
- When you rerun a policy, do downstream systems detect the duplicate posting, or do they cheerfully book the bonus twice?
- If a policy was lapsed yesterday by mistake and reinstated this morning, does today's anniversary logic see it as continuously in-force, or as a fresh contract?
- Can the scheduler run a single policy on demand for a specific effective date without polluting the audit trail?
In one portfolio I worked with, recovery was handled by manually editing a control table and restarting the job. It worked for years. It also meant that the recovery path was untested every time it was used, which is to say, always.
What Actually Works
The patterns that have held up over multiple production cycles:
- Separate the calendar from the execution. The scheduler asks a calendar service "which policies have an anniversary effective on date X?" and the calendar service owns all the leap-year, month-end, holiday, and product-rule logic. The batch never does date arithmetic itself.
- Make effective date a parameter, not an assumption. Every job accepts effective_date explicitly. Production runs pass today. Recovery runs pass the original failed date. Reruns are indistinguishable from initial runs.
- Idempotency keys per policy per effective date. Downstream systems reject duplicates by key, not by heuristics.
- Per-policy state, not per-job state. The orchestrator knows policy 12345 is at step 3 of 7 for effective date 2024-06-14. Failures isolate to policies, not to the whole run.
- A dry-run mode that produces every output except the commits. Used before every quarter-end and every product launch.
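Three of these patterns (explicit effective date, idempotency keys, dry-run) fit together in a short Python sketch. The `Ledger` class, the key layout, and `run_bonus_job` are hypothetical and stand in for a downstream posting system. This is an illustration of the idea, not a reference design.

```python
from datetime import date
from decimal import Decimal
from typing import Dict, List, Tuple

# Assumed key layout: (policy_id, effective_date, posting_type).
IdempotencyKey = Tuple[str, date, str]


class Ledger:
    """Stand-in for a downstream system that rejects duplicates by key."""

    def __init__(self) -> None:
        self._postings: Dict[IdempotencyKey, Decimal] = {}

    def has(self, key: IdempotencyKey) -> bool:
        return key in self._postings

    def post(self, key: IdempotencyKey, amount: Decimal) -> bool:
        if key in self._postings:
            return False  # duplicate: rejected, not booked twice
        self._postings[key] = amount
        return True

    def count(self) -> int:
        return len(self._postings)


def run_bonus_job(ledger: Ledger, bonuses: Dict[str, Decimal],
                  effective_date: date, dry_run: bool = False) -> List[IdempotencyKey]:
    """Return the keys that were accepted, or would be accepted in a dry run."""
    accepted: List[IdempotencyKey] = []
    for policy_id, amount in sorted(bonuses.items()):
        key = (policy_id, effective_date, "BONUS")
        ok = (not ledger.has(key)) if dry_run else ledger.post(key, amount)
        if ok:
            accepted.append(key)
    return accepted


ledger = Ledger()
bonuses = {"12345": Decimal("100.00"), "67890": Decimal("50.00")}
run_date = date(2024, 6, 14)

print(len(run_bonus_job(ledger, bonuses, run_date, dry_run=True)), ledger.count())  # 2 0
print(len(run_bonus_job(ledger, bonuses, run_date)), ledger.count())                # 2 2
print(len(run_bonus_job(ledger, bonuses, run_date)), ledger.count())                # 0 2
```

The third call is a rerun with the same effective date. It goes through exactly the same code path as the initial run and books nothing, because every key already exists.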
The Real Lesson
The calculation engine in a life insurance batch is the part everyone reviews, tests, and trusts. It is also rarely the part that causes the 7 a.m. phone call. The scheduling layer — the thing that decides which policies are processed today, in what order, with what effective date, and what happens when it fails — is where the institutional risk lives.
Treat it like the architectural surface it is. Not like cron.