As actuarial teams start using generative AI for documentation, code generation, anomaly detection and, increasingly, assumption setting, the validation playbook needs explicit extensions. The existing playbook was designed for deterministic models. It does not yet cover the failure modes that matter when an AI system is in the loop. We propose four extensions and a structured way to apply them.

What the existing playbook does well

Actuarial validation, done properly, already covers methodology, implementation, inputs, outputs, controls and documentation. For deterministic models, this works. The model is a function: given the same inputs, you get the same outputs. Validation tests are stable across runs. A model approved in March behaves the same way in October.

Where AI breaks the assumptions

Once an AI system sits in the workflow — even as a quiet helper — three of the playbook’s load-bearing assumptions weaken at the same time.

Determinism. Foundation-model outputs are not deterministic across runs by default. Two calls with the same prompt can return different results. Setting temperature to zero reduces this variation but does not guarantee identical outputs, and providers can update the model behind a given endpoint, sometimes with little or no notice.

Stationarity. The model you validated in March is, in a meaningful sense, not the same model in October. Provider versions change. Prompt templates evolve. Underlying retrieval indices grow stale. None of this shows up in your normal validation cycle.

Surface area. A deterministic model fails in predictable ways. An AI system can fail in ways that do not look like failure — confidently producing plausible-but-wrong output that a tired reviewer signs off because it reads correctly.
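
The stationarity point in particular lends itself to a simple control: record what was actually validated and compare it with what is running today. The Python sketch below shows one way to do that under stated assumptions. The `WorkflowFingerprint` fields and the helper names are hypothetical, and how you obtain the provider's model version or the index refresh date depends on your own stack.

```python
import hashlib
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class WorkflowFingerprint:
    """What was in place when the workflow was validated (illustrative fields only)."""
    model_version: str            # version string reported by the provider
    prompt_template_sha256: str   # hash of the exact prompt template text
    index_refreshed_on: date      # last refresh date of the retrieval index


def hash_template(template: str) -> str:
    """Hash the prompt template so that any edit, however small, is detectable."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def changed_fields(validated: WorkflowFingerprint, current: WorkflowFingerprint) -> list[str]:
    """Names of the fingerprint fields that differ from the validated state."""
    return [
        f.name
        for f in fields(validated)
        if getattr(validated, f.name) != getattr(current, f.name)
    ]
```

A non-empty result from `changed_fields` does not mean the workflow is wrong. It means the thing being run is no longer the thing that was validated, which is the trigger for re-running the test sets.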

Four extensions worth standardising

1. Drift testing. A workflow validated at go-live can behave differently three months later because the underlying foundation model changed. Validation should not be a one-shot event. Build a small fixed test set of queries and expected outputs, agree a tolerance in advance, and re-run the set on a defined cadence; weekly is reasonable for high-impact workflows. If the answers move outside tolerance, the workflow needs human attention before the next critical cycle.

2. Prompt sensitivity. Small changes in prompt phrasing can produce large changes in output. Test it explicitly. For each AI-assisted decision point, run the same input with three or four equivalent rewrites of the prompt. If the answers are stable, the output is at least robust to phrasing, which is a precondition for relying on it. If they are not, the workflow needs a review step before the output is used.

3. Retrieval freshness. If the AI is grounded in your assumptions documentation, your method papers or your model code, the validity of its outputs depends on the freshness and completeness of the index. Test what is in the index. Test what is missing. Test when each source was last refreshed. This is mundane data hygiene, but it is the most common silent failure mode we see in production AI workflows.

4. Reviewer effort. Human reviewers approve AI output more readily when it looks plausible — and AI output is, by construction, plausible. Build sampling protocols that resist this. Force a fraction of outputs through a deeper review with a different reviewer; force a fraction through a deliberate adversarial check; document the rate at which deeper review changes the conclusion. If that rate is meaningfully non-zero, your light review is not catching enough.
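
Each of the four extensions above can be reduced to a small, repeatable check. The Python sketch below is one possible shape for those checks, not a reference implementation. It assumes the AI-assisted step returns a single number (for example a recommended lapse rate). The `ask` wrapper, the function names and the data shapes are all hypothetical and would need adapting for text outputs and for your own tooling.

```python
import random
from datetime import date
from typing import Callable, Mapping, Sequence

# Hypothetical wrapper around the AI-assisted step: prompt in, number out.
Ask = Callable[[str], float]


def drift_failures(cases: Mapping[str, float], ask: Ask, tolerance: float) -> list[str]:
    """Extension 1: prompts in the fixed test set whose answer moved outside tolerance."""
    return [p for p, expected in cases.items() if abs(ask(p) - expected) > tolerance]


def prompt_spread(rewrites: Sequence[str], ask: Ask) -> float:
    """Extension 2: range of answers across equivalent rewrites of one prompt."""
    answers = [ask(p) for p in rewrites]
    return max(answers) - min(answers)


def index_gaps(
    index: Mapping[str, date], required: Sequence[str], as_of: date, max_age_days: int
) -> tuple[list[str], list[str]]:
    """Extension 3: required documents missing from the index, and indexed ones that are stale."""
    missing = sorted(set(required) - set(index))
    stale = sorted(
        doc for doc, refreshed in index.items() if (as_of - refreshed).days > max_age_days
    )
    return missing, stale


def pick_for_deep_review(output_ids: Sequence[str], fraction: float, seed: int) -> list[str]:
    """Extension 4: reproducible random sample of outputs to route to a second reviewer."""
    k = min(len(output_ids), max(1, round(len(output_ids) * fraction)))
    return random.Random(seed).sample(list(output_ids), k)


def change_rate(conclusion_changed: Sequence[bool]) -> float:
    """Extension 4: share of deep reviews that changed the light-review conclusion."""
    if not conclusion_changed:
        raise ValueError("no deep reviews recorded")
    return sum(conclusion_changed) / len(conclusion_changed)
```

Fixing the seed in `pick_for_deep_review` is a deliberate choice in this sketch: it lets an auditor reproduce which outputs were selected, at the cost of needing a new seed for each review cycle.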

A proposed validation structure

  1. Identify the AI-assisted decision points in the workflow. Be specific. “We use GenAI for documentation” is not a decision point; “we use GenAI to draft the methodology section of the model documentation, which is then reviewed by the model owner” is.
  2. For each decision point, specify the failure mode that would matter to the business. “Wrong assumption recommended” matters more than “documentation is slightly off-tone”.
  3. Construct test sets that probe those failure modes specifically: drift sets, prompt-equivalent sets, retrieval-coverage sets and reviewer-effort sets.
  4. Run them at a defined cadence — not just at go-live. Weekly for high-impact workflows; monthly otherwise.
  5. Document the threshold above which the workflow should be paused for human review. Make pausing a normal, expected outcome — not a crisis.
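
The five steps above can be captured in a small register that is maintained like any other control. The sketch below is illustrative only. The `DecisionPoint` fields, the 7-day and 30-day intervals standing in for "weekly" and "monthly", and the use of a failed-test share as the pause threshold are assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DecisionPoint:
    """One AI-assisted decision point and the terms on which it is validated."""
    name: str               # step 1: the specific decision point
    failure_mode: str       # step 2: the failure that would matter to the business
    high_impact: bool       # drives the cadence in step 4
    pause_threshold: float  # step 5: maximum tolerated share of failed test cases

    @property
    def cadence(self) -> timedelta:
        # Assumption: "weekly" is 7 days and "monthly" is approximated as 30 days.
        return timedelta(days=7) if self.high_impact else timedelta(days=30)


def is_due(point: DecisionPoint, last_run: date, today: date) -> bool:
    """True when the test sets for this decision point should be re-run."""
    return today - last_run >= point.cadence


def should_pause(point: DecisionPoint, failed: int, total: int) -> bool:
    """True when the failure share exceeds the documented threshold."""
    if total == 0:
        # Assumption: no test evidence is itself treated as a reason to pause.
        return True
    return failed / total > point.pause_threshold
```

Because `should_pause` returns an ordinary boolean rather than raising an error, a pause shows up as a routine outcome in the validation log, in line with step 5.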

The right framing

The right framing is not “is this AI safe to use?” — that question collapses to a yes/no answer that hides the actual risk. The right framing is “under what conditions is this AI safe to use, and how do we know we are still in those conditions?” Validation, in the AI era, becomes ongoing telemetry rather than a one-shot certification.

This is not a counsel of despair. It is a counsel of normalisation. Teams have done this for deterministic models for thirty years; the tooling and the muscle memory exist. The extensions above slot into that existing discipline. What does not work is treating AI workflows as a special case that the model risk framework does not yet cover. Boards and regulators have stopped accepting that, which is exactly as it should be.

If you need an independent validation of an AI-assisted actuarial workflow, our Modelling and Validation practice does this work — including expert-witness support where required.