Working with AI agents: a practical guide for actuaries

The most common mistake actuarial teams are making with AI agents in 2026 is choosing the most autonomous pattern they can defend, when most of the productivity is in the most boring one they can ship. The technology is no longer the constraint. The constraint is judgement about which deployment pattern earns its place — and the discipline to retire the agents that do not.

By Johann van Rooyen, FASSA, CEO

What an “agent” actually is

Strip the marketing and an AI agent is a piece of software that can be given a goal, work through it across several steps, use other tools along the way, and stop to either ask for input or hand back a result. A large language model sits at the centre. What turns the model from a chatbot into an agent is the ability to read documents, query a database, run a calculation, draft a file and check its own output.

Anthropic’s engineering team draws a useful distinction in their Building Effective Agents guide: workflows are systems where a human pre-defines the steps; agents are systems where the model itself decides which steps to take and which tools to use. Their advice is unambiguous — start with the simplest pattern that works, and only add agentic complexity when you genuinely need it. For actuarial work, almost everything useful today sits closer to a guided workflow than to a fully autonomous agent. That is by design, not by accident.

Think of an agent as a very fast, very literal new analyst who has read a lot, never gets tired, follows instructions exactly, and will confidently make things up unless given the right material to work from.

Five deployment patterns, ranked smallest-first

Most actuarial teams will get most of their value from the first three patterns and never need the last two. Pick from the top of this list, not the bottom.

1. Single-shot copilot. Microsoft 365 Copilot inside Excel, Word, PowerPoint and Outlook. Enterprise-grade, no engineering required, immediate productivity for senior actuaries reading long documents and drafting commentary. Highest immediate ROI; lowest risk surface. Most teams should start here and stop here for at least a quarter.

2. Document corpus + retrieval (RAG). A controlled set of approved documents — product wordings, methodology notes, prior memos, valuation guidance — with citation-backed Q&A on top. Removes a remarkable volume of low-value back-and-forth from senior actuaries’ inboxes. The pattern most likely to surprise the team with how useful it is.

3. Code-execution agent. Plain-English data exploration (“show lapse rates by product and duration for the last three years, by sales channel”) that returns reviewable SQL or Python rather than a black-box answer. Translation of legacy VBA or SAS into Python. Reconciliation triage. Best deployed inside the warehouse or notebook environment your team already uses, with read-only access.

4. Workflow-with-handoffs. A scoped, supervised pipeline — for example, a quarterly reserving narrative — that the agent drafts end-to-end, with explicit human review gates at each step. Defensible because every gate has a named owner. Worth the investment only after patterns 1–3 are working well.

5. Sub-agents under a manager. One agent decomposing a goal into chunks, delegating to other agents, reviewing their output. The pattern Anthropic and others have shown can work for software engineering. We have not yet seen a regulated actuarial workflow where this earns its place over pattern 4. Most teams should not yet be here.
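To make pattern 3 concrete, here is a minimal sketch of "reviewable SQL on a read-only connection", using Python's built-in sqlite3 module. The table name, columns and figures are hypothetical and made up for illustration; a real deployment would sit inside your own warehouse and rely on its access controls rather than on an in-memory database.

```python
import sqlite3

# Hypothetical schema and made-up figures, for illustration only.
SETUP_SQL = """
CREATE TABLE policy_exposure (
    product TEXT, channel TEXT, duration_year INTEGER,
    exposure_years REAL, lapses INTEGER
);
INSERT INTO policy_exposure VALUES
    ('Term',    'Broker', 1, 1000.0,  80),
    ('Term',    'Broker', 2,  900.0,  45),
    ('Term',    'Direct', 1,  500.0,  60),
    ('Funeral', 'Direct', 1, 2000.0, 400);
"""

# The kind of query an agent might hand back for a reviewer to read
# before anything is run: lapses divided by exposure, by segment.
REVIEWABLE_SQL = """
SELECT product, channel, duration_year,
       SUM(lapses) * 1.0 / SUM(exposure_years) AS lapse_rate
FROM policy_exposure
GROUP BY product, channel, duration_year
ORDER BY product, channel, duration_year
"""


def open_read_only_demo_db():
    """Build the demo data, then switch the connection to query-only mode."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    # SQLite pragma that rejects any statement that would change the database.
    conn.execute("PRAGMA query_only = ON")
    return conn


def run_reviewed_query(conn, sql):
    """Run a query that a human has already read and approved."""
    return conn.execute(sql).fetchall()


conn = open_read_only_demo_db()
rows = run_reviewed_query(conn, REVIEWABLE_SQL)
for row in rows:
    print(row)

# A write attempted through the same connection is refused.
try:
    conn.execute("DELETE FROM policy_exposure")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
print("write blocked:", write_blocked)
```

The point of the design is that the reviewer signs off the query text, not just the number it produces, and that the connection itself cannot change the data even if the query is wrong.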

Where agents fail in actuarial work, specifically

Generic warnings about hallucination are not enough to design controls around. Four failure modes recur in actuarial deployments, and each has a specific countermeasure.

Stale assumption resolution. The agent quotes a lapse rate that was current six months ago because the assumption document it retrieved the rate from has not been updated. Fix: tie the assumption store to a versioned Symphony Assumptions service, not to a folder of PDFs.
Hallucinated regulatory references. The agent cites an IFRS 17 paragraph or a Prudential Authority directive that does not exist, or that exists but says something different. Fix: ground every regulatory citation in a fixed, version-controlled rule book; reject ungrounded outputs.
Plausible-but-wrong methodology summaries. The agent produces a method note that reads correctly but conflates two different methods or quietly drops the caveat that protects the firm. Fix: enforce structured templates with mandatory caveat fields; review specifically for omissions, not just additions.
Cross-product leakage. Wording from one product’s memo leaks into another product’s memo because retrieval matched on style rather than on product identity. Fix: filter the retrieval index by product before the model sees results.
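Two of the four fixes above are simple enough to sketch in a few lines. The example below is illustrative only: the Chunk structure, the rule book contents and the reference identifiers (RULE-A and so on) are placeholders invented for this sketch, not real regulatory paragraphs or any particular retrieval library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One retrieved passage, tagged with the product it belongs to."""
    product: str
    text: str
    score: float  # similarity score from whatever retriever is in use


def filter_by_product(chunks, product):
    """Keep only chunks tagged with the requested product, best score first."""
    kept = [c for c in chunks if c.product == product]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical version-controlled rule book: version -> allowed reference ids.
RULE_BOOK = {
    "2025.1": {"RULE-A", "RULE-B", "RULE-C"},
}


def ungrounded_citations(cited, rule_book_version):
    """Return the cited references that are not in the given rule book version."""
    allowed = RULE_BOOK[rule_book_version]
    return sorted(set(cited) - allowed)


# Cross-product leakage: the Funeral chunk never reaches the model,
# however well it matched on style.
chunks = [
    Chunk("Funeral", "waiting period wording", 0.91),
    Chunk("Term", "exclusions wording", 0.88),
    Chunk("Term", "premium guarantee wording", 0.93),
]
term_only = filter_by_product(chunks, "Term")
print([c.text for c in term_only])

# Hallucinated references: any citation outside the rule book rejects the draft.
missing = ungrounded_citations(["RULE-A", "RULE-Z"], "2025.1")
if missing:
    print("Reject draft, ungrounded citations:", missing)
```

Both checks are deliberately dumb. They run outside the model, so a fluent but wrong output cannot talk its way past them.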

The general pattern: agents tend to be confidently wrong rather than visibly confused. Validation must be designed for that failure mode — see our piece on independent validation for AI-assisted actuarial workflows for the full discipline.

Two failure surfaces: the bot and the human

Risk conversations about AI agents tend to focus only on the agent. That is half the picture. Errors enter the workflow from two surfaces now, and the controls have to cover both.

Bot-driven errors are the ones the literature describes: hallucinated facts, fabricated regulatory citations, plausible-but-wrong methodology, retrieval that returned the wrong document confidently. These failures are not random — they have a signature. The agent is fluent, certain, and wrong in ways that read as right. When an agent has the ability to act — write to a database, send an email, post a number into a return — the same failure compounds, because an unreviewed wrong answer becomes a wrong action. Multi-step or sub-agent patterns make this worse: a small error in step one is rationalised in step two and treated as established fact by step three.

Human-driven errors are the ones that quietly get worse as AI usage scales, and they are easy to under-weight in risk papers. A reviewer who used to read fifteen pages of working in detail is now signing off forty pages of agent-drafted commentary in the same window. The risk is not that the human stops reading — it is that the human starts skimming, because the output looks more polished than the last twenty things that passed review. Rubber-stamping is the new failure mode. So is sloppy upstream prompting — a vague brief produces a confident, off-target answer that the reviewer is now obliged to catch. And so is the inverse: a reviewer who, having been burned by hallucination once, treats every output as suspect and quietly stops adding value.

Both error modes have the same mitigation, and it is the oldest one in actuarial practice: proper review levels. First-line preparation, second-line independent review, third-line sign-off by an accountable senior. Nothing about AI changes that hierarchy. What AI changes is the economics of running it. The grind work — reconciliation, formatting, rebuilding the prior period — that used to consume sixty or seventy percent of a reviewer’s time should now consume a fraction of it. In theory there is more reviewer time available than at any point in the last two decades, and that time has to be spent on the thing the agent cannot do: judging whether the answer is the right answer.

The teams that get this wrong cut review headcount because “the agent does the work now.” The teams that get it right keep the review headcount and redirect it toward deeper, slower, more sceptical reading of fewer but higher-stakes outputs. The fruit of AI investment, properly designed, is not fewer actuaries reviewing — it is the same actuaries reviewing more carefully.

Manage agents like junior staff — only more carefully

The cultural shift the IAA papers describe is real. Five practical management moves keep it on the rails.

Scope card. Every agent has a one-page document naming its purpose, the data it can read, the data it cannot, the outputs it produces, and the named human reviewer.
Weekly evaluation cadence. A small fixed test set, run weekly, for any agent in production. Drift outside tolerance pauses the agent until the human reviewer signs it off again.
Monthly red-team rotation. A different team member tries to break the agent every month — adversarial prompts, edge cases, deliberate misdirection. Findings flow into the test set.
Named escalation path. When to pause is decided in advance, not in the moment. “If the agent cites a regulation it cannot ground, the workflow halts” is a much better rule than “use judgement.”
Retire deliberately. An agent that stops earning its place gets retired, not patched. The catalogue of live agents stays small on purpose.
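The weekly evaluation and pause rule can be sketched as follows. This is a minimal illustration under stated assumptions: the EvalCase and AgentStatus names, the stub agent, the reviewer name and all figures are hypothetical, and a production harness would also log results and compare text outputs, not just numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One item in the small fixed test set."""
    question: str
    expected: float
    tolerance: float


@dataclass
class AgentStatus:
    """Tracks whether an agent may run, and who is allowed to re-enable it."""
    name: str
    reviewer: str
    paused: bool = False

    def apply_weekly_result(self, failures):
        # Any drift outside tolerance pauses the agent.
        if failures:
            self.paused = True

    def sign_off(self, who):
        # Only the named reviewer on the scope card can lift the pause.
        if who == self.reviewer:
            self.paused = False
        return not self.paused


def evaluate(agent_fn, test_set):
    """Return the questions whose answers fall outside tolerance."""
    failures = []
    for case in test_set:
        actual = agent_fn(case.question)
        if abs(actual - case.expected) > case.tolerance:
            failures.append(case.question)
    return failures


def stub_agent(question):
    """Stand-in for a real agent call, returning made-up numbers."""
    return {"term_lapse_rate_y1": 0.081, "expense_ratio": 0.30}[question]


test_set = [
    EvalCase("term_lapse_rate_y1", expected=0.08, tolerance=0.005),
    EvalCase("expense_ratio", expected=0.25, tolerance=0.01),
]

status = AgentStatus(name="reserving-narrative", reviewer="A. Reviewer")
failures = evaluate(stub_agent, test_set)
status.apply_weekly_result(failures)
print("failures:", failures, "| paused:", status.paused)
```

Findings from the monthly red-team rotation would be added to test_set as new EvalCase entries, so the weekly run gets stricter over time without anyone having to remember the edge case.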

The IP question, compressed

The single biggest blocker to AI adoption inside South African insurers is the legitimate fear that confidential data ends up training someone else’s model. The answer is now clear: what protects you is the contract you operate under, not the brand of model you use. Microsoft 365 Copilot under Enterprise Data Protection, ChatGPT Enterprise, and Claude for Work / Enterprise under Anthropic Commercial Terms all contractually exclude training on customer inputs. Consumer-grade ChatGPT and free chatbots do not. The fix is not to ban AI (staff will use it anyway) but to give them an enterprise-safe alternative that is at least as easy to use.

Governance — name what you’re following, don’t invent

The supervisory picture is now coherent. The IAA AI Task Force has published a working set of papers on AI Governance, Testing and Documentation. EIOPA’s August 2025 Opinion sets supervisory expectations on data governance, fairness, explainability and human oversight. The NAIC Model Bulletin is now in force in over half of US states. In South Africa, the November 2025 joint FSCA / Prudential Authority report on AI in the financial sector found banks at around 52 percent adoption versus insurers at 8 percent, and signalled that further sector-specific guidance is coming. Cross-sector, the NIST AI Risk Management Framework and ISO/IEC 42001 keep showing up as the underlying scaffolds.

Map your firm’s AI use to one of these. Stay close to the actuarial-specific guidance from the IAA. The ASSA Code of Professional Conduct still applies in full — integrity, competence and care, impartiality, communication — regardless of how the analysis was produced. If the actuarial function does not lead AI governance in your firm, someone else will, and they may not see the assumption, lineage and documentation issues the way you do.

A pragmatic place to start

If your firm has done little so far, the next three months matter more than the next three years. Pick one workflow with clear inputs and a clear approval point — a quarterly reserving narrative, an assumption review pack, a regulator response. Run it twice in parallel for a month: once the way you do it today, once supported by an enterprise-safe agent under pattern 1 or 2 above with a named senior actuary as the supervisor. Measure cycle time, error rate and the quality of the final review meeting. That single comparison is more persuasive than any vendor presentation, and it teaches the team what good looks like in your specific context.

The promise of AI agents in actuarial work is not that the actuary disappears. It is the opposite. Routine work compresses, judgement work expands, and the actuary becomes more visibly accountable for what is decided and how it is communicated. The teams who treat agents as a forcing function for the engineering work they have been deferring — versioned data, lineage, evaluation harnesses — get the compounding return. The teams who bolt agents onto the legacy estate find, expensively, that the chaos amplifies.

If you want to scope what an enterprise-safe agent setup would look like for your firm, our Finance Modernisation practice and our Symphony Automate product cover the orchestration, evaluation and governance that production AI workflows actually require.