Insurance AI Explainability: Why Architecture Decides Regulatory Outcomes
Insurance AI programs are running into a specific kind of regulatory difficulty that model performance metrics cannot explain. Carriers are presenting AI systems with strong accuracy scores, low false-positive rates, and documented training data governance, and still receiving remediation requests from state regulators and internal audit teams asking for something the systems were not built to produce: a decision record that a non-technical reviewer can follow from input to output without engineering support.
That is an insurance AI explainability problem. It is also a design problem. And by now, the industry has enough case law and regulatory correspondence to know they are the same problem.
The carriers clearing these reviews without significant rework made one design decision early that most programs deferred: they treated explainability as a constraint on how the decision architecture was built, not as a reporting feature added after the model was deployed. The downstream consequences of that distinction are now visible in audit records, regulatory correspondence, federal court filings, and remediation timelines across the industry.
The Lawsuits That Reframed the Conversation
The explainability problem in insurance AI became impossible to dismiss as a theoretical compliance concern when it started generating litigation. Two cases in particular have reshaped how the industry thinks about what it means to produce a defensible decision record.
In Estate of Lokken v. UnitedHealth Group, plaintiffs alleged that UnitedHealth’s nH Predict AI tool—developed by its naviHealth subsidiary—routinely overrode physicians’ decisions to deny post-acute care for Medicare Advantage members. According to the complaint, the tool carried a 90% error rate: nine out of ten denied claims were ultimately reversed on appeal. Despite this, UnitedHealth continued using it. The court denied the insurer’s attempt to narrow discovery in September 2025, and in March 2026 ordered UnitedHealth to disclose the algorithm itself.
Cigna faced parallel scrutiny over its PxDx algorithm. A ProPublica investigation found that Cigna doctors denied more than 300,000 claims over two months in 2022, spending an average of 1.2 seconds reviewing each one. A federal court in California allowed the resulting class action to proceed in March 2025.
Both cases turn on explainability: specifically, on the insurer’s inability to produce a coherent, individual-level record of how each decision was made. As the Lokken discovery ruling makes clear, courts are now allowing broad inquiry into the role of AI in claim processing, and carriers that can only produce a general description of how the system works are in a categorically different litigation position from those that can produce a clean, interpretable record of what happened in a specific case.
The principles behind these health insurance cases are transferable to every line of business where AI influences a decision that affects a policyholder.
Where Insurance AI Explainability Reviews Are Breaking Down
The pressure point varies by line of business, but the failure mode is consistent.
In claims, regulators and internal audit teams want to know why a specific claim was routed, flagged, or denied. The question is not aggregate; it is about an individual decision, a specific claimant, a particular date. An AI system that can show overall routing accuracy at 92% is not answering that question. The answer requires a record of what inputs the model received, which rules or patterns it applied, and what confidence threshold triggered that specific output. Most deployed systems were not built to produce that record per decision. They were built to produce the decision alone.
In underwriting, the pressure comes from a different direction. Models that influence risk selection, pricing, or declination decisions are subject to state-level requirements—and in some jurisdictions, NAIC model law guidance—that require documented reasoning at the individual policy level. Carriers using ensemble models or neural approaches face a particular challenge here: aggregate feature importance scores, which many systems log as their explainability layer, do not satisfy a question about why a specific application was assessed at a specific risk tier.
The gap between what regulators are asking for and what most systems were built to provide has widened since December 2023, when the NAIC adopted its Model Bulletin on the Use of Artificial Intelligence Systems by Insurers at its Fall National Meeting. The bulletin puts individual-level decision documentation into direct regulatory language, reminding carriers that AI-supported decisions must comply with all applicable insurance laws, including unfair trade practices statutes, and advising insurers of the documentation a state department may request during examination. By late 2025, 23 states and Washington, D.C., had adopted the bulletin in full or substantially similar form, with an NAIC model law on third-party AI oversight anticipated in 2026.
The carriers working through examinations now are discovering that systems deployed before 2023 were built to documentation standards that no longer satisfy the examination process.
Why Explainability as a Reporting Layer Always Fails This Test
The architectural choice that determines whether an insurance AI program survives regulatory scrutiny is not which model family was used but whether explainability was designed as a constraint or appended as a feature.
When explainability is a constraint, it shapes the decision architecture from the outset. The model must produce, alongside every output, a structured record of its reasoning, logged at the moment of decision, because that is the only moment when the full decision context exists. The record travels with the output through whatever downstream system consumes it. It uses business-domain language and is queryable by case ID, by date range, and by the input factors that influenced the outcome.
When explainability is a reporting layer, the system produces decisions and documentation separately. The documentation is generated by querying the model after the fact, applying post-hoc interpretability methods, or constructing audit logs from system records that were never designed to capture decision reasoning.
Post-hoc reconstruction is not explainability. It is an attempt to reverse-engineer something that was never recorded — and regulators who understand how these systems work are increasingly recognizing the difference.
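As a minimal sketch of the constraint-based approach, consider a claims-routing decision service in Python. The field names, the 0.80 threshold, and the write-once JSONL log are illustrative assumptions, not a prescribed schema; the point is that the decision and its reasoning record are produced together, at the moment of decision.

```python
# Illustrative sketch only: field names, the threshold, and the scoring interface
# are assumptions, not a prescribed schema or any specific vendor's API.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    case_id: str        # queryable key for compliance, examiners, and counsel
    decided_at: str     # ISO timestamp captured at the moment of decision
    inputs: dict        # the exact inputs the model received for this case
    factors: list       # business-domain factors that influenced the outcome
    confidence: float   # model confidence for this specific output
    threshold: float    # the threshold that triggered this routing outcome
    outcome: str        # the decision itself, in business language
    summary: str        # plain-language reasoning a reviewer can follow


def route_claim(claim: dict, score: float) -> tuple[str, DecisionRecord]:
    """Produce the decision and its reasoning record as one co-produced output."""
    outcome = "manual review" if score >= 0.80 else "straight-through processing"
    record = DecisionRecord(
        case_id=claim["claim_id"],
        decided_at=datetime.now(timezone.utc).isoformat(),
        inputs=claim,
        factors=claim.get("flagged_factors", []),   # e.g. prior-claim frequency
        confidence=round(score, 3),
        threshold=0.80,
        outcome=outcome,
        summary=(
            f"Claim {claim['claim_id']} routed to {outcome} because model "
            f"confidence {score:.2f} was compared against the 0.80 review threshold."
        ),
    )
    # Append to a write-once log; a production system would use a governed,
    # retention-managed store rather than a local file.
    with open("decision_log.jsonl", "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record)) + "\n")
    return outcome, record
```

The record is written in the same call that produces the routing outcome, which is what makes it retrievable later without reconstructing anything.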
The Lokken ruling is instructive here: the court found that insurers cannot simply invoke trade secrecy to shield their algorithms from legal scrutiny, reinforcing a growing consensus that when an insurer’s own plan documents promise physician-led review, replacing that review with an AI tool without producing a defensible individual record may constitute a breach of contract.
The practical consequence is this: a carrier that built explainability as a reporting layer cannot fully satisfy a regulatory examination request for an individual decision record. Satisfying the request requires either manual reconstruction, which is expensive and unreliable at scale, or acknowledgment that the documentation standard is not being met, which is what generates the remediation order.
The broader operational challenge behind AI explainability is explored in Fulcrum Digital’s latest publication, The Enterprise AI Operating Manual.
How the Gap Surfaces Differently Across Lines of Business
The underlying design failure is consistent. Where it becomes visible depends on the line.
In personal auto, the most immediate pressure comes from adverse action notice requirements. Federal and state fair lending and insurance regulations require that when a model influences a coverage decision adversely affecting a consumer, the carrier must provide a reason—in plain language, specific to that consumer’s situation. Systems that rely on generic reason codes, or that cannot trace adverse action back to specific input factors at the individual level, are generating adverse action notices that do not satisfy the requirement. New York’s Department of Financial Services Circular Letter No. 7 addresses this directly, requiring that insurers be able to articulate how data sources and model variables relate logically to an individual insured’s risk. At all times, not just during examination.
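A minimal sketch of what that individual-level translation can look like, assuming hypothetical factor names and a compliance-approved reason library; the actual factors, the method for computing per-applicant contributions, and the consumer-facing language are for the carrier’s compliance and legal teams to define.

```python
# Hypothetical example: factor names and reason phrasings are placeholders.
# The point is the shape of the artifact: reasons tied to the specific factors
# that drove this consumer's outcome, not generic codes.

# Maintained jointly by the model team and compliance, and versioned.
REASON_LIBRARY = {
    "prior_at_fault_accidents": "Number of at-fault accidents in the past 36 months",
    "lapse_in_coverage": "A lapse in continuous insurance coverage",
    "annual_mileage": "Reported annual mileage for the insured vehicle",
}


def adverse_action_reasons(contributions: dict[str, float], top_n: int = 3) -> list[str]:
    """Translate this applicant's top contributing factors into plain-language reasons.

    `contributions` maps factor names to their contribution to this individual
    decision (however the model team computes that), not aggregate importances.
    """
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    reasons = []
    for factor, _weight in ranked[:top_n]:
        if factor not in REASON_LIBRARY:
            # An unmapped factor is a compliance gap, not something to silently skip.
            raise KeyError(f"No approved consumer-facing language for factor: {factor}")
        reasons.append(REASON_LIBRARY[factor])
    return reasons


# Example: factors and weights for one specific applicant, not the portfolio.
print(adverse_action_reasons({"lapse_in_coverage": 0.41, "annual_mileage": 0.22}))
```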
In commercial lines, the pressure point shifts to the underwriting file. Underwriters are expected to document the basis for risk assessment decisions, and when AI systems contribute to those decisions, the documentation standard extends to the model’s contribution. An underwriting file that says “AI-assisted” without documenting what the AI assessed and on what basis is a compliance exposure. The systems that create that exposure are the ones where the model produces a score and the underwriter applies it, with no machine-readable record of the factors the model weighted.
In claims, the exposure is often identified during litigation rather than regulatory examination. Courts and regulators are increasingly demanding explainability—the ability to understand and justify how an AI system arrived at a particular outcome—and failure to provide such explanations can undermine an insurer’s defense and risk regulatory penalties. California’s SB 1120, which took effect January 1, 2025, prohibits adverse benefit determinations based solely on an algorithm and requires individual review by a licensed clinician for any determination that affects medically necessary treatment. The principle, that automated decisions affecting policyholders require a human-reviewable record and, in high-stakes situations, actual human judgment, is extending across lines and jurisdictions.
Design Decisions That Separate Defensible Programs From Vulnerable Ones
Programs that are holding up under examination share a set of design decisions made before deployment. The timing of these decisions matters as much as the decisions themselves. Retro-fitting them after deployment is structurally difficult, expensive, and often incomplete.
Build decision logging as a first-class output, not a secondary process: Every AI decision should produce two outputs: the decision itself and a structured record of the reasoning behind it. That record needs to be human-readable without engineering support, meaning it uses business-domain language, not model feature names. It should be queryable by individual case ID, by date range, and by the input factors that influenced the decision. Systems that produce decisions without simultaneously producing this record cannot be made compliant through reporting alone.
Define the explainability standard before the model is selected: The model architecture that performs best in testing may not be the model architecture that satisfies the explainability requirement. An ensemble approach that yields two points of improvement in accuracy at the cost of interpretability may be the wrong trade-off for a carrier operating under examination guidance that requires individual-level decision documentation. This is a business constraint; it should be set by the compliance and legal functions before the model selection process begins.
Assign ownership of the decision record as explicitly as ownership of the model: In most programs, the model team owns model performance. Nobody owns the decision record. This is the gap where compliance exposure accumulates. The decision record has its own lifecycle: it needs to be produced accurately, stored in a retrievable format, retained for the required period, and made available to examiners and litigation counsel on request. That is a governed process that requires an owner, a documented standard, and a quality check. It does not manage itself.
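Tying the first of these to the retrieval questions in the readiness assessment below, the following sketch shows what compliance-side access to the decision log from the earlier example could look like. File paths and field names are assumptions; a production implementation would query a governed store with access controls and retention management rather than a local file.

```python
# Illustrative only: assumes the JSONL decision log sketched earlier.
import json


def load_records(path: str = "decision_log.jsonl") -> list[dict]:
    """Read every logged decision record; the log is append-only by design."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log]


def by_case_id(case_id: str, path: str = "decision_log.jsonl") -> list[dict]:
    """Everything the examiner asked for about one case, without engineering support."""
    return [r for r in load_records(path) if r["case_id"] == case_id]


def by_date_range(start_date: str, end_date: str, path: str = "decision_log.jsonl") -> list[dict]:
    """All decision records in a period, e.g. the scope of a market conduct exam.

    Dates are ISO strings like "2025-01-01"; records carry UTC timestamps,
    so lexical comparison on the date prefix is sufficient here.
    """
    return [
        r for r in load_records(path)
        if start_date <= r["decided_at"][:10] <= end_date
    ]
```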
Further reading: Our overview of AI governance policy explores how insurers translate explainability, oversight, and accountability requirements into enforceable operational controls.
An Explainability Architecture Readiness Assessment
Before any insurance AI deployment moves to production, these questions should have specific answers. Where the answer is a plan rather than a protocol, that is a gap and it will surface under examination conditions.
| Question | What the answer reveals |
| --- | --- |
| Can compliance retrieve a complete decision record for any individual case by case ID, without engineering involvement? | Whether explainability is built into the architecture or dependent on system access |
| Does the decision record use business-domain language a non-technical reviewer can follow without a glossary? | Whether the system was built for operational use or for internal model documentation |
| Was the explainability standard — what the record must contain — defined before model selection? | Whether explainability is a constraint on the system design or an afterthought |
| Is there a named individual who owns the decision record as a compliance artifact, with defined retention and retrieval standards? | Whether decision documentation is governed or assumed |
| Has the explainability output been tested against the actual examination standard used by the relevant state departments? | Whether examination readiness has been verified or merely assumed |
Where all five have specific answers, the program has an explainability architecture. Where two or three are still open, the program has a documentation layer that will not satisfy examination and the gap will not become visible until the examination begins.
Insurance AI programs that pass every internal accuracy benchmark and still struggle under regulatory review share a common characteristic: they were built to produce decisions. The carriers coming out of these reviews without remediation orders have designed explainability into the decision chain before the first deployment went live. Not as an audit trail added afterward, but as a constraint the architecture was built around. That design decision, made at the start, is what separates a defensible AI program from one that works until someone asks it to explain itself.
Frequently Asked Questions
What is insurance AI explainability and why do regulators require it?
Insurance AI explainability is the ability to produce a clear, decision-specific record of how an AI model arrived at a particular output—a claims routing decision, an underwriting assessment, or an adverse action—in language a non-technical reviewer can follow. State regulators require it because AI decisions that affect policyholders are subject to the same documentation standards as decisions made by human underwriters and adjusters. The NAIC’s 2023 Model Bulletin on the Use of Artificial Intelligence Systems by Insurers established individual-level decision documentation as a baseline expectation, and several state insurance departments have incorporated it into examination guidance.
What is the difference between post-hoc explainability and decision-architecture explainability?
Post-hoc explainability is generated after a model has produced an output, by applying interpretability methods to the trained model or reconstructing the decision from system logs. Decision-architecture explainability is produced at the moment of decision, as a logged artifact that travels with the output. The practical difference matters under examination: post-hoc methods produce approximate explanations of how the model generally behaves; decision-architecture logging produces an exact record of what happened in a specific case. Regulators asking about individual decisions require the latter, and post-hoc reconstruction cannot fully replicate it.
Which lines of business face the most immediate regulatory exposure on explainability?
Personal auto and homeowners lines face immediate pressure from adverse action notice requirements and state-level examination guidance that has become more specific since 2023. Commercial lines face growing pressure as AI influences underwriting file documentation standards. Claims handling across lines faces exposure through both regulatory examination and litigation discovery. The carriers with the highest current exposure are those that deployed AI-assisted decision systems between 2019 and 2022 using documentation standards that predate current examination guidance.
How long does it take to retrofit explainability into a deployed insurance AI system?
Programs that attempt to retrofit explainability after deployment consistently underestimate the timeline. Where the system was not built to log decision reasoning at the point of output, satisfying the requirement means rebuilding the decision chain, which is effectively rebuilding the system. Carriers that have gone through this process report timelines of six to eighteen months and costs that exceed the original deployment budget. The programs that avoid remediation treat explainability as a deployment prerequisite rather than a post-deployment compliance project.
How does FD Ryze® address insurance AI explainability requirements?
FD Ryze® was built with structured decision logging as a core output requirement, not a reporting add-on. Every decision the platform produces generates a co-produced record: the inputs received, the pathway through the decision logic, the confidence threshold applied, and a business-domain summary a compliance reviewer can read without technical translation. The platform architecture reflects the documentation standards current in state examination guidance and was designed to satisfy adverse action notice requirements at the individual decision level. Over 100 enterprise insurance deployments informed that design; most of the explainability failure modes it addresses are patterns drawn from programs that encountered them under examination.
Key Takeaways
- Insurance AI explainability failures are architecture failures: the gap between what regulators require and what most systems produce was created at the design stage.
- The NAIC’s 2023 Model Bulletin on the Use of Artificial Intelligence Systems by Insurers established individual-level decision documentation as a regulatory baseline; state examination guidance is increasingly specific about what that requires.
- Post-hoc explainability methods cannot fully satisfy examination requests for individual decision records; the information was not captured in the required form at the moment of decision.
- The design decisions that prevent regulatory exposure are: build decision logging as a first-class output, set the explainability standard before model selection, and assign named ownership of the decision record.
- The Explainability Architecture Readiness Assessment provides five questions that should have specific answers before any insurance AI deployment moves to production.