Quality control on AI output for numbers

Quality control on AI output for numbers means that for every numerical output you know how it was produced, whether it is reproducible, and which failure mode might apply. For finance teams that does not mean spot-checking every figure. It means architectural choices (deterministic code for calculations, AI only for commentary) plus systematic sanity checks where it really counts.

A single prompt asking an AI for a variance analysis on a spreadsheet you just pasted is something you check by hand. You read the answer, you recognize the numbers, done. But the moment the workflow has four steps (bank transaction → categorization → posting proposal → commentary) and runs partly through an agent, that reflex disappears. The output looks fine, so it gets approved. That is exactly where errors in AI workflows on numbers accumulate, and in finance those errors have an unpleasant property: they often surface only six months later, in an audit or in a discrepancy nobody can explain.

Quality control on numeric AI output isn't a luxury. Without it, scalable output becomes scalable mess — and in finance, scalable audit findings.

Why numeric work is extra vulnerable

LLMs are language models. They produce plausible text, and they handle numbers largely as tokens — not as computational units with validation rules. That has three consequences that hit finance especially hard.

Numeric hallucinations look correct. A model saying "the margin is 23%" sounds confident, even if the percentage is invented. A model proposing a journal entry of €87,231 where the real amount is €8,723 says it in the same tone. The familiar hallucinations — an invented legal text, a non-existent source — are relatively easy to spot. Numeric hallucinations are not.

Cumulative errors in chains. In a chain — pull bank transactions → match → posting proposal → commentary — a small error in step 1 amplifies in step 3. A misread currency symbol (USD vs EUR), a digit error in a ledger account, an unstated assumption: you only see it in the final result, where it is hard to trace.

Outdated knowledge on rules and rates. VAT rates, thresholds, tax rules, RJ and IFRS updates: these change, and models have a training cutoff. An AI still working from a two-year-old VAT rate, or citing a deduction that expired in 2025, produces useful-looking work with a fundamental flaw.
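One way to keep a model's training cutoff out of the numbers is to look rates up in a maintained, dated table and treat any rate the model mentions as a claim to be checked. The sketch below is a minimal illustration: `RATE_TABLE`, `rate_on`, and `model_rate_is_current` are hypothetical names, and the rates are placeholder values, not real VAT rates.

```python
from datetime import date
from decimal import Decimal

# Placeholder values for illustration only; NOT real tax rates.
# Entries are (effective_from, rate), sorted by effective date.
RATE_TABLE = [
    (date(2020, 1, 1), Decimal("0.20")),
    (date(2025, 1, 1), Decimal("0.22")),
]


def rate_on(day: date) -> Decimal:
    """Return the rate in force on the given day, taken from the maintained table."""
    applicable = [rate for start, rate in RATE_TABLE if start <= day]
    if not applicable:
        raise ValueError(f"no rate defined for {day}")
    return applicable[-1]


def model_rate_is_current(model_rate: Decimal, day: date) -> bool:
    """Check a rate cited by the model against the table for the relevant date."""
    return model_rate == rate_on(day)


# A model citing the older placeholder rate for an April 2026 transaction fails the check.
print(model_rate_is_current(Decimal("0.20"), date(2026, 4, 30)))
```

The design choice is that the table, not the model, is the source of truth; updating a rate is then a data change with its own review, not a prompt change.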

Four levels of control

1. Checkpoints per step

First rule: validate every link before continuing. Practically for finance:

Read the output of step N with the question "could a colleague without context continue from here?" If no, the handoff is wrong and the next step will likely go off the rails.
Check the hard facts. Amounts, periods, currencies, ledger accounts, customer names. A model rarely hallucinates something obviously wrong; it hallucinates near-correct details — €87,231 instead of €87,213, or the right customer but the wrong invoice.
Check the assumptions. Has the model assumed something not in your question? For example a period assumption, a ledger exclusion, a currency conversion. Assumptions belong in the open; otherwise they're hallucinations in a nice suit.

A chain is deliberately modular to make debugging easier. Use that. If step 3 misbehaves, rerun step 3, not the whole chain.
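The check on hard facts can be partly mechanical. The following is a minimal sketch, assuming a hypothetical `PostingProposal` shape and hypothetical reference sets (`KNOWN_LEDGER_ACCOUNTS`, `ALLOWED_CURRENCIES`); it is not tied to any particular accounting package.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical reference data; in practice this would come from the books.
KNOWN_LEDGER_ACCOUNTS = {"4000", "4100", "8000"}
ALLOWED_CURRENCIES = {"EUR", "USD"}


@dataclass(frozen=True)
class PostingProposal:
    ledger_account: str
    amount: Decimal
    currency: str
    period: str  # e.g. "2026-04"
    source_amount: Decimal  # amount on the underlying bank transaction


def checkpoint(proposal: PostingProposal, expected_period: str) -> list[str]:
    """Return a list of problems; an empty list means the step may hand off."""
    problems = []
    if proposal.ledger_account not in KNOWN_LEDGER_ACCOUNTS:
        problems.append(f"unknown ledger account {proposal.ledger_account}")
    if proposal.currency not in ALLOWED_CURRENCIES:
        problems.append(f"unexpected currency {proposal.currency}")
    if proposal.period != expected_period:
        problems.append(f"period {proposal.period} != {expected_period}")
    if proposal.amount != proposal.source_amount:
        problems.append(
            f"amount {proposal.amount} differs from source {proposal.source_amount}"
        )
    return problems


# A near-correct amount (two digits swapped) is caught before the next step runs.
proposal = PostingProposal(
    "4000", Decimal("87231.00"), "EUR", "2026-04", Decimal("87213.00")
)
print(checkpoint(proposal, "2026-04"))
```

Because the checkpoint returns the problems instead of raising on the first one, a reviewer sees everything wrong with one step at once and can rerun only that step.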

2. Structure checks at handoffs

Between two models or tools, the handoff is the most dangerous point. Three helpers that work in finance:

Context packets. Have every step end its output with a "Context for next step" block: which period, which tenant, which filters, which assumptions, which open items. That way you see immediately what gets passed on.
Round-trip test. Have the output of step 1 summarized by a second model as if it were the brief. Does the summary match what step 2 is going to do? If no: step 1 wasn't clear enough.
Explicit goal repetition. Repeat briefly at the start of each step what the goal of the whole chain is — for example "we're closing April for tenant X, the final result is a board pack with variance commentary." Without that repetition, a model in step 3 loses sight of the whole.
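A context packet and the goal repetition can share one small structure that every step fills in and the next step reads. This is a sketch under assumptions: the `ContextPacket` class, its field names, and `missing_fields` are hypothetical and simply mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContextPacket:
    goal: str  # goal of the whole chain, repeated at every step
    period: str
    tenant: str
    filters: dict[str, str] = field(default_factory=dict)
    assumptions: list[str] = field(default_factory=list)
    open_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the packet as the 'Context for next step' block."""
        return "Context for next step\n" + json.dumps(asdict(self), indent=2)


def missing_fields(packet: ContextPacket) -> list[str]:
    """Names of required fields that were left empty."""
    return [name for name in ("goal", "period", "tenant") if not getattr(packet, name)]


packet = ContextPacket(
    goal="Close April for tenant X; final result is a board pack with variance commentary",
    period="2026-04",
    tenant="",  # left empty on purpose to show the handoff check
    assumptions=["amounts in EUR"],
)
print(missing_fields(packet))
print(packet.render())
```

Refusing the handoff when `missing_fields` is non-empty turns an invisible assumption into a visible stop.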

3. Numeric verification

Here the finance version is stricter than general QC. Four disciplines that work:

Always ask for the calculation, not just the answer. Not "the margin is 23%," but "the margin is 23% (gross profit €460,000 / revenue €2,000,000)." You can check the calculation in a second. This is by far the cheapest quality check and the most often skipped.
Always cite the source. Which ledger account, which period, which filter? A commentary "operating costs rose 12%" is unusable without "operating costs = ledger 4000-4999, April 2026 vs March 2026." Source citation makes the check executable.
Let a second model recompute. Add a numeric verifier step: after the main model has done the analysis, ask another model (or the same model with a different instruction) to replicate the calculations from the source data. If the results diverge, the work goes back. This catches most numeric hallucinations.
Cross-check against the books. Amounts mentioned by the AI must match what is in Exact or the tax software. Totals can be checked programmatically; breakdowns often need a manual sample.
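Once the model states its calculation, recomputing it is deterministic code, not another prompt. The sketch below reuses the margin example above; `verify_ratio`, `matches_books`, and the tolerance value are assumptions made for illustration.

```python
from decimal import Decimal


def verify_ratio(
    claimed_pct: Decimal,
    numerator: Decimal,
    denominator: Decimal,
    tolerance_pct: Decimal = Decimal("0.05"),
) -> bool:
    """True if the claimed percentage matches numerator / denominator."""
    if denominator == 0:
        return False
    recomputed = numerator / denominator * 100
    return abs(recomputed - claimed_pct) <= tolerance_pct


def matches_books(claimed: Decimal, booked: Decimal) -> bool:
    """Totals must match the books exactly, compared to the cent."""
    cent = Decimal("0.01")
    return claimed.quantize(cent) == booked.quantize(cent)


# "The margin is 23% (gross profit 460,000 / revenue 2,000,000)"
print(verify_ratio(Decimal("23"), Decimal("460000"), Decimal("2000000")))
# A digit slip in the gross profit makes the same claim fail.
print(verify_ratio(Decimal("23"), Decimal("406000"), Decimal("2000000")))
```

`Decimal` is used instead of `float` so that amounts compare exactly; the tolerance applies only to the rounded percentage the model reports.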

4. Final review by a human

The final step of every non-trivial chain is human review. Not "scroll and nod," but:

Ask a counter-question. "What could be wrong here?" — and answer it before passing the piece on.
Re-read the final result independently. Not in comparison to the AI output, but from the receiver's perspective (supervisory board, auditor, bank). Does the story hold up apart from the process?
Recompute one number. Pick a random number from the final report and recompute it yourself from source. If it matches, there is a good chance the other numbers do too. If it doesn't, something is wrong and worth investigating further.

The difference between "correct" and "right"

A subtle pitfall in finance reporting: AI output can be numerically correct and still not right. A variance analysis can contain correct numbers but answer the wrong question — explaining the gap against budget when the gap against last year was the interesting one. A commentary can be neatly phrased but in the wrong tone for your investor. A board memo can be logically sound but miss the crucial qualitative nuance the CFO wanted to convey.

Quality control on finance AI output is therefore always twofold:

Correctness. Are the numbers right, the calculation, the sources, the regulation cited?
Fit. Does it match the brief, the audience, the phase in the quarter, the expectation previously communicated?

The first can be partly automated (numeric verifier, source verification). The second almost always requires a human with domain and situational knowledge. Don't trust an agent that claims to do the second.

When to take the human out, when not — finance rules of thumb

Internal, reversible, no posting → automate. A tag on an inbound invoice, your own summary, a match proposal still in the queue.
Posting, payment, external communication, tax return → always human review. Non-negotiable — even at high match confidence from the model.
Grey zone (reports for internal use, proposals not yet final) → automate with an explicit stop condition. Run the chain, generate the proposal, but put it in a review queue. That is not less automation; it is automation with a built-in check.
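These rules of thumb can be written down as a routing function so that the decision does not depend on who is running the chain that day. A minimal sketch; the action names and the `Route` enum are hypothetical labels for the categories above.

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"
    REVIEW_QUEUE = "review_queue"
    HUMAN_REVIEW = "human_review"


# Hypothetical action categories following the rules of thumb above.
ALWAYS_HUMAN = {"posting", "payment", "external_communication", "tax_return"}
SAFE_TO_AUTOMATE = {"invoice_tag", "own_summary", "match_proposal"}


def route(action: str) -> Route:
    """Route by action category only.

    Model confidence is deliberately not a parameter: a high match
    confidence never overrides the category.
    """
    if action in ALWAYS_HUMAN:
        return Route.HUMAN_REVIEW
    if action in SAFE_TO_AUTOMATE:
        return Route.AUTOMATE
    # Grey zone and anything unrecognized: run the chain, but stop before release.
    return Route.REVIEW_QUEUE


print(route("payment"), route("invoice_tag"), route("internal_report"))
```

Unknown actions fall into the review queue by default, so a new action type is never automated by accident.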

Agents — extra vulnerability

An agent planning and acting on its own with access to your accounting system is a chain that also reaches into the outside world. Quality control here becomes its own discipline:

Action whitelist. Which actions can the agent take autonomously (fetch information, draft a proposal), which require approval (post, pay, send mail)? "Execute a posting" is on approval by default; "propose a match" can run autonomously.
Audit log. Record every agent decision and action with timestamp, input, and outcome. Without a log, debugging is impossible and the external auditor cannot attest to your internal controls.
Sanity loops. Have the agent periodically self-check: "Is this what the user meant?" Self-reflection isn't a replacement for human review but catches drift.
Circuit breaker. Automatic stop on anomalous patterns: too many actions in a short time, an amount one order of magnitude above normal, an unexpected endpoint. Standard practice in software operations, and now in finance too.
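The action whitelist and the circuit breaker are small enough to sketch. The action names, the `CircuitBreaker` class, and its thresholds below are assumptions for illustration; "one order of magnitude above normal" is read here as ten times a configured typical amount.

```python
import time
from collections import deque
from decimal import Decimal
from typing import Optional

# Hypothetical action names; everything else needs approval.
AUTONOMOUS_ACTIONS = {"fetch_information", "draft_proposal", "propose_match"}


def needs_approval(action: str) -> bool:
    """Anything not explicitly allowed to run autonomously needs approval."""
    return action not in AUTONOMOUS_ACTIONS


class CircuitBreaker:
    def __init__(self, max_actions: int, window_seconds: float, typical_amount: Decimal):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.typical_amount = typical_amount
        self.timestamps: deque = deque()
        self.tripped = False

    def allow(self, amount: Decimal, now: Optional[float] = None) -> bool:
        """Record one action; return False (and stay tripped) on an anomaly."""
        if self.tripped:
            return False
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        self.timestamps.append(now)
        too_fast = len(self.timestamps) > self.max_actions
        too_large = amount >= self.typical_amount * 10
        if too_fast or too_large:
            self.tripped = True
        return not self.tripped


breaker = CircuitBreaker(max_actions=3, window_seconds=60.0, typical_amount=Decimal("1000"))
print(breaker.allow(Decimal("900"), now=0.0))    # normal action
print(breaker.allow(Decimal("10000"), now=1.0))  # ten times the typical amount
print(breaker.allow(Decimal("900"), now=2.0))    # stays blocked until a human resets it
```

The breaker has no automatic reset on purpose: once it trips, resuming is a human decision.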

Audit grade — what this means for control work

Quality control and audit evidence overlap heavily in finance. A checkpoint log that records per step what the input was, which model was used, what answer came out, and who approved it, is at the same time the material an external auditor or internal audit needs to test the operation of internal control around AI use. Two birds, one stone: invest in QC infrastructure that already meets audit requirements (retention, immutability, traceability to a person). Then you don't have to bolt an audit layer on top later.
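A checkpoint log with the fields named above can be made tamper-evident by chaining hashes. The text asks only for immutability and traceability; hash chaining is one common way to approximate that in application code and is an assumption here, as are the function and field names. Real retention and immutability would still depend on where the log is stored.

```python
import hashlib
import json
from datetime import datetime, timezone


def append_entry(log: list[dict], step: str, model: str, input_text: str,
                 output_text: str, approved_by: str) -> dict:
    """Append a checkpoint record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else ""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "model": model,
        "input": input_text,
        "output": output_text,
        "approved_by": approved_by,  # traceability to a person
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry or a gap in the middle breaks the chain."""
    prev_hash = ""
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if record["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


log: list[dict] = []
append_entry(log, "categorization", "model-a", "bank line 17", "ledger 4000", "j.doe")
append_entry(log, "posting proposal", "model-a", "ledger 4000", "EUR 87,213", "j.doe")
print(verify_chain(log))
```

Note that the chain detects edits and missing middle entries, not truncation at the end; storing the latest hash somewhere separate would cover that case.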

Saldus in practice

In Saldus a numeric verifier layer is built into the Q&A flow by default: every answer containing an amount is checked for traceability to a specific source in the books. If the amount doesn't come from a validated source, the verifier blocks the answer and the agent has to retry or hand off to a human. Every tool call (which data was retrieved from Exact, with which filters) is logged in an audit trail. This doesn't remove human review, which stays the final check, but it raises the bar that wrong numbers must clear to slip through unseen.