Case study: AI-native bookkeeping service

The brief: An outsourced accounting firm ran month-end close for roughly 600 SMB clients on a team of forty offshore bookkeepers. Margins were thin, the team turned over twice a year, and growth meant hiring — which meant onboarding, which meant errors. The founders did not want a bookkeeping copilot. They wanted the close to mostly do itself, with their people moved up to review and advisory.

What we found in the audit

The “service” was already a process: bank-feed categorisation, supplier-invoice matching, intercompany reconciliations, accruals, and a working-paper pack a senior signed before filing.
About 70% of transactions were repetitive and rules-driven — the same vendors, the same coding, month after month. The remaining 30% was where the judgement (and the errors) lived.
Quality was already defined: a senior reviewer re-checked junior work against a checklist. That checklist was, in effect, an eval rubric nobody had written down.

What we built

An eval set, written first. 12,000 historical transactions across forty clients, re-graded by two senior reviewers, covering categorisation, VAT treatment, and “should this have been flagged to a human.”
A categorisation and reconciliation agent. Routed pipeline — a cheap model for the repetitive 70%, a larger model gated by confidence for the ambiguous tail, deterministic rules for anything touching VAT or fixed assets. Direct integration into Xero and QuickBooks via their APIs, client-specific chart-of-accounts mappings included.
A working-paper generator. The agent assembles the month-end pack — reconciliations, supporting schedules, and a plain-English note on every judgement it made and every item it escalated.
A review console. Reviewers see only the exceptions and the agent’s reasoning, with one-click accept or correct. Every correction feeds the next eval run.
Shadow then switch. Every client ran in shadow for six weeks, graded against the human close, before any were moved over. Clients were then switched one cohort at a time.
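
The routing described for the categorisation agent can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the firm's implementation: the names (Transaction, Decision, route, RULES_ONLY_TAGS), the two thresholds, and the idea that the cheap model's own confidence decides when to escalate are all hypothetical. The source says only that a cheap model handles the repetitive work, a larger model is gated by confidence, and deterministic rules handle anything touching VAT or fixed assets.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumption: transactions arrive pre-tagged; these tags bypass the models entirely.
RULES_ONLY_TAGS = frozenset({"vat", "fixed_asset"})


@dataclass(frozen=True)
class Transaction:
    vendor: str
    amount: float
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Decision:
    account: Optional[str]  # None means no automatic coding was accepted
    route: str              # "rules", "small_model", "large_model", or "human"


# A model is any callable returning (account_code, confidence in [0, 1]).
Model = Callable[[Transaction], Tuple[str, float]]
# Deterministic rules return an account code, or None if no rule applies.
Rules = Callable[[Transaction], Optional[str]]


def route(
    txn: Transaction,
    rules: Rules,
    small_model: Model,
    large_model: Model,
    small_threshold: float = 0.95,  # hypothetical value
    large_threshold: float = 0.85,  # hypothetical value
) -> Decision:
    """Pick the cheapest path that clears its confidence bar, else escalate."""
    # VAT and fixed-asset items never reach a model.
    if txn.tags & RULES_ONLY_TAGS:
        account = rules(txn)
        if account is not None:
            return Decision(account, "rules")
        return Decision(None, "human")

    # Repetitive transactions: accept the cheap model only when it is confident.
    account, confidence = small_model(txn)
    if confidence >= small_threshold:
        return Decision(account, "small_model")

    # Ambiguous tail: try the larger model, still gated by confidence.
    account, confidence = large_model(txn)
    if confidence >= large_threshold:
        return Decision(account, "large_model")

    # Nothing cleared its bar, so the item becomes a reviewer exception.
    return Decision(None, "human")
```

In this sketch, anything that ends on the "human" route is what would show up in the review console as an exception. The thresholds would be tuned against the eval set rather than fixed by hand.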

What changed

Transactions auto-coded within the firm’s accuracy bar: 71% → 93%, with the rest routed to a human.
Average close time per client down from 5.5 hours of bookkeeper time to 1.2 hours of reviewer time.
Clients per reviewer rose from 15 to 48 without a drop in the firm’s internal QA score.
Restated filings in the two quarters after switchover: down 38% versus the same period the prior year.
The firm took on 140 new clients in the following two quarters without adding bookkeepers — the unit of growth stopped being a headcount.
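
The figures above can be combined into a rough capacity check. The sketch below is back-of-envelope arithmetic only. It assumes one close per client per month and treats the quoted averages as uniform across clients, neither of which the source states, and the function names are illustrative.

```python
import math

# Figures quoted in the case study.
CLIENTS_BEFORE = 600
NEW_CLIENTS = 140
HOURS_PER_CLOSE_BEFORE = 5.5   # bookkeeper hours per client
HOURS_PER_CLOSE_AFTER = 1.2    # reviewer hours per client
CLIENTS_PER_PERSON_BEFORE = 15
CLIENTS_PER_PERSON_AFTER = 48


def monthly_hours(clients: int, hours_per_close: float) -> float:
    """Total hands-on hours for one close per client (assumption: monthly)."""
    return clients * hours_per_close


def staff_needed(clients: int, clients_per_person: int) -> int:
    """Whole people needed to cover the client base at a given ratio."""
    return math.ceil(clients / clients_per_person)


clients_after = CLIENTS_BEFORE + NEW_CLIENTS

hours_before = monthly_hours(CLIENTS_BEFORE, HOURS_PER_CLOSE_BEFORE)
hours_after_same_base = monthly_hours(CLIENTS_BEFORE, HOURS_PER_CLOSE_AFTER)
hours_after_growth = monthly_hours(clients_after, HOURS_PER_CLOSE_AFTER)

staff_before = staff_needed(CLIENTS_BEFORE, CLIENTS_PER_PERSON_BEFORE)
staff_after_same_base = staff_needed(CLIENTS_BEFORE, CLIENTS_PER_PERSON_AFTER)
staff_after_growth = staff_needed(clients_after, CLIENTS_PER_PERSON_AFTER)
```

Under those assumptions, 600 clients at 15 per person works out to 40 people, which matches the original team of forty. At 48 per reviewer the same base needs 13 reviewers, and 740 clients need 16. Hands-on time falls from about 3,300 hours per close cycle to about 720, or about 888 with the 140 new clients included.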

What we left behind

An eval suite the senior reviewers now own and extend, the agent and its integrations running on the firm’s own accounts, and a back office that scales with volume instead of headcount. The forty bookkeepers were not laid off: twenty-six retrained as reviewers and client advisers on higher-margin work, and the rest were redeployed as the growing client base created work for them. The service is the agent. The people moved up.
