Case study: off-the-shelf agent rip-out

The brief: A B2B payments company had bought an off-the-shelf “AI support agent” eight months earlier. It demoed well. It went live in two weeks. By the time the company called us, the agent had become the single biggest source of operational chaos in the support organisation, and the CFO had just asked the head of operations to explain a £190k annual licence renewal.

What was actually happening

The agent was auto-replying to roughly 60% of inbound tickets. About a third of those replies were quietly wrong — confidently quoting fee structures from a country the company no longer operates in, or referencing product features that had been deprecated nine months earlier.
The agent wrote to the CRM. It set deal-stage fields. It tagged accounts. The vendor’s connector understood a generic Salesforce schema; the company had thirty-eight custom fields the connector ignored, several of which were what the sales team actually used. Records were silently degrading.
There was no replayable audit trail. When a customer complained about a wrong answer, the support team could see that the agent replied. They could not see why. The vendor’s logs were summarised, not raw.
The team had built a manual override workflow — a Slack channel where supervisors reviewed the agent’s outputs after the fact and re-replied to customers. The agent had created a second shift of work the company was now paying its support staff overtime to do.

The packaged agent was not poorly built. It was built for a generic problem. The company’s problem had stopped being generic the moment it expanded into three new jurisdictions and three new product lines.

What we did

A two-week readiness audit gave us the honest scope. We then ran a fourteen-week engagement: design, build, shadow, switch.

Wrote the eval set first. 480 historical tickets, regraded by the QA lead, the head of compliance, and a senior support manager. The rubric covered correctness, tone, and “did the agent route to the right human when it should have.”
Built a custom agent. Routed pipeline — cheap small model for the easy 80%, larger model gated by a confidence threshold for ambiguous tickets. Tool calls into the real CRM schema, including the thirty-eight custom fields. Hard rules around money, refunds, and anything jurisdiction-sensitive.
Traces from commit one. Every model call replayable. Every CRM write reversible. A read-only “explain this reply” link in the support tool so any supervisor could see exactly what the agent saw before it spoke.
Shadow mode for four weeks. The agent ran on every ticket but did not reply. The support team graded its drafts. We retuned the rubric and the routing thresholds against actual disagreement, not against the vendor’s accuracy claim.
Switched over a queue at a time. Behind feature flags. With the off-the-shelf agent still running on queues we had not migrated, in case we needed to roll back.
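
For readers who want to picture the routed pipeline described in the steps above, here is a minimal sketch rather than the production code. Every name in it (route, Draft, Decision, SENSITIVE_TERMS) and the 0.85 threshold are illustrative assumptions. It also assumes that the hard rules send sensitive tickets straight to a human, and that each model reports a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for the "hard rules": a crude substring check.
# The actual rules around money, refunds, and jurisdiction are not
# spelled out in the case study.
SENSITIVE_TERMS = ("refund", "chargeback", "fee", "vat")


@dataclass
class Draft:
    text: str
    confidence: float  # assumed range 0.0 to 1.0, reported by the model


@dataclass
class Decision:
    action: str  # "auto_reply" or "escalate"
    model: str   # "small", "large", or "none"
    text: str


# A model is anything that turns ticket text into a Draft.
ModelFn = Callable[[str], Draft]


def route(ticket: str, small: ModelFn, large: ModelFn,
          threshold: float = 0.85) -> Decision:
    """Decide who answers a ticket: small model, large model, or a human."""
    # Hard rules run before any model call.
    lowered = ticket.lower()
    if any(term in lowered for term in SENSITIVE_TERMS):
        return Decision("escalate", "none", "")

    # Easy tickets: the cheap model answers when it is confident enough.
    draft = small(ticket)
    if draft.confidence >= threshold:
        return Decision("auto_reply", "small", draft.text)

    # Ambiguous tickets: fall through to the larger model.
    draft = large(ticket)
    if draft.confidence >= threshold:
        return Decision("auto_reply", "large", draft.text)

    # Neither model is confident: hand the draft to a human.
    return Decision("escalate", "large", draft.text)
```

Putting the hard rules ahead of the model calls means a sensitive ticket never depends on a confidence score, and the threshold stays a single number that can be retuned against shadow-mode grading.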

What changed

Auto-reply accuracy measured against the in-house rubric: 67% → 94% on the same ticket population.
Median first-touch reply time down from 9 minutes to 1 minute 40 seconds.
Supervisor override workload dropped 71%. The second shift was retired.
Annual run-rate cost for the agent (model calls + Cravings retainer): £74k versus the £190k renewal it replaced. The off-the-shelf vendor was not renewed.
Compliance findings on the post-switchover audit: zero.
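
The cost line above implies a saving that is worth working through. The snippet below uses only the two figures quoted in this case study. The one-off cost of the fourteen-week engagement is not stated here, so it is left out of the calculation.

```python
# Figures quoted in the case study (GBP per year).
renewal_gbp = 190_000   # off-the-shelf licence renewal
run_rate_gbp = 74_000   # custom agent: model calls plus retainer

# Annual saving, in pounds and as a share of the renewal.
saving_gbp = renewal_gbp - run_rate_gbp
saving_pct = 100 * saving_gbp / renewal_gbp

print(f"Saving: £{saving_gbp:,} per year ({saving_pct:.0f}% of the renewal)")
# prints: Saving: £116,000 per year (61% of the renewal)
```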

What we left behind

The eval suite, the runbook, the architecture documentation, and a custom agent the in-house team now owns. Two of the support engineers we retrained on the system have since become the company’s first AI engineers. The thirty-eight custom CRM fields are still there. The agent uses them.
