How a Finance-Copilot Vendor Lifted NRR to 144%

A finance copilot stalled at 60% accuracy was losing deals to AI-native rivals. An eval harness pushed accuracy to 95% and cleared Fortune 500 security — lifting NRR to 144%.

144% NRR, lifted from an 82% baseline

Implementation time: 4–8 weeks

Unmukt Raizada

Founder & CEO

TrustEvals

The Challenge

An AI-native FP&A software company sold a natural-language finance copilot to finance teams, including those at Fortune 500 companies. The copilot stalled at ~60% accuracy. Incorrect responses were costing the company deals, and prospects were choosing platforms they saw as more AI-native.

What They Built

A finance copilot that answers questions about a company's numbers — grounded in the customer's actual financial data and financial modeling scenarios.

The team gave it a precise map of that data, a Semantic Data Dictionary. When someone asks a question, the system first works out what kind of question it is, then answers using rules and answer templates built specifically for that question type. These templates serve as the fast path.
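The classify-then-template flow can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the dictionary entries, the table and column names, the keyword rules standing in for the real classifier, and the `answer_plan` helper are hypothetical, not the vendor's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricEntry:
    """One hypothetical dictionary entry: where a metric lives and what it means."""
    table: str
    column: str
    description: str


# Hypothetical semantic data dictionary (names are illustrative only).
DATA_DICTIONARY = {
    "revenue": MetricEntry("gl_actuals", "revenue_usd", "Recognized revenue in USD"),
    "opex": MetricEntry("gl_actuals", "opex_usd", "Operating expenses in USD"),
}

# Simple keyword rules standing in for whatever classifier a real system uses.
FORECAST_HINTS = ("forecast", "what if", "what-if")


def classify(question: str) -> str:
    """Return the question type: 'forecast', 'metric_lookup', or 'unknown'."""
    q = question.lower()
    if any(hint in q for hint in FORECAST_HINTS):
        return "forecast"
    if any(metric in q for metric in DATA_DICTIONARY):
        return "metric_lookup"
    return "unknown"


def answer_plan(question: str) -> dict:
    """Pick a route for the question; routine lookups get a templated query."""
    kind = classify(question)
    if kind == "metric_lookup":
        metric = next(m for m in DATA_DICTIONARY if m in question.lower())
        entry = DATA_DICTIONARY[metric]
        # Identifiers come from the dictionary, never from user text;
        # the period is left as a bound parameter.
        sql = f"SELECT SUM({entry.column}) FROM {entry.table} WHERE period = ?"
        return {"route": "template", "sql": sql}
    if kind == "forecast":
        return {"route": "predictive_engine"}
    return {"route": "fallback"}
```

In this sketch the language model never writes table or column names itself; they come from the dictionary, which is one plausible reason a template path can be both faster and more reliable.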

The hardest questions — forecasts and what-if scenarios — go to a separate, purpose-built predictive engine rather than asking the language model to do math it can't be trusted on.
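As a rough illustration of keeping arithmetic out of the language model, the sketch below sends forecast questions to a deterministic function. The source does not say what the predictive engine computes; the least-squares trend line, the `handle` function, and the `ask_llm` callback are all assumptions used as stand-ins.

```python
from typing import Callable


def linear_trend_forecast(history: list[float], periods_ahead: int = 1) -> float:
    """Fit a least-squares line to the history and extrapolate it forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + periods_ahead)


def handle(question_type: str, history: list[float], ask_llm: Callable[[], str]) -> str:
    """Send forecasts to the deterministic engine; everything else to the LLM."""
    if question_type == "forecast":
        value = linear_trend_forecast(history)
        return f"Projected next period: {value:.1f}"
    return ask_llm()
```

The point of the design is that the number in the answer comes from code that can be tested, while the language model is limited to understanding the question and phrasing the reply.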

Wrapped around all of it is a testing system that checks every answer for accuracy using custom evals, runs security checks, and compares the leading models (OpenAI, Claude, Gemini) head-to-head to use the best one for the job. The rule: nothing reaches a customer until it passes — and accuracy is judged the way a CFO judges a forecast, on real held-out data, not a one-time demo.
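A release gate of this kind might look like the sketch below, which scores candidate models on a held-out set and refuses to release if none clears the bar. The held-out cases, the stub models, the 1% tolerance, and the 0.95 threshold are illustrative assumptions; the real harness reportedly used Langfuse, Promptfoo and DSPy, none of which are shown here.

```python
from typing import Callable, Optional

Model = Callable[[str], float]


def accuracy(model: Model, cases: list[tuple[str, float]], rel_tol: float = 0.01) -> float:
    """Share of held-out cases answered within a relative tolerance."""
    if not cases:
        raise ValueError("held-out set is empty")
    correct = 0
    for question, expected in cases:
        try:
            got = model(question)
        except Exception:
            continue  # a crash counts as a wrong answer
        if abs(got - expected) <= rel_tol * abs(expected):
            correct += 1
    return correct / len(cases)


def pick_model(
    candidates: dict[str, Model],
    cases: list[tuple[str, float]],
    threshold: float = 0.95,
) -> tuple[Optional[str], dict[str, float]]:
    """Return the best-scoring model, or None if nothing passes the gate."""
    scores = {name: accuracy(fn, cases) for name, fn in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    if scores[best] < threshold:
        return None, scores  # release blocked
    return best, scores


# Illustrative held-out cases and stub models (not real data or real model calls).
HELD_OUT = [("revenue Q1", 100.0), ("revenue Q2", 250.0), ("opex Q1", 75.0), ("opex Q2", 40.0)]
TRUTH = dict(HELD_OUT)


def model_a(question: str) -> float:
    return TRUTH[question]


def model_b(question: str) -> float:
    # Deliberately wrong on one case to show the gate working.
    return TRUTH[question] * (1.5 if question == "opex Q2" else 1.0)
```

Calling `pick_model({"model_a": model_a, "model_b": model_b}, HELD_OUT)` selects `model_a`; with only `model_b` available, the function returns `None` and nothing ships.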

TrustEvals started from the data, not the model. They built a semantic data dictionary mapping the customer's financial data. Validating it surfaced gaps that were fine for a human analyst but left a language model without enough context.

With the map in place, a router classified each incoming question and answered routine ones through templated fast paths built for that question type. The hardest asks, forecasts and what-if scenarios, were sent to a separate, purpose-built predictive engine rather than trusting the LLM with math.

Wrapping everything was a testing system that scored every answer against custom evals on held-out data, ran security checks, and compared OpenAI, Claude and Gemini head-to-head to pick the best model per job. The hard rule: nothing reaches a customer until it passes. Human-in-the-loop flows let the team update the eval set, catch poor answers and remediate immediately.

The lesson they carried forward: start evals on day one and validate the data dictionary before locking architecture. The project was delivered in 4–8 weeks.
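One way to picture validating the data dictionary is a check that flags entries missing the context a language model would need. The required fields (`description`, `unit`, `grain`) and the sample entries below are hypothetical; the source does not list what the actual validation covered.

```python
# Fields assumed, for illustration, to be required for every dictionary entry.
REQUIRED_FIELDS = ("description", "unit", "grain")


def find_gaps(dictionary: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each entry name to the required fields that are missing or blank."""
    gaps: dict[str, list[str]] = {}
    for name, entry in dictionary.items():
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field, "").strip()]
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical entries: an analyst may know 'arr' is in USD, a model does not.
SAMPLE_DICTIONARY = {
    "revenue": {"description": "Recognized revenue", "unit": "USD", "grain": "month"},
    "arr": {"description": "Annual recurring revenue", "unit": "", "grain": "month"},
    "headcount": {"description": "Active employees"},
}
```

Running `find_gaps(SAMPLE_DICTIONARY)` flags `arr` for a blank unit and `headcount` for a missing unit and grain, the kind of gap a human analyst would fill from experience.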

AI Role

Infrastructure

Pinecone (vector store) • ClickHouse • OpenRouter (multi-model: OpenAI, Claude, Gemini) • Jenkins CI

Integration Points

Grounded in the customer's financial data and modeling scenarios via the semantic data dictionary; question router hands forecasts to a separate predictive engine; eval harness (Langfuse, Promptfoo, DSPy) gates every answer before release.

Impact

144% NRR (up from 82%)

Net revenue retention — customers renewed and grew their spend once the copilot was reliable enough to trust.
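For readers unfamiliar with the metric, the sketch below shows the standard net revenue retention arithmetic for a cohort of existing customers. The input figures are made up purely to show the calculation and are not the vendor's numbers.

```python
def net_revenue_retention(
    starting_revenue: float,
    expansion: float,
    contraction: float,
    churn: float,
) -> float:
    """NRR = (starting + expansion - contraction - churn) / starting,
    measured over one period for customers present at the start."""
    if starting_revenue <= 0:
        raise ValueError("starting revenue must be positive")
    return (starting_revenue + expansion - contraction - churn) / starting_revenue


# Made-up example: 200 in starting recurring revenue, 60 of expansion,
# 10 of downgrades, and 20 lost to churn.
example_nrr = net_revenue_retention(200.0, 60.0, 10.0, 20.0)
```

A value above 1.0 (100%) means the existing customer base grew its spend even after downgrades and churn are taken into account.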

60%→95% Output Accuracy

Plus a 20% CSAT improvement: 3x more customer upvotes, 40% fewer downvotes.

Live With 100+ Enterprises

Cleared all security and compliance requirements across Fortune 500 companies.

Best Fit For

Any team building a customer-facing platform with natural-language-to-SQL at its core — ask your data a question in plain English, get the right answer back. It is sharpest in finance, across projections, analytics and reporting.

The conditions that make it pay off: a real, structured database with a deep semantic layer underneath; multiple use cases or customer tenants to support; and a product where accuracy is the bar, not a nice-to-have.

Founder & CEO of TrustEvals. Builds AI evaluation and governance infrastructure for finance, real estate and regulated software — eval harnesses, semantic data dictionaries, and AI audits.

Industry: Financial Services

Business Function: Sales & Revenue

Company Size: 251-1,000

Project Cost: $25K – $100K

Ownership: Venture-Backed

Organization Type: Private Company

AI Pattern: Knowledge Management & Search (RAG)

Value Type: Revenue Growth

AI Model: Claude

Frequently Asked Questions

How did a mid-sized financial services software vendor push its finance copilot's accuracy from 60% to 95% with RAG and evals?

The experts started from the data — building a semantic data dictionary, then a router that classified each question and answered routine ones through templated fast paths while sending forecasts and what-if scenarios to a separate predictive engine. An eval harness scored every answer on held-out data and blocked anything that failed, lifting accuracy from ~60% to 95% and net revenue retention to 144%.

What AI tools and models were used in this finance copilot project?

The stack was built around RAG and a custom eval harness, comparing OpenAI, Claude, and Gemini head-to-head to pick the best model per job, with Claude as a core model. Supporting tools included LangChain, Langfuse, OpenRouter, ClickHouse, Pinecone, Jenkins, DSPy, and Promptfoo.

What results did the software vendor achieve?

Output accuracy rose from 60% to 95% with a 20% CSAT improvement (3x more upvotes, 40% fewer downvotes), the copilot went live with 100+ enterprise customers and cleared Fortune 500 security and compliance, and net revenue retention rose to 144% (up from 82%).

How long did the finance copilot project take?

It was delivered in four to eight weeks. A key lesson carried forward was to start evals on day one and validate the data dictionary before locking the architecture.

Who is this RAG-plus-eval approach best for?

Teams building a customer-facing platform with natural-language-to-SQL at its core — sharpest in finance, and where there is a real structured database with a deep semantic layer, multiple use cases or tenants, and accuracy is the bar rather than a nice-to-have.