

An AI-native FP&A software company sold a natural-language finance copilot to finance teams, including teams at Fortune 500 companies. The copilot stalled at roughly 60% accuracy. The company was losing deals because of incorrect responses, and prospects were choosing more AI-native platforms instead.
The product is a finance copilot that answers questions about a company's numbers, grounded in the customer's actual financial data and financial modeling scenarios.
The team gave the copilot a precise map of that data (a Semantic Data Dictionary). When someone asks a question, the system first works out what kind of question it is, then answers using rules and answer patterns built specifically for that question type (templates for the fast path).
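To make the classify-then-template flow concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the dictionary entries, the table and column names, the keyword rules and the single template are hypothetical, and the source does not say how the real router classifies questions.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical slice of a semantic data dictionary: business term -> where it lives.
SEMANTIC_DICTIONARY = {
    "revenue": {"table": "gl_actuals", "column": "revenue_usd"},
    "opex": {"table": "gl_actuals", "column": "opex_usd"},
}

# Hypothetical fast-path template, one per question type.
TEMPLATES = {
    "metric_by_period": "SELECT SUM({column}) FROM {table} WHERE fiscal_period = :period",
}

PERIOD_PATTERN = re.compile(r"\bq[1-4] \d{4}\b")
FORECAST_PATTERN = re.compile(r"\b(forecast|predict|what if|projection)\b")


@dataclass
class Route:
    question_type: str
    sql: Optional[str] = None
    params: dict = field(default_factory=dict)


def classify(question: str) -> str:
    """Decide what kind of question this is, using simple keyword rules."""
    q = question.lower()
    if FORECAST_PATTERN.search(q):
        return "forecast"
    if PERIOD_PATTERN.search(q) and any(term in q for term in SEMANTIC_DICTIONARY):
        return "metric_by_period"
    return "open_ended"


def route(question: str) -> Route:
    """Fill the template for known question types; otherwise only label the question."""
    question_type = classify(question)
    if question_type != "metric_by_period":
        return Route(question_type)
    q = question.lower()
    term = next(t for t in SEMANTIC_DICTIONARY if t in q)
    entry = SEMANTIC_DICTIONARY[term]
    period = PERIOD_PATTERN.search(q).group(0).upper()
    sql = TEMPLATES[question_type].format(table=entry["table"], column=entry["column"])
    return Route(question_type, sql, {"period": period})
```

In this sketch a question like "What was revenue in Q3 2024?" takes the fast path and comes back as a parameterized query, while anything the rules do not recognize is only labeled, so a slower path can handle it.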
The hardest questions — forecasts and what-if scenarios — go to a separate, purpose-built predictive engine rather than asking the language model to do math it can't be trusted on.
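The source does not describe the predictive engine's internals, so the following is only a stand-in that shows the division of labor: the arithmetic lives in plain, testable code, and the language model is never asked to compute. The ordinary least-squares trend and the function names `linear_trend_forecast` and `apply_what_if` are assumptions made for this sketch.

```python
from __future__ import annotations


def linear_trend_forecast(history: list[float], periods: int) -> list[float]:
    """Extend a series by fitting a straight line with ordinary least squares."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two historical points")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Future periods continue the index: n, n + 1, ...
    return [intercept + slope * (n + k) for k in range(periods)]


def apply_what_if(forecast: list[float], pct_change: float) -> list[float]:
    """Apply a scenario adjustment, e.g. pct_change=10 means +10% on every period."""
    return [value * (1 + pct_change / 100) for value in forecast]
```

Under this split, the language model's job would shrink to extracting parameters (which metric, how many periods, what percentage change) and handing them to deterministic functions like these.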
Wrapped around all of it is a testing system that checks every answer for accuracy using custom evals, runs security checks, and compares the leading models (OpenAI, Claude, Gemini) head-to-head so the best one is used for each job. The rule: nothing reaches a customer until it passes, and accuracy is judged the way a CFO judges a forecast: on real held-out data, not a one-time demo.
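A release gate of this kind can be pictured roughly as follows. This is a sketch under stated assumptions, not the team's harness: the candidates are stand-in Python callables rather than real provider API calls, the 1% tolerance is arbitrary, and the passing threshold is a parameter because the source does not state the bar.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A held-out case pairs a question with the known-correct numeric answer.
Case = Tuple[str, float]
Model = Callable[[str], float]


def accuracy(model: Model, cases: List[Case], rel_tol: float = 0.01) -> float:
    """Share of held-out cases answered within rel_tol of the expected value."""
    correct = 0
    for question, expected in cases:
        try:
            answer = model(question)
        except Exception:
            continue  # a crash counts as a wrong answer
        if abs(answer - expected) <= rel_tol * abs(expected):
            correct += 1
    return correct / len(cases)


def pick_model(
    candidates: Dict[str, Model], cases: List[Case], threshold: float
) -> Tuple[Optional[str], Dict[str, float]]:
    """Score every candidate head-to-head; release the best only if it clears the bar."""
    scores = {name: accuracy(model, cases) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    winner = best if scores[best] >= threshold else None
    return winner, scores
```

Returning `None` when no candidate clears the threshold is the code form of "nothing reaches a customer until it passes": the gate blocks the release instead of shipping the least-bad model.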
TrustEvals started from the data, not the model. They built a semantic data dictionary mapping the customer's financial data, and validating it surfaced gaps that were fine for a human analyst but left a language model without enough context. With the map in place, a router classified each incoming question and answered routine ones through templated fast paths built for that question type. The hardest asks, forecasts and what-if scenarios, were sent to a separate, purpose-built predictive engine rather than trusting the LLM with math. Wrapping everything was a testing system that scored every answer against custom evals on held-out data, ran security checks, and compared OpenAI, Claude and Gemini head-to-head to pick the best model per job. The hard rule: nothing reaches a customer until it passes. Human-in-the-loop flows let the team update the eval set, catch poor answers and remediate them immediately. The lesson they carried forward: start evals on day one and validate the data dictionary before locking the architecture. The work was delivered in 4–8 weeks.
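The dictionary validation step described above can be illustrated with a small check that flags entries a human analyst might read without trouble but that give a language model too little to go on. The required fields (`description`, `unit`, `grain`) and the column names in the usage note are hypothetical; the source does not list what the real validation looked for.

```python
from typing import Dict, List

# Hypothetical minimum context each dictionary entry should carry.
REQUIRED_FIELDS = ("description", "unit", "grain")


def find_gaps(dictionary: Dict[str, dict]) -> Dict[str, List[str]]:
    """Return, per column, the required fields that are missing or blank."""
    gaps: Dict[str, List[str]] = {}
    for column, meta in dictionary.items():
        missing = []
        for field_name in REQUIRED_FIELDS:
            value = meta.get(field_name)
            if not isinstance(value, str) or not value.strip():
                missing.append(field_name)
        if missing:
            gaps[column] = missing
    return gaps
```

For example, an entry such as `"rev_adj": {"description": "", "unit": "USD"}` would be reported as missing `description` and `grain`. Running a check like this before the architecture is fixed is one way to act on the lesson about validating the dictionary early.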
This approach fits any team building a customer-facing platform with natural-language-to-SQL at its core: ask your data a question in plain English, get the right answer back. It is sharpest in finance, across projections, analytics and reporting.
The conditions that make it pay off: a real, structured database with a deep semantic layer underneath; multiple use cases or customer tenants to support; and a product where accuracy is the bar, not a nice-to-have.






