Over the last year, we’ve shipped LLM-powered analytics and decision-support products into production for regulated enterprises: finance, risk, operations, and customer analytics.
2024 was the year we went from “chat with your PDF” toys to production-ready, LLM-backed solutions that sit on top of real data models, with real SLAs and real executives asking, “Can I use this in next month’s review?”
Now, well into 2025, after enough releases, incidents, and “why did it say that?” moments, we keep seeing the same patterns recur. What follows is a distilled set of lessons from that journey.
Start from Workflows, Not from the Model
Most LLM projects start the same way: pick a model, pick a vector DB, build a chat UI, and then ask, “So… what should this actually do?”
The useful ones start from a different place, with questions like:
- Who is the user? Finance lead, risk analyst, or operations manager?
- What are the 10–20 recurring questions they typically ask?
- In which meetings or workflows will this show up (monthly close, quarterly review, or daily monitoring)?
Once those are written down, design stops being abstract, and you can:
- See which datasets matter and which are distractions
- Specify what a “good” answer looks like (table versus narrative, level of aggregation, and caveats to call out)
- Define success metrics in concrete terms: “If we automate these 15 questions reliably, this is worth keeping”
We also learnt to ship narrow slices first: one persona, one workflow, and one coherent data slice, end-to-end. When that slice behaves like a boring internal system (no drama at month-end), then we earn the right to add the next persona or workflow.
If you can’t ignore the system for a week without anxiety, it’s not ready for expansion.
The Data Model Is Your Real Prompt (And It Needs Domain Brains)
LLMs are very good at language, but they are not magic against messy data. Every time we tried to shortcut the data work, the system paid us back with creative but wrong answers: line items moved, new hierarchies appeared, slightly different charts of accounts arrived, and the model happily pretended nothing had changed.
The fixes were predictable but non-negotiable:
- Clear separation of raw versus curated data zones: Raw lands in an immutable store; analytics live in a curated model with its own versions, owners, and release notes.
- Data contracts that humans can read: Metric definitions, grain, tolerances, and refresh cycles
- Schema-aware prompting: The model is given the actual schema snapshot for the current tenant/use case; if that contract breaks, the request fails fast instead of hallucinating.
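A minimal sketch of that fail-fast contract check (the names here, such as `SchemaContract` and `build_schema_prompt`, are hypothetical illustrations, not our actual implementation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaContract:
    """Human-readable data contract for one curated table."""
    table: str
    version: str
    columns: dict[str, str]  # column name -> declared type, e.g. {"revenue": "DECIMAL"}


class ContractViolation(Exception):
    pass


def validate_contract(contract: SchemaContract, live_columns: dict[str, str]) -> None:
    """Fail fast if the live schema has drifted from the contract."""
    missing = set(contract.columns) - set(live_columns)
    changed = {c for c in contract.columns
               if c in live_columns and live_columns[c] != contract.columns[c]}
    if missing or changed:
        raise ContractViolation(
            f"{contract.table}@{contract.version}: "
            f"missing={sorted(missing)}, type_changed={sorted(changed)}"
        )


def build_schema_prompt(contract: SchemaContract, live_columns: dict[str, str]) -> str:
    """Only inject the schema snapshot into the prompt once the contract check passes."""
    validate_contract(contract, live_columns)
    cols = ", ".join(f"{name} {dtype}" for name, dtype in contract.columns.items())
    return f"Table {contract.table} (contract {contract.version}): {cols}"
```

The key design choice is that a broken contract raises before any model call is made, so the user sees a clear failure instead of a confident answer built on a drifted schema.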
That’s the structural part. The second half is domain understanding.
For a demo, you can type “Act as a [domain] expert” in a prompt, pour some definitions into a vector store, and call it done. In production, that falls over quickly. Modern LLMs already know the textbook view of your domain; what they don’t know is your view and your users’ view:
- How does your organization define margin, exposure, churn, and risk buckets?
- Which adjustments are “standard” versus “one-off”?
- Which deviations matter and which are noise?
We ended up needing domain experts in three places:
- Designing the data model and contracts: The schema must reflect how the business thinks, not how a single data engineer saw the source system
- Shaping prompts and behaviors: The prompt’s job is to trigger the right “neural pathways” in the base model so it compares across periods, reconciles narrative with numbers, and applies the house definitions
- Validating ground truth: “Looks right” is not good enough; validation requires worked examples, reconciliations back to known reports, and explicit sign-off from SMEs on what counts as correct or acceptable with caveats
When developers own prompts and domain definitions alone, the result is usually impressive… and quietly wrong.
Measure It Like Any Other Critical System
LLM systems fail differently from traditional apps, but the cure is familiar: instrumentation, evaluation, and cost visibility.
We converged on a surprisingly small, stubborn set of metrics and evaluation signals:
Answer Quality
This is assessed using curated question sets with expected answer characteristics (numeric tolerances, dimension coverage, and mandatory caveats). These are not perfect labels, but they are sufficient to catch drift when models, prompts, or schemas change. We started with approximately 50 curated questions and expanded to roughly 300 over three quarters as new real-world questions kept showing up.
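A single curated-question check can be as small as a numeric tolerance plus a caveat scan. A hypothetical sketch of that shape (not our actual harness):

```python
def check_answer(answer_value: float, answer_text: str,
                 expected_value: float, tolerance: float,
                 required_caveats: list[str]) -> dict:
    """Score one curated question: numeric tolerance plus mandatory caveats."""
    numeric_ok = abs(answer_value - expected_value) <= tolerance
    # Caveats are matched as simple case-insensitive substrings in this sketch
    missing = [c for c in required_caveats if c.lower() not in answer_text.lower()]
    return {
        "numeric_ok": numeric_ok,
        "missing_caveats": missing,
        "status": "pass" if numeric_ok and not missing else "flagged",
    }
```

Running this over the whole curated set after every model, prompt, or schema change is what actually catches drift; no single check is impressive on its own.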
Coverage
What percentage of real user questions can be handled without hand-offs to humans or static reports?
Evaluation Broken Down by Layer
UI, ingestion, retrieval, SQL generation, post-processing, rendering, and even graphing each had their own checks. Latency was tracked per layer as well. “It’s slow” isn’t helpful if you don’t know which part is slow.
Cost per Question
This is derived from tokens, model calls, and a simple view of cloud cost over a representative month.
Behind those metrics, we treated every API call and contract as something to instrument: ingestion pipelines, retrieval hops, SQL generation, model responses, and visualization steps all emitted structured logs with clear pass/fail or “flagged” statuses.
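A minimal sketch of per-layer instrumentation and token-based cost, under stated assumptions: the helper names are hypothetical and the per-token rates are made up for illustration, not any provider’s real pricing.

```python
import time
from contextlib import contextmanager

# Illustrative rates only; real pricing varies by provider and model
PRICE_PER_1K_TOKENS = {"input": 0.0025, "output": 0.01}


@contextmanager
def traced_layer(layer: str, records: list):
    """Time one pipeline layer and emit a structured pass/fail record."""
    start = time.perf_counter()
    record = {"layer": layer, "status": "pass"}
    try:
        yield record  # the layer can flip status to "failed"/"flagged" or add fields
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = str(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        records.append(record)


def cost_per_question(input_tokens: int, output_tokens: int) -> float:
    """Token-based cost of one question, under the illustrative rates above."""
    return (input_tokens / 1000) * PRICE_PER_1K_TOKENS["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K_TOKENS["output"]
```

With each layer wrapped like this, “it’s slow” becomes “SQL generation is slow”, and the monthly cost view is a sum over these records rather than a guess.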
We used LLMs as judges sparingly, and only for things humans genuinely don’t want to do at scale: primarily natural-language tone and surface quality, and never as the sole source of truth for correctness.
On top of the metrics, we learnt to keep rich traces including prompt + response pairs, the executed SQL/queries, and validation outcomes (passed, failed, and flagged).
You need these when:
- An auditor asks, “Why is this metric different from last quarter’s pack?”
- A stakeholder says, “This answer doesn’t match my spreadsheet.”
- Your CFO asks, “What is this copilot costing us, exactly?”
Think of it as unit tests and logging for a probabilistic system. “Ask the LLM again” is not a valid error-handling strategy.
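The trace itself does not need to be fancy; one JSON line per question answers most audit requests. A hypothetical sketch:

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class Trace:
    """One auditable record per question: enough to answer 'why did it say that?'."""
    question: str
    prompt: str
    response: str
    executed_sql: str
    validation: str  # "passed" | "failed" | "flagged"
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def to_audit_line(trace: Trace) -> str:
    """Serialize a trace as one JSON line for an append-only audit log."""
    return json.dumps(asdict(trace), sort_keys=True)
```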
Security and Safety: Think Concentric Circles, Not Magic Guardrails
Most security work looks familiar: provider due diligence (data retention, training, and residency), identity and access, network controls, logging, and compliance. Your cloud and cyber teams already know that playbook.
Where LLM systems add nuance is in how you guard the interaction between user, model, and data. The pattern that worked for us was to think in concentric circles around the data.
1. Outer Ring: Request Filtering
Before anything hits a model, run deterministic checks:
- Length and shape of input (too long, too binary, clearly not natural language)
- Basic language/profanity screening
- Simple pattern checks for prompt injection attempts, raw SQL, schema-dump patterns, etc.
This layer is not glamorous. It’s closer to a WAF for prompts. That’s the point.
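A sketch of that outer ring, with illustrative limits and deny patterns; a real deployment would tune both per tenant and keep the list under version control:

```python
import re

MAX_CHARS = 2000  # illustrative cap on question length
# Illustrative deny patterns: raw SQL, schema-dump probes, common injection phrasing
DENY_PATTERNS = [
    re.compile(r"\b(select|insert|update|delete|drop)\b\s+.*\b(from|into|table)\b", re.I),
    re.compile(r"ignore (all|previous) instructions", re.I),
    re.compile(r"\b(information_schema|show\s+tables|dump\s+schema)\b", re.I),
]


def filter_request(text: str) -> tuple[bool, str]:
    """Deterministic pre-model checks: a WAF for prompts, not a guardrail model."""
    if len(text) > MAX_CHARS:
        return False, "too_long"
    if not text.isprintable() or not any(c.isalpha() for c in text):
        return False, "not_natural_language"
    for pattern in DENY_PATTERNS:
        if pattern.search(text):
            return False, "blocked_pattern"
    return True, "ok"
```

Because every check is deterministic, a blocked request produces the same reason code every time, which makes this layer easy to test and easy to explain to a security reviewer.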
2. Middle Ring: Data Access and Model Context
This is where you make sure the model never sees what it should not see:
- RBAC enforced at the storage/query layer, not just in app code
- Read-only identities wherever possible
- Limits on how much data a single question can pull, in rows and in bytes
- Context construction that only injects authorized slices of data and schema
A model cannot leak what it never had access to.
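As a defense-in-depth sketch of the context-construction step, with hypothetical caps (the primary enforcement still belongs in the storage/query layer, as noted above):

```python
MAX_ROWS = 500        # illustrative per-question row cap
MAX_BYTES = 256_000   # illustrative per-question payload cap


def build_context(user_entitlements: set[str], rows: list[dict],
                  tenant_key: str = "tenant_id") -> list[dict]:
    """Only inject authorized rows into model context, capped in rows and bytes."""
    authorized, size = [], 0
    for row in rows:
        # RBAC should already filter at the query layer; this is a second net
        if row.get(tenant_key) not in user_entitlements:
            continue
        size += len(str(row))
        if len(authorized) >= MAX_ROWS or size > MAX_BYTES:
            break
        authorized.append(row)
    return authorized
```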
3. Inner Ring: Response Shaping and Exfiltration Checks
On the way out:
- Validate that the response stays within the intended scope (no table dumps when the question was about a single entity)
- Detect obvious exfiltration attempts or schema spew
- In higher-risk setups, add a final, deterministic “gatekeeper” step to redact or block responses
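A hypothetical sketch of that deterministic gatekeeper, with illustrative scope and leak checks:

```python
import re

MAX_RESPONSE_ROWS = 50  # illustrative cap for single-entity questions
# Illustrative patterns for schema spew or bulk dumps on the way out
LEAK_PATTERNS = [
    re.compile(r"\bCREATE\s+TABLE\b", re.I),
    re.compile(r"\binformation_schema\b", re.I),
]


def gate_response(text: str, returned_rows: int,
                  question_scope: str) -> tuple[bool, str]:
    """Deterministic exit check: block dumps and schema spew before rendering."""
    if question_scope == "single_entity" and returned_rows > MAX_RESPONSE_ROWS:
        return False, "scope_exceeded"
    for pattern in LEAK_PATTERNS:
        if pattern.search(text):
            return False, "possible_schema_leak"
    return True, "ok"
```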
You still need the usual security architecture around this. But treating the model as one fallible component inside layered defenses is much healthier than assuming “the guardrail will handle it.”
Make It Inspectable If You Want People to Trust and Use It
In theory, we talk about “trustworthy AI.” In practice, most power users and executives want something simpler: “Show me how you got that number.”
Adoption improved significantly when the systems became inspectable by default, enabling you to:
- See the query or filters behind any chart
- Jump from a narrative answer into the underlying rows or documents
- Change the slice yourself (periods, entities, and metric definitions), rather than treating the copilot as a mysterious oracle
The goal is not to turn everyone into a data engineer; it is to make it obvious whether a surprising answer is a genuine signal, a data/model limitation, or just a plain bug.
A reasonable mental model is a “high-performing junior analyst”. Effectiveness improves when they show their work, cite sources, and expect questions.
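One way to make “show your work” concrete is for every answer to carry its own evidence. This sketch (all names hypothetical) shows the shape of such a payload:

```python
from dataclasses import dataclass, field


@dataclass
class InspectableAnswer:
    """An answer that carries its own evidence: query, slice, and sources."""
    narrative: str
    executed_query: str      # surfaced behind "how did you get that number?"
    slice_filters: dict      # periods, entities, metric definitions the user can change
    source_refs: list = field(default_factory=list)  # row or document identifiers


def change_slice(answer: InspectableAnswer, **new_filters) -> dict:
    """Let the user re-slice (period, entity, etc.) instead of re-asking the oracle."""
    return {**answer.slice_filters, **new_filters}
```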
Vendors, Versions, and Failure Modes Are Part of the Design
Finally, the unromantic bit. If you are building on cloud LLMs, assume that over a 12–18-month horizon:
- Models will change behavior, in weird and surprising ways
- SLAs and rate limits will be tested at the worst possible times
- Pricing will move, and new options will appear that you wish you could adopt
In one case, a provider changed the default behavior behind a widely used model alias. Our curated question set suddenly started failing in about 10% of cases; some answers became overcautious and dropped important caveats, others became oddly verbose, and a few began refusing questions they had previously answered.
On another occasion, we walked into an important review with multiple teams exercising the same tenant through automation runs and evaluations. A subtle rate-limit change at the provider level meant that, mid-demo, half the questions started timing out with “try again later” errors. From the user’s point of view, the entire copilot had just fallen over.
In both these cases, nothing “broke” in the infrastructure sense, but the behavior had drifted enough to break functionality or bring the demo crashing down. Those incidents forced us to treat vendors and versions as moving parts. The design responses were simple but non-negotiable:
- Graceful degradation paths: When the model is slow, down, or misbehaving, users still get something: cached answers, simpler template-based reports, or at least a clear message and a link to a canonical dashboard. “Just try again” is not a strategy during a board review.
- Light abstraction over providers: Not a grand orchestration platform for every model under the sun, but enough indirection that we can move a workload from Model A to Model B without rewriting the entire product.
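A sketch of that light indirection plus degradation path, with hypothetical callables standing in for providers:

```python
from typing import Callable, Optional


class ModelUnavailable(Exception):
    """Raised by a provider adapter when it is down, slow, or rate-limited."""


def answer_with_fallback(question: str,
                         primary: Callable[[str], str],
                         secondary: Optional[Callable[[str], str]] = None,
                         cached: Optional[str] = None) -> str:
    """Try providers in order; degrade to a cached answer or a clear message."""
    for model in (m for m in (primary, secondary) if m is not None):
        try:
            return model(question)
        except ModelUnavailable:
            continue  # provider slow, down, or rate-limited: try the next one
    if cached is not None:
        return f"[cached] {cached}"
    return ("The copilot is temporarily unavailable. "
            "See the canonical dashboard for this metric.")
```

The indirection is deliberately thin: each provider is just a callable with one failure type, which is enough to move a workload from Model A to Model B without touching the product code.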
We also started treating model and prompt upgrades like any other production change, with change tickets, limited rollouts, and regression checks against our curated question sets before anything touched a live tenant.
One of the simplest robustness tests we now use is brutal but effective: swap your current model for its “mini” equivalent.
If you find the overall solution turns into a train wreck with wrong answers, broken SQL, incoherent charts, or nonsense caveats, then you’re not ready for enterprise production. The system is overfitting to one model’s quirks.
If, on the other hand, your users say, “The final answers could use some polish, but overall, it’s okay,” then you’re on the right path. The less your product depends on a single model’s personality, the more it behaves like proper enterprise software.
Beyond the Model
Under the hood, these systems involve embeddings, retrieval strategies, prompt patterns, and orchestration. All of that matters to engineers.
But from a product point of view, the lessons that keep repeating are simpler:
- Start from concrete workflows and questions
- Treat the data model plus domain definitions as the true product
- Put domain experts in the loop for design and validation, not just for demos
- Measure quality, coverage, latency, and cost with the same seriousness as any other critical system
- Treat security as layered defense, not a single “safety switch”
- Build for observability and graceful failure, because things will go wrong
Most of this will still apply even if base models get 10x better or 10x cheaper. The specifics of which LLM you call will change, but the hard parts, like understanding workflows, shaping data and domain definitions, building trust, and planning for failure, will not.
Do those well, and “we used an LLM” stops being the headline and becomes a single, unremarkable line in the architecture diagram. For serious enterprise systems, that is exactly where it belongs.