Thoughts on testing LLM-based systems

For the last couple of years, most companies (mine included) have been using LLMs for relatively low-stakes use cases. Chatbots. Summarizing product reviews. Generating first drafts of marketing copy. The kind of features where if the model gets it wrong, it’s a little embarrassing but nobody’s getting sued.

That’s changing. We’re shipping features now where the consequences are real — financial, legal, reputational. The kind where “the model hallucinated” isn’t a funny anecdote, it’s a problem with your name on it.

I’ve been using an analogy in my daily conversations that seems to land. Think about your home’s plumbing. There are really two separate things you’d want to verify. First, the pipes: do they connect correctly, does water come out at the right pressure, does the hot water heater work, does the dishwasher drain to the right place? These are your unit tests.

Second, the water itself: is it clean, is it safe to drink, is the color off? These are your evals — the tests that tell you whether the output is actually any good.

These are almost completely independent concerns. You can have perfect pipes and poisoned water. You can have the cleanest water in the world flowing through busted pipes.

Testing LLM systems works the same way. There’s the deterministic layer around the model, and there’s the quality of the model’s actual output. They require completely different testing approaches, and most teams I’ve seen are only really doing the first one.

The deterministic parts

This is the part we already know how to test. API contracts, integration points, response schemas, latency, error handling, retry logic. Does the prompt get assembled correctly from the template and the user’s input? Does the response get parsed and routed to the right place? Does the fallback kick in when the model returns garbage?

Traditional engineering testing handles this perfectly. Unit tests, integration tests, contract tests. The whole existing playbook works. And you absolutely need it.

But passing all of these tests can give you a false sense of security. Your system works exactly as designed. None of that tells you whether the outputs are actually good.

The non-deterministic parts

This is the hard part, and where I think our industry is still pretty early.

Take an example: say you build a feature that summarizes key clauses and obligations in vendor contracts for a legal team. Your integration tests confirm the model returns valid structured output, extracts the right sections, and handles the expected document formats. All green. But in production, the model quietly drops a liability exclusion from a summary, or mischaracterizes an indemnification clause, and someone signs based on that summary.

All the integration tests passed. The summaries came back in the right format. But the summaries themselves? Sometimes brilliant, sometimes missing critical details, and occasionally just making stuff up with total confidence. The tests couldn’t catch any of this because they were only checking that a summary came back in the right format, not whether the summary was good.

Testing output quality means answering questions like: Is the output factually correct? Is it complete? Does it handle edge cases gracefully or does it hallucinate? When it’s wrong, is it wrong in a way that’s obviously wrong (and therefore catchable by a human), or is it wrong in a way that looks plausible?

Evaluation datasets. This is the most important one, because it’s the only place where human understanding actually touches the process. Maintain curated sets of inputs with known-good outputs, scored by humans. Run these on every model change and every significant prompt change. It’s tedious to build and maintain, but it’s the closest thing you’ll have to a regression suite for output quality. Track scores over time so you can spot degradation.

Domain expert review. Have subject matter experts review a random sample of production outputs weekly. Not engineers, but the people who actually know whether a contract summary or a recommendation makes sense. This catches things that automated checks miss because the failure modes are often domain-specific and subtle.

Adversarial inputs. Specifically test the weird cases. The edge cases, the ambiguous inputs, the inputs that are trying to trick the model. Because in production, you will get weird inputs, and a non-deterministic system’s failure mode on those is usually much worse than a deterministic system’s.

Output guardrails as a testing layer. Run the outputs through a second, simpler check (sometimes another LLM call, sometimes rule-based) that flags things like: does this contain information that wasn’t in the source material? Is this output dramatically different from what we’d expect for this input type?

Multiple runs and variance. Because LLM outputs are non-deterministic, a single pass through your eval set doesn’t tell you enough. Run the same inputs multiple times and measure the variance. An output that’s great on one run and terrible on the next is a different kind of problem than one that’s consistently mediocre — and it requires a different response. If your scores swing wildly across runs, you have a stability problem on top of a quality problem, and no amount of prompt tuning will fix it until you understand which inputs are unstable and why.

Tooling. There are emerging frameworks for this — eval harnesses, LLM testing platforms, that kind of thing. In my own experience, we’ve just built our own test runner and it works great for simple use cases. The important thing is having something that lets you run your eval set repeatedly, track scores over time, and compare across prompt or model changes. Don’t let the tooling question block you from starting.

Run them often. Evals that only run once a quarter aren’t evals, they’re audits. Ideally you’re running them on every meaningful prompt or model change. But full eval suites can be slow and expensive — especially if you’re doing multiple runs for variance. If that’s the case, maintain a smaller smoke test: a curated subset of your most important and most fragile cases that you can run on demand in minutes, not hours. Save the full suite for scheduled runs or major changes, but make the smoke test cheap enough that nobody has an excuse to skip it.

Keep them separate

These should be two distinct pieces of test code. The deterministic layer should be concerned with traditional metrics — test coverage, pass/fail rates, CI integration. The output quality layer is a different beast entirely: scores trending over time, reviewed regularly, where “good enough” is a judgment call that involves the product team, not just engineering.

I’m still figuring this out

I want to be clear: I don’t have this nailed. I’m shipping features where the consequences of being wrong actually matter, and every few weeks I find a new failure mode that my testing didn’t catch. It’s humbling.

But separating these two concerns has at least helped my teams stop conflating very different problems. And it’s helped me have better conversations with product about what “tested” actually means when your system can give a different answer every time you ask it the same question.

If you’ve found approaches that work for testing the output quality side of this, I’d genuinely love to hear about it. I think we’re all still early on this one.