It's 2025, and if you are serious about software development, you probably have some form of automated tests. LLMs, however, are trickier to test, and because they are still new, the industry hasn't settled on what the "best practices" are. In this blog, I'll share how my team approached it, why it can get messy, and a useful trick that made the whole process smoother.
As an industry, we suck at naming conventions. I refer to this concept as testing, but I've also encountered the term LLM evaluations, which seems favored by folks with data science or statistics backgrounds. Either way, what I mean is the process of verifying output from LLMs in an automated fashion.
At Freeday, we started baking ChatGPT into our product in early 2023.
In the beginning, our tests started off simple: call an LLM and assert on the output. Such tests were easy to automate and run in our CI/CD pipeline. As our business grew, the product gained more and more features that relied on LLMs to keep up with customer growth.
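To make that concrete, a test from that era looked roughly like this. This is a minimal sketch, assuming the OpenAI Python client (v1+) and pytest; the model name, prompt, and assertion are illustrative, not our actual code.

```python
# test_classification.py - "ask the model, assert on the answer" style test.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_intent(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the user's intent as one of: cancel, reschedule, other. "
                           "Reply with the label only.",
            },
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip().lower()


def test_cancellation_intent():
    # Assert directly on the output - fine while prompts are small and self-contained.
    assert classify_intent("I want to cancel my booking for tomorrow") == "cancel"
```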
Fast forward to today, and there are no simple prompts living in a single file anymore. Our prompts are composed of multiple pieces, parts of which are dynamic. For example, we might need to inject a user's booking details into the prompt. We also store customer-specific prompts in the database. All of these pieces are stitched together at various points in our codebase before we hand the final prompt off to the LLM.
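A simplified sketch of what that composition looks like is below. The structure mirrors the idea (a base prompt, a customer-specific piece from the database, and dynamic booking details), but every name and field here is hypothetical, not our actual code.

```python
from dataclasses import dataclass


@dataclass
class Booking:
    reference: str
    date: str
    passenger: str


BASE_SYSTEM_PROMPT = "You are a customer-support assistant for {customer_name}."


def load_customer_prompt(customer_id: str) -> str:
    # In the real system this comes from the database; hard-coded here for illustration.
    return "Always answer in the tone of voice described in the style guide."


def build_prompt(customer_id: str, customer_name: str, booking: Booking) -> str:
    parts = [
        BASE_SYSTEM_PROMPT.format(customer_name=customer_name),
        load_customer_prompt(customer_id),
        f"Booking details: reference={booking.reference}, "
        f"date={booking.date}, passenger={booking.passenger}",
    ]
    return "\n\n".join(parts)
```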
One of the challenges with such tests is deciding when to run them in the pipeline. By now, we have hundreds of tests that take a significant amount of time to run. When the code is complex, that complexity spills into the tests, because they require more setup: a database with prepopulated data, all the API keys, and so on.
Taking all things into account, we wanted to be smarter about when we run these tests, so we started creating snapshots of our prompts.
The idea of snapshots is borrowed from a testing approach in web development called visual regression testing. During a visual regression test, a screenshot of a website/web app is taken and committed to the repository. On subsequent code changes, a new screenshot is taken and then compared pixel by pixel with the committed screenshot to produce a pass or a fail.
Instead of generating screenshots, we produce a prompt snapshot, which is just a JSON file that contains 3 things:
That way, instead of running hundreds of tests in our pipeline, we take 2 minutes to generate our snapshots and compare them. This also prompts (😁) the person who raised the pull request to run the tests, fix them if needed, and commit the new snapshots.
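The check itself can be something along these lines: build the prompt, compare it to the committed snapshot, and fail if they differ. The snapshot fields and helper names here are my own illustration, not the exact three items our files contain.

```python
import json
from pathlib import Path

SNAPSHOT_DIR = Path("tests/__prompt_snapshots__")


def write_snapshot(name: str, prompt: str) -> None:
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    snapshot = {"name": name, "prompt": prompt}
    (SNAPSHOT_DIR / f"{name}.json").write_text(json.dumps(snapshot, indent=2))


def assert_matches_snapshot(name: str, prompt: str) -> None:
    path = SNAPSHOT_DIR / f"{name}.json"
    if not path.exists():
        # First run: record the snapshot so the author can review and commit it.
        write_snapshot(name, prompt)
        return
    committed = json.loads(path.read_text())
    assert committed["prompt"] == prompt, (
        f"Prompt '{name}' changed - run the LLM tests for it and commit the new snapshot."
    )
```

The design choice is that the cheap comparison runs on every pipeline, while the expensive LLM tests only need to run for prompts whose snapshots actually changed.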
Our team benefited greatly from reducing time and complexity in our pipeline. We deploy multiple times per day, so waiting less lets us iterate faster. Besides time, LLM tests also cost money, because every token counts, and our snapshots let us focus only on the tests tied to a code change.
The hard part was that we had to build this ourselves. While there are a few tools in the space of LLM testing, I think everything is still in very early stages. I'm looking forward to seeing how this unfolds. How is your team coping with this problem? Would love to hear thoughts on this topic!