About the Presentation
From the Trenches: Automated Testing of Large Language Models
Adopting generative AI and large language models has become a must-have for many enterprises that want to stay ahead of the competition, and the small startup I work for is no exception. In this presentation, I will share the journey behind the testing strategy and implementation that helped us create sustainable, scalable automated tests suited to our needs. While our way of using and testing LLMs is specific to our context, the purpose of using them is not: to significantly lower the barrier for end-users to understand and take action on the data that our application surfaces.
The presentation will break down the journey into these sections:
* Exploration and effective collaboration with all stakeholders: our product managers, engineering decision-makers, and engineers were all learning the ways of large language models together. Sharing findings effectively was critical to speeding up this learning process in our geographically distributed team. Attendees will see the process broken down with screenshots of the tooling.
* Implementation challenges: our team quickly learned that our preferred backend language was unsuitable for the rapid, iterative work that involved almost the entire team. Switching to Python for LLM-related tasks helped us not only collaborate but also take advantage of the vast ecosystem of libraries supporting data science, generative AI, and automated testing.
* Defining the appropriate scope and level of automated testing: we quickly learned that the output of an LLM can vary from one run to the next when using a third-party service, so traditional exact-match assertions are not appropriate in this context. Instead, the test implementation homed in on critical properties of the LLM output, such as whether specific terms appear and how long the responses are. In the presentation, I will show that confirming the absence of particular properties was just as critical in our context. A minimal sketch of this style of assertion follows this list.
* Benefits of automated testing: with the right breadth and depth of automated testing in place, experimenting with new LLM parameters such as temperature, or trying new models or providers, became far less risky, because the suite gave us confidence that regressions would be caught. Attendees will see code excerpts from the implementation; a parameterized sketch appears after this list.
* Leveraging data from production: observability is a critical quality attribute for understanding how the application behaves in a production context. Seeing how prompts, and changes to them, combined with customer data to affect the LLM output was key to validating our assumptions and adjusting the scope and coverage of our automated tests. Screenshots of metrics, traces, and structured logs will show this process in detail; a structured-logging sketch appears below.
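
The following is a minimal pytest sketch of the property-style assertions described above. The required and forbidden terms, the length bounds, and the `generate_summary` helper are illustrative stand-ins, not the actual implementation shown in the talk.

```python
import pytest

# Terms whose presence or absence we treat as critical properties of the output.
REQUIRED_TERMS = ["severity", "recommended action"]
FORBIDDEN_TERMS = ["as an ai language model", "i cannot help"]

def generate_summary(record: dict) -> str:
    """Stand-in for the production code path that prompts the LLM.
    A canned response keeps the sketch self-contained and runnable."""
    return (
        "Severity: high. Recommended action: restrict access to the exposed "
        f"service referenced in finding {record['id']}."
    )

@pytest.mark.parametrize("record", [{"id": 1, "finding": "publicly exposed service"}])
def test_summary_has_critical_properties(record):
    response = generate_summary(record).lower()

    # Assert the presence of critical terms rather than an exact match.
    for term in REQUIRED_TERMS:
        assert term in response, f"missing required term: {term!r}"

    # Assert the absence of properties we never want to surface.
    for term in FORBIDDEN_TERMS:
        assert term not in response, f"unexpected term: {term!r}"

    # Coarse bounds on response length instead of comparing to a golden string.
    assert 50 <= len(response) <= 1200
```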
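
Building on the same idea, here is a hedged sketch of how those property checks can be parameterized over candidate models and temperatures to act as a regression suite. The model names and the `call_llm` helper are hypothetical; in practice the call goes through the provider's SDK.

```python
import pytest

# Candidate configurations under evaluation; the model names are made up here.
CANDIDATE_CONFIGS = [
    ("small-model", 0.0),
    ("small-model", 0.7),
    ("large-model", 0.2),
]

def call_llm(model: str, temperature: float, prompt: str) -> str:
    # In a real suite this would go through the provider SDK; a canned reply
    # keeps the sketch self-contained and runnable.
    return (
        f"[{model} @ temperature={temperature}] Severity: medium. "
        "Recommended action: rotate the affected credentials."
    )

@pytest.mark.parametrize("model,temperature", CANDIDATE_CONFIGS)
def test_candidate_config_keeps_output_properties(model, temperature):
    response = call_llm(model, temperature, "Summarize the finding for a non-expert.").lower()
    # Reusing the same property checks turns experimentation into regression testing.
    assert "severity" in response
    assert "recommended action" in response
    assert len(response) <= 1200
```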
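
Finally, a minimal sketch, assuming plain structured logging to stdout, of the kind of per-call record that makes prompt changes observable in production. The field names and the `prompt-v42` identifier are illustrative, not the real schema.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm.observability")

def log_llm_call(prompt_version: str, model: str, prompt: str, response: str, started: float) -> None:
    # One structured record per LLM call; in production such records feed the
    # dashboards used to correlate prompt changes and customer data with output behaviour.
    logger.info(json.dumps({
        "event": "llm_call",
        "prompt_version": prompt_version,
        "model": model,
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
    }))

started = time.monotonic()
response = "Severity: high. Recommended action: rotate the affected credentials."  # stand-in for a real model response
log_llm_call("prompt-v42", "small-model", "Summarize the finding for a non-expert.", response, started)
```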