Bertold Kolics
Verosint (USA)


Bertold Kolics is a testing specialist with many years of experience working with engineering teams delivering commercial software. Having practiced varied roles at software companies of all sizes during his career, Bertold developed a unique understanding of how businesses can bring value to end users while satisfying customer expectations at the same time. He has had the opportunity to work on legacy software projects, to improve projects inherited through acquisitions, and to bring greenfield projects to market. Bertold enjoys collaborating in cross-functional groups with diverse backgrounds and constantly seeks ways to build high-performing teams. He has also held management positions, where he honed transformational leadership skills grounded in open, honest communication that resonates with both team and business needs. Bertold has first-hand experience building, developing, and testing distributed software products created with languages such as JavaScript, Java, Go, and C/C++.

Bertold currently works at Verosint, a small early-stage software startup that set out to help businesses detect and prevent online account fraud. As its sole software testing specialist, in addition to traditional quality assurance tasks, he works hard to enable his team to elevate its software delivery practices. Modern testing principles guide his approach, where production data, rather than gut feelings, drives decisions.

The sudden emergence of large language models is forcing businesses - including Verosint - to embrace artificial intelligence in their offerings. Bertold has had the chance to experiment with and develop testing strategies to guarantee expected outcomes in this new and unique realm, which goes above and beyond traditional software testing practices. You can find Bertold, his social media links, past talks, and appearances at . He welcomes new connections and is open to discussing software challenges in English and his native language, Hungarian.

About the Presentation

From the Trenches: Automated Testing of Large Language Models

The adoption of generative AI and large language models is a must-have in many enterprises that want to stay ahead of the competition; even the small startup I work for is no exception. In this presentation, I share the journey of developing the testing strategy and implementation that helped us create sustainable and scalable automated tests suited to our needs. While our implementation of using and testing LLMs is specific to our context, our purpose in using them is not: to significantly lower the barrier for end users to understand and act on the data that our application surfaces.


The presentation will break down the journey into these sections:

* Exploration and effective collaboration with all stakeholders: our product managers, engineering decision-makers, and engineers were all learning the ways of large language models together. Sharing findings and lessons effectively was critical to speeding up this learning process in our geographically distributed team. Attendees will see the process broken down with screenshots of the tooling.

* Implementation challenges: our team quickly learned that our preferred backend language was unsuitable for the rapid, iterative work that involved almost the entire team. Switching to Python for LLM-related tasks helped us not just collaborate but also take advantage of the vast ecosystem of libraries supporting data science, generative AI, and automated testing.

* Defining the appropriate scope and level of automated testing: we quickly learned that the output of an LLM can vary from one run to another when using a third-party service, so traditional assertions are not appropriate in this context. Instead, the test implementation homed in on critical properties of the LLM output – the use of specific terms, for example, and other characteristics such as the length of the responses. I will also show that confirming the absence of particular properties was critical in our context.
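The property-based approach described above can be sketched in a few lines. This is a minimal illustration, not the actual test suite: the required terms, forbidden phrases, length bound, and sample response are all invented placeholders.

```python
# Property checks on LLM output instead of exact-match assertions.
# All constants below are illustrative assumptions, not production values.
REQUIRED_TERMS = ["risk", "account"]            # terms the answer must mention
FORBIDDEN_TERMS = ["as an AI language model"]   # phrases that must never appear
MAX_WORDS = 120                                 # upper bound on response length

def check_response(text: str) -> list[str]:
    """Return a list of property violations; an empty list means the output passes."""
    problems = []
    lowered = text.lower()
    for term in REQUIRED_TERMS:
        if term not in lowered:
            problems.append(f"missing required term: {term!r}")
    for phrase in FORBIDDEN_TERMS:
        if phrase.lower() in lowered:
            problems.append(f"forbidden phrase present: {phrase!r}")
    if len(text.split()) > MAX_WORDS:
        problems.append(f"response longer than {MAX_WORDS} words")
    return problems

# A canned string stands in for a real LLM call in this sketch:
sample = "This account shows elevated risk due to repeated failed logins."
print(check_response(sample))  # -> []
```

Because the checks assert properties (presence, absence, length) rather than exact wording, they tolerate run-to-run variation in the model's output while still catching meaningful regressions.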

* Benefits of automated testing: with the right breadth and depth of automated testing in place, we found that running the suite as a regression test when experimenting with new LLM parameters – such as temperature – or when trying new models or providers gave us confidence that regressions would be caught. Attendees will see code excerpts from the implementation.
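A regression sweep of this kind can be sketched as a loop over configurations, with the same property checks applied to every output. The model names, temperature values, and the `fake_completion` stand-in are hypothetical; a real suite would call the actual provider here.

```python
# Regression sweep across LLM configurations (illustrative sketch only).
from itertools import product

MODELS = ["model-a", "model-b"]      # placeholder model identifiers
TEMPERATURES = [0.0, 0.3, 0.7]       # parameter values under experiment

def fake_completion(model: str, temperature: float, prompt: str) -> str:
    # Stand-in for a real provider call; deterministic so the sketch runs offline.
    return f"{model} answered: account risk summary for '{prompt}'"

def passes_properties(text: str) -> bool:
    # Simplified stand-in for the suite's property checks.
    return "risk" in text and len(text.split()) < 120

failures = []
for model, temp in product(MODELS, TEMPERATURES):
    output = fake_completion(model, temp, "summarize account activity")
    if not passes_properties(output):
        failures.append((model, temp))

print(failures)  # an empty list means no configuration regressed
```

The point of the pattern is that swapping a model or nudging a parameter only requires editing the configuration lists; every combination is then re-validated against the same properties.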

* Leveraging data from production: observability is a critical quality attribute for understanding how the application behaves in a production context. Understanding how prompts and prompting changes, combined with customer data, affected the output of LLMs was key to validating our assumptions and adjusting the scope and coverage of our automated tests. Screenshots of metrics, traces, and structured logs will show this process in detail.
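One common way to make LLM calls observable is to emit a structured log record per call, so that prompt versions can be correlated with output characteristics after the fact. The field names and the record shape below are assumptions for illustration, not the production schema.

```python
# Structured, JSON-per-line logging of LLM calls (hypothetical schema).
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("llm")

def llm_call_record(prompt_version: str, model: str, output: str) -> dict:
    # One flat, JSON-serializable record per call; queryable next to metrics and traces.
    return {
        "event": "llm_completion",
        "prompt_version": prompt_version,
        "model": model,
        "output_words": len(output.split()),
    }

record = llm_call_record("prompt-v3", "model-a", "This account shows elevated risk.")
log.info(json.dumps(record))
```

With records like this in a log store, a query over `prompt_version` and `output_words` can reveal how a prompting change shifted response lengths in production, which is exactly the kind of signal that can feed back into the scope of the automated tests.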