Introduction

Testing is the most important step in both developing an LLM application and monitoring its behavior in production. In machine learning and artificial intelligence, testing goes by different names depending on when and how it is done. Testing an LLM application during development is typically called Evaluation. Testing an LLM application's behavior in production (usually for a real-time/online use case) is referred to as Guardrails. We'll cover both of these in more depth.

Evaluation

Evaluation is the testing and assessment of how well an LLM application performs on the task it was designed to solve.

For retrieval-augmented generation (RAG) chatbots, developers evaluate how well the chatbot answers questions. Given the wide array of capabilities LLMs have, evaluating their performance has become significantly more difficult than it was for the non-LLM models of the past. OpenAI releases benchmarks (evaluations) for its models that measure performance on question answering, math, reasoning, multitask language understanding, and more.

Two popular approaches that match the flexibility of LLM applications with an equally flexible evaluation are human-in-the-loop and LLM-as-a-judge. Human-in-the-loop relies on subject matter experts to label outputs and verify whether the LLM application is correct. LLM-as-a-judge uses either the same LLM or another LLM to evaluate the application's performance. Both approaches share an advantage (flexibility) and shortcomings (cost and time).
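As a rough illustration of the LLM-as-a-judge pattern, here is a minimal sketch that asks a judge model to grade a RAG chatbot's answer for faithfulness to the retrieved context. It assumes the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the model name, rubric, and 1-5 scale are illustrative choices, not a prescribed configuration.

```python
# Minimal LLM-as-a-judge sketch (assumes the OpenAI Python SDK is installed
# and OPENAI_API_KEY is set; model name and rubric are illustrative).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG chatbot's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate the answer's faithfulness to the context on a scale of 1-5.
Reply with the number only."""


def judge_answer(question: str, context: str, answer: str) -> int:
    """Ask a judge model to score one answer; returns an integer 1-5."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model could be used here
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0,  # keep the judge as deterministic as possible
    )
    # A sketch-level parse; production code would validate the reply.
    return int(response.choices[0].message.content.strip())


score = judge_answer(
    question="What is the refund window?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
)
print(score)  # expected: a high score for a faithful answer
```

The same loop can be run over a whole evaluation dataset to produce aggregate scores; the judge prompt is typically tailored to the criterion being measured (faithfulness, relevance, tone, etc.).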

LastMile AI combines the advantages of these approaches with a few other traditional ML techniques (active learning, synthetic data generation, and fine-tuning) to provide best-in-class evaluators.

Guardrails

Guardrails are the testing and assessment of the quality of an LLM application's results in a live or production setting.

A general rule of thumb: everything is harder with live data and in production. Guardrails act as the quality control for an LLM returning results in real time.

Considerations for whether a guardrail can be used in an LLM application include the following (see the sketch after this list):

  1. Latency - can a guardrail give results in milliseconds without negatively impacting the user experience?
  2. Consistency - is the guardrail dependable or will it give false positives or false negatives?
  3. Scalability - can the guardrail scale for spikes in user traffic?
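To make the latency consideration concrete, here is a minimal sketch of where a guardrail sits in a request path: the check runs against the model's response under a fixed time budget, and the response is only released if the check passes in time. `looks_safe` is a hypothetical placeholder for a fast, fine-tuned classifier, and the 50 ms budget is an illustrative number, not a recommendation.

```python
# Minimal real-time guardrail sketch using only the standard library.
# `looks_safe` is a hypothetical stand-in for a low-latency classifier.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

LATENCY_BUDGET_S = 0.05  # illustrative 50 ms budget for the guardrail check

_pool = ThreadPoolExecutor(max_workers=1)


def looks_safe(response_text: str) -> bool:
    """Placeholder quality check; a production guardrail would call a
    fast classifier here (toxicity, faithfulness, PII, etc.)."""
    return "I don't know" not in response_text


def guarded(response_text: str) -> str:
    """Return the LLM response only if the guardrail passes within budget."""
    future = _pool.submit(looks_safe, response_text)
    try:
        if future.result(timeout=LATENCY_BUDGET_S):
            return response_text
    except TimeoutError:
        pass  # check did not finish within the budget; fail closed
    return "Sorry, I can't share that answer right now."


print(guarded("Our return policy allows refunds within 30 days."))
```

Whether to fail open (show the response anyway) or fail closed (withhold it) when the check times out is an application-level decision; the sketch fails closed.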

LastMile AI provides the only low-latency fine-tuned guardrails that can be used for production LLM applications.