Experimentation
Track and manage experiments on your LLMs and applications to compare performance, test changes, and validate improvements.
What is an Experiment?
An Experiment is a collection of evaluation runs. Each evaluation run consists of the dataset and metrics for that run. For example, let's say I have a dataset of customer support chatbot questions, answers, and context. For my Customer Support Experiment, I can run this dataset against AutoEval's relevance, toxicity, and faithfulness metrics as an evaluation run. Next, I can update the model (let's say to Gemini) for the chatbot and generate new answers and context for the same set of questions. Then, I can save that as another evaluation run and compare the results to determine whether the model change was an improvement.
Experiments enable you to confidently make iterative changes to your LLM application in a structured and organized way.
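A minimal sketch of the comparison described above, using the API covered in the Usage Guide below (the dataset IDs, names, and metadata here are placeholders, not required values):

```python
from lastmile.lib.auto_eval import AutoEval, BuiltinMetrics

# Connect to AutoEval (replace with your real API key)
client = AutoEval(api_token="YOUR_API_KEY_HERE")

# One experiment groups the evaluation runs you want to compare
experiment = client.create_experiment(
    name="Customer Support Experiment",
    description="Compare chatbot answers before and after a model change",
)

metrics = [BuiltinMetrics.RELEVANCE, BuiltinMetrics.FAITHFULNESS]

# Evaluation run 1: answers produced by the original model
# (the dataset IDs are placeholders for datasets you have already uploaded)
baseline_run = client.evaluate_dataset(
    dataset_id="BASELINE_DATASET_ID",
    metrics=metrics,
    experiment_id=experiment.id,
    metadata={"model": "gpt-4o-2024-08-06"},
)

# Evaluation run 2: answers regenerated with Gemini for the same questions
gemini_run = client.evaluate_dataset(
    dataset_id="GEMINI_DATASET_ID",
    metrics=metrics,
    experiment_id=experiment.id,
    metadata={"model": "gemini-2.0-flash-001"},
)
```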
What types of changes can Experiments measure?
Anything that influences the LLM application's performance is measurable through an experiment, such as:
- Updates to the LLM model, such as a new training data cutoff
- Modifying the retrieval strategy for a RAG system
- Adjusting system prompts for an agent
- And more
Here are the recommended metadata to track for your LLM application:
| | model version | chunk_size | chunk_strategy | top_k_retrieval | dataset | temperature |
|---|---|---|---|---|---|---|
| Example 1 | gpt-4o-2024-08-06 | 1024 | sliding window | 5 | question-answer-v1 | 0 |
| Example 2 | gemini-2.0-flash-001 | 512 | no overlap | 5 | question-answer-v1 | 0 |
When saving model metadata, record the full version string, which typically encodes the training data cutoff. Model providers are constantly updating their latest models with new data, so it's good practice to pin the model version that works best for you.
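For example, this configuration can be captured as a plain dictionary and attached to an experiment later on; the field names below mirror the table above and are a suggested convention, not a required schema:

```python
# Suggested experiment metadata; note the fully pinned model version,
# which includes the snapshot/training-cutoff date.
experiment_metadata = {
    "model_version": "gpt-4o-2024-08-06",
    "chunk_size": 1024,
    "chunk_strategy": "sliding window",
    "top_k_retrieval": 5,
    "dataset": "question-answer-v1",
    "temperature": 0,
}
```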
Usage Guide
This guide walks through the process of setting up and running experiments using AutoEval, including:
- Setting up the API key and creating a project
- Preparing and uploading a dataset
- Creating an Experiment
- Evaluating the dataset against default metrics, logging results, and iterating on changes
- Visualizing the results in the Experiments Console
1. Set Up AutoEval Client
Before running experiments, ensure you have the latest version of AutoEval:
pip install lastmile --upgrade
Authenticate with the LastMile AI API
To interact with the LastMile AI API, set your API key as an environment variable.
📌 Tip: If you don't have an API key yet, visit the LastMile AI dashboard, navigate to the API section on the sidebar, and copy your key.
import os

# Read the API key from the LASTMILE_API_KEY environment variable
api_token = os.environ.get("LASTMILE_API_KEY")

if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_KEY")
else:
    print("✓ API key successfully configured!")
Initialize the AutoEval Client
Once authenticated, initialize the AutoEval client:
from lastmile.lib.auto_eval import AutoEval
client = AutoEval(api_token=api_token) # Optionally set project_id to scope to a specific project
2. Create a Project or Select an Existing Project
A Project is the container that organizes your Experiments, Evaluation runs, and Datasets. It typically corresponds to the AI initiative, application, or use case you’re building.
Projects help keep evaluations structured, especially when managing multiple experiments across different AI models or applications. You can create new projects or use existing ones.
To create a new project programmatically, use:
project = client.create_project(
    name="Example Customer Agent Project",
    description="Example project to evaluate customer support agents"
)

client.project_id = project.id
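If you'd rather use an existing project, scope the client to it instead. A minimal sketch, assuming you already have the project's ID (for example, copied from the Project Dashboard); the ID below is a placeholder:

```python
from lastmile.lib.auto_eval import AutoEval

# Scope the client to an existing project at initialization time
client = AutoEval(
    api_token=api_token,
    project_id="YOUR_EXISTING_PROJECT_ID",
)

# Or set it on an already-initialized client
client.project_id = "YOUR_EXISTING_PROJECT_ID"
```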
3. Prepare and Upload Your Dataset
You can either use an existing dataset or upload a new dataset to run an evaluation within your new experiment. For an existing dataset, you can skip to Step 4.
Uploading a new dataset
LastMile AI AutoEval expects a CSV file with the following columns:
- input: The user's query or input text
- output: The assistant's response to the user's query
- ground_truth (optional): The correct or expected response for comparison. This can also be the context for metrics like faithfulness.
Uploading this dataset allows you to evaluate how well the assistant's responses align with the ground truth using LastMile AI’s evaluation metrics.
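For example, a small dataset CSV with these columns could be produced like this (the file name and rows are hypothetical, written here with Python's standard csv module):

```python
import csv

# Hypothetical example rows matching the expected columns:
# input, output, and (optionally) ground_truth.
rows = [
    {
        "input": "How do I reset my password?",
        "output": "Click 'Forgot password' on the login page and follow the emailed link.",
        "ground_truth": "Users reset passwords via the 'Forgot password' link on the login page.",
    },
    {
        "input": "What are your support hours?",
        "output": "Our support team is available 9am-5pm ET, Monday through Friday.",
        "ground_truth": "Support hours are 9am-5pm ET on weekdays.",
    },
]

with open("customer_support_eval.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "ground_truth"])
    writer.writeheader()
    writer.writerows(rows)
```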
To upload your dataset, use the following code:
dataset_csv_path = "ADD_YOUR_DATASET_HERE"

dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="NAME_OF_YOUR_DATASET",
    description="DESCRIPTION_OF_DATASET"
)

print(f"Dataset created with ID: {dataset_id}")
4. Create an Experiment
The experiment's metadata should track the important parameters of the application under test, such as the LLM being used, the LLM parameters, the prompt version, the dataset version, and the application itself.
To create an experiment, use the following code:
experiment = client.create_experiment(
    name="EXPERIMENT_NAME",
    description="EXPERIMENT_DESCRIPTION",
    metadata={
        "model": "gpt-4o",
        "temperature": 0.8,
        "misc": {
            "dataset_version": "0.1.1",
            "app": "customer-support"
        }
    }
)
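The returned experiment object carries the ID that each evaluation run in the next step is attached to, for example:

```python
print(f"Experiment created with ID: {experiment.id}")
```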
5. Evaluate the Dataset Against Built-in Metrics
In addition to the built-in metrics shown below, you can also specify custom and fine-tuned metrics to evaluate the dataset against.
from lastmile.lib.auto_eval import BuiltinMetrics

default_metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
]

print("Evaluation job kicked off")

evaluation_results = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=default_metrics,
    experiment_id=experiment.id,
    metadata={"extras": "Base metric tests"}
)

print("Evaluation Results:")
evaluation_results.head(10)
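You can also inspect or export the results locally. A minimal sketch, assuming evaluation_results behaves like a pandas DataFrame (as the .head(10) call above suggests); the exact columns depend on the metrics you ran:

```python
# Inspect the result table (column names vary by the metrics selected)
print(evaluation_results.shape)
print(list(evaluation_results.columns))

# Export the full results for offline analysis or sharing
evaluation_results.to_csv("evaluation_results_baseline.csv", index=False)
```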
6. Visualize in the Experiments Console
📊 Explore your results in the AutoEval UI:
- 🔬 Experiments Overview: View all experiments
- 📈 Evaluation Runs: See all evaluation runs
- 🏢 Project Dashboard: Manage projects and experiments
- 📂 Dataset Library: Browse and manage uploaded datasets
Check out the full cookbook for expanded details on this functionality.