Evaluator Fine-tuning
Build your own evaluation metric by fine-tuning alBERTa evaluator models
The AutoEval fine-tuning service enables you to develop models that represent your custom evaluation criteria. The flow is roughly as follows:
- Start with a Dataset of your application data.
- (Assuming you don't have labels) Specify your evaluation criteria and generate labels with LLM Judge++.
- Use the AutoEval fine-tuning service to train an evaluator model on the labeled data so it learns your evaluation criteria.
The fine-tuned model then produces a probability score between 0 and 1 for every request.
Some example evaluators that AutoEval customers have trained:
- A custom response quality metric that includes succinctness, clarity, and accuracy.
- A custom correctness metric for tool use (function calling) in a multi-agent system.
- A custom brand tone metric to measure LLM response adherence to the company's brand tone rubric.
Why fine-tune?
Fine-tuning your own alBERTa evaluator model is much more effective than LLM Judge-style eval approaches for a number of reasons. Unlike an LLM Judge, this model is small, fast, and can include human-in-the-loop refinement:
- better metric quality -- evaluation quality is bounded by the quality of the labels (which can be refined with human feedback), not by an LLM's ability to understand your application context.
- small & fast -- 100x faster than an LLM Judge (inference runs in 10-300ms), which allows these evaluators to run online as guardrails.
- customizable -- the model is simply learning the label distribution, making it easy to teach it any kind of classification task.
You can fine-tune as many evaluators as you want – one for each evaluation criterion you need for your application.
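Each fine-tuned evaluator is referenced as its own Metric, so you can score the same data against several of them in one call. Here is a minimal sketch, assuming two fine-tuned evaluators named "Succinctness" and "Brand Tone" already exist (both names, and the example data, are placeholders):
- python
from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

# Reference each fine-tuned evaluator by name (placeholder names)
succinctness = Metric(name="Succinctness")
brand_tone = Metric(name="Brand Tone")

results_df = client.evaluate_data(
    data=pd.DataFrame({
        "input": ["How do I reset my password?"],
        "output": ["Go to Settings > Security and click Reset Password."],
    }),
    metrics=[succinctness, brand_tone],  # one score per metric, per row
)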
Usage Guide
Upload Datasets
Create a Dataset containing the application trace data to use for fine-tuning.
We recommend at least a few hundred examples (256 minimum); a few thousand is ideal.
During the fine-tuning flow you can choose which columns to include in training. Make sure that the data contains at least one of the following columns:
- input
- output
- ground_truth
Don't have app data handy? No problem - check out our Example Datasets to get synthetic datasets to try out the platform.
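If your application traces are already exported to a file, you can create the Dataset programmatically with the same upload_dataset call used later in this guide. A minimal sketch, assuming a local app_traces.csv with input, output, and ground_truth columns (the file name, dataset name, and description are placeholders):
- python
from lastmile.lib.auto_eval import AutoEval

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

# Upload application trace data to create a Dataset for fine-tuning
dataset_id = client.upload_dataset(
    file_path="app_traces.csv",  # CSV with input / output / ground_truth columns
    name="My Application Traces",
    description="Application trace data to label and fine-tune on"
)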
Label data
Use the LLM Judge++ flow to define your evaluation criteria and generate labels.
Training and test data split (optional)
Split the dataset into a training dataset and a test (holdout) dataset.
We recommend an 80/20 or 90/10 split.
You can do this via API:
- python
from lastmile.lib.auto_eval import AutoEval, Metric
from sklearn.model_selection import train_test_split
import pandas as pd
client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")
dataset_df = client.download_dataset(dataset_id=dataset_id) # Labeled dataset from previous step
# Split the data into training and test sets
train_df, test_df = train_test_split(dataset_df, test_size=0.2, random_state=42)
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)
test_dataset_id = client.upload_dataset(
    file_path="test.csv",  # Your test dataset file
    name="My Test Dataset for Fine-tuning",
    description="Test dataset for evaluating the fine-tuned model"
)
train_dataset_id = client.upload_dataset(
    file_path="train.csv",
    name="My Training Dataset for Fine-tuning",
    description="Training dataset for the fine-tuned model"
)
Fine-tune model
At this point we have labeled application data to train and test the model. Time to fine-tune!
API
- python
# Continuation from previous step
fine_tune_job_id = client.fine_tune_model(
    train_dataset_id=train_dataset_id,
    test_dataset_id=test_dataset_id,
    model_name="My Custom Evaluation Metric",
    selected_columns=["input", "output", "ground_truth"],  # You can decide which of these columns to include in the training data
    wait_for_completion=False  # Set to True for blocking
)
print(f"Fine-tuning job initiated with ID: {fine_tune_job_id}. Waiting for completion...")
client.wait_for_fine_tune_job(fine_tune_job_id)
print(f"Fine-tuning job completed with ID: {fine_tune_job_id}")
UI
- Navigate to Model Console and click Fine-Tune a Model.
- Fill out the Fine-tuning form and click Submit to start the training job.
- Track progress in the Model Console, including training metrics such as loss/accuracy.
Info: Set up your Weights & Biases API key if you prefer to have the training run data logged to your own W&B account.
Use fine-tuned model
Once the model is trained and deployed on the inference server, it will be listed as 🟢 Online in the dashboard.
You can try it out directly in the playground, and see its training metrics in the Fine-Tune Info tab.
API
Use the API to run evals with your new model.
- python
fine_tuned_metric = Metric(name="My Custom Evaluation Metric") # Reference the fine-tuned model by name or id
# Run evals on your test/holdout dataset to see how the model is performing
test_results_df = client.evaluate_dataset(test_dataset_id, fine_tuned_metric)
# Run evals on any application data
eval_results_df = client.evaluate_data(
    # data should include the columns that the model expects
    data=pd.DataFrame({
        "input": ["What is the meaning of life?"],
        "output": ["42"],
        "ground_truth": ["Life, universe and everything"]
    }),
    metrics=[fine_tuned_metric],
)
Weights & Biases integration
AutoEval allows you to track detailed training runs in your own Weights & Biases account. To do so, navigate to the API Keys console and save your W&B API key. Subsequent fine-tuning runs will be tracked in your account.