Introduction
LastMile is the full-stack developer platform to debug, evaluate, and improve LLM applications. We make it easy to fine-tune custom evaluators, set up guardrails, and monitor app performance.
- python
- node.js
from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

# Evaluate one example with the built-in Faithfulness metric.
result = AutoEval().evaluate_data(
    data=pd.DataFrame({
        "input": ["Where did the author grow up?"],
        "output": ["France"],
        "ground_truth": ["England"],
    }),
    metrics=[Metric(name="Faithfulness")],
)
print("Evaluation result:", result)
import { AutoEval, Metric, BuiltinMetrics } from "lastmile/lib/auto_eval";

const client = new AutoEval();
const result = await client.evaluateData(
  /*data*/ [
    {
      input: "Where did the author grow up?",
      output: "France",
      ground_truth: "England",
    },
  ],
  /*metrics*/ [BuiltinMetrics.FAITHFULNESS]
);
console.table(result);
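
Both snippets pass rows with the same three fields: `input`, `output`, and `ground_truth`. Before sending a larger batch, it can help to check that every row has them. The sketch below is a minimal, standard-library-only check. The helper name `find_invalid_rows` is hypothetical and not part of the LastMile SDK, and treating all three fields as required non-empty strings is an assumption made for illustration.

```python
# Field names are taken from the snippets above. Treating all three as
# required, non-empty strings is an assumption for this sketch.
REQUIRED_FIELDS = ("input", "output", "ground_truth")


def find_invalid_rows(rows, required=REQUIRED_FIELDS):
    """Return (row_index, problem) pairs for rows that are not ready to evaluate."""
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            # Flag fields that are absent, not strings, or blank.
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty '{field}'"))
    return problems


rows = [
    {"input": "Where did the author grow up?", "output": "France", "ground_truth": "England"},
    {"input": "Where did the author grow up?", "output": "", "ground_truth": "England"},
]
print(find_invalid_rows(rows))
```

Rows that pass the check can then be handed to `evaluate_data` as a DataFrame, as in the Python snippet above.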
Design your own metric
Use the fine-tuning service to design your own evaluators that capture the quality criteria specific to your app.
1. Create Datasets
Upload and manage application data for running and training evals, and generate synthetic labels.
2. Synthetic Labeling
Generate high-quality labels for your data using LLM Judge, with a human in the loop to refine the synthetic labels.
3. Fine-tune Evaluators
Use the AutoEval fine-tuning service to develop custom metrics for your application.
4. Run Evals
Compute metrics by running high-performance inference using a prebuilt or fine-tuned model.
Out-of-the-box metrics
Batteries-included evaluation metrics covering common AI application types, such as RAG and multi-agent compound AI systems.
Faithfulness
Measures how closely an LLM response adheres to the provided context. Often used for hallucination detection.
Relevance
Measures semantic similarity between two strings. Often used for context relevance, input/output relevance, or similarity between a response and the ground truth.
Summarization Quality
Quantifies the quality of a summarization response.
Toxicity
Quantifies the toxicity level in an LLM response.
More
Explore other metrics available in AutoEval, or design your own metric.
Meet alBERTa 🍁
alBERTa is a family of small language models (SLMs) designed for evaluation. They are optimized to be:
- small -- 400M-parameter entailment model
- fast -- can run inference on CPU in under 300 ms
- customizable -- fine-tune for custom evaluation tasks
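
To check a latency target like the one above on your own hardware, a small timing harness is usually enough. The sketch below uses only the standard library. `measure_latency_ms` and `placeholder_scorer` are hypothetical names, and the placeholder stands in for a real evaluator call, so the numbers it prints say nothing about alBERTa itself.

```python
import statistics
import time


def measure_latency_ms(score_fn, example, runs=20):
    """Time repeated calls to score_fn and summarize latency in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        score_fn(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


def placeholder_scorer(example):
    # Hypothetical stand-in for a real evaluator call; always returns 0.5.
    return 0.5


stats = measure_latency_ms(placeholder_scorer, {"input": "q", "output": "a"})
print(stats)
print("median within 300 ms budget:", stats["median_ms"] < 300)
```

Swapping `placeholder_scorer` for a function that calls your actual evaluator gives a rough per-request latency figure for your setup.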
alBERTa-512 🍁
512-token context, specialized for evaluation tasks (like faithfulness), and returns a numeric score between 0 and 1.
alBERTa-LC-8k 🍁
Long-context variant that can scale up to 128k tokens using a scaled dot-product attention layer.
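
Because the models return a numeric score between 0 and 1, a basic guardrail can be a simple threshold check. The following is a minimal sketch and not the LastMile guardrail API: `apply_guardrail`, the default threshold of 0.5, and the fallback message are all hypothetical, and it assumes a higher score means a better (for example, more faithful) response.

```python
def apply_guardrail(response, score, threshold=0.5,
                    fallback="Sorry, I can't answer that reliably."):
    """Return the response if its score passes the threshold, else a fallback.

    Assumes score is in [0, 1] and that higher means better.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return response if score >= threshold else fallback


# Example scores are made up for illustration.
print(apply_guardrail("The author grew up in England.", score=0.92))
print(apply_guardrail("The author grew up in France.", score=0.08))
```

In practice the threshold would be tuned on labeled examples from your own application rather than fixed at 0.5.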
Explore our guides
Quickstart
Start-to-finish overview of AutoEval, from running evals and labeling with LLM Judge to fine-tuning a custom metric.
Retrieval systems
Evaluate a RAG application for hallucination, relevance, and other out-of-the-box metrics available via AutoEval.
Real-time guardrails
Build real-time guardrails in a RAG application using fine-tuned alBERTa 🍁 models.