Introduction
LastMile is the full-stack developer platform for debugging, evaluating, and improving LLM applications. We make it easy to fine-tune custom evaluators, set up guardrails, and monitor app performance.
- python
from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

result = AutoEval().evaluate_data(
    data=pd.DataFrame({
        "input": ["Where did the author grow up?"],
        "output": ["France"],
        "ground_truth": ["England"]
    }),
    metrics=[Metric(name="Faithfulness")]
)
print("Evaluation result:", result)
- node.js
import { Lastmile, Metric } from 'lastmile';

const client = new Lastmile();
const response = await client.evaluation.evaluate({
  input: ["Where did the author grow up?"],
  output: ["France"],
  groundTruth: ["England"],
  metric: new Metric({ name: "Faithfulness" })
});
Design your own metric
Use the fine-tuning service to design your own evaluators that capture custom quality criteria for your application. The four steps below outline the workflow, and a code sketch follows the list.
1. Create Datasets>
Upload and manage application data for running and training evals, and generate synthetic labels.
2. LLM Judge Active Labeling>
Generate high-quality labels for your data using LLM Judge with human-in-the-loop review.
3. Fine-tune Models>
Use the AutoEval fine-tuning service to develop custom metrics for your application.
4. Run Evals>
Compute metrics by running high-performance inference on a prebuilt or fine-tuned model.
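The steps above map onto the AutoEval Python client. The sketch below is illustrative only: apart from evaluate_data (shown in the quickstart), the method names and parameters here (upload_dataset, label_dataset, fine_tune_model, evaluate_dataset) are assumptions made for this example, so check the SDK reference for the exact calls.

from lastmile.lib.auto_eval import AutoEval, Metric

client = AutoEval()

# 1. Create a dataset from application traces.
#    `upload_dataset` and its parameters are assumed here for illustration.
dataset_id = client.upload_dataset(
    file_path="rag_traces.csv",
    name="RAG traces",
)

# 2. Generate synthetic labels with LLM Judge (human-in-the-loop review
#    happens in the console). `label_dataset` is likewise an assumption.
client.label_dataset(
    dataset_id=dataset_id,
    metric=Metric(name="Faithfulness"),
)

# 3. Fine-tune a custom evaluator on the labeled data.
#    `fine_tune_model` is an assumed method name.
client.fine_tune_model(
    train_dataset_id=dataset_id,
    model_name="my-faithfulness-judge",
)

# 4. Run evals against the fine-tuned metric (`evaluate_dataset` is assumed;
#    `evaluate_data`, shown in the quickstart, works for in-memory DataFrames).
result = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=[Metric(name="my-faithfulness-judge")],
)
print(result)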
Out-of-the-box metrics
Batteries-included evaluation metrics covering common AI application types, such as RAG and multi-agent compound AI systems. A combined usage example follows the list below.
Faithfulness>
Measures how adherent or faithful an LLM response is to the provided context. Often used for hallucination detection.
Semantic Similarity>
Measures semantic similarity between two strings. Often used for context relevance, input/output relevance, or similarity between a response and the ground truth.
Summarization Quality>
Quantifies the quality of a summarization response.
Toxicity>
Quantifies the toxicity level of an LLM response.
More>
Explore other metrics available in AutoEval, or keep reading to design your own metric.
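All of these plug into the same evaluate_data call from the quickstart, so you can score the same rows against several built-in metrics at once. A minimal sketch, assuming the name strings below ("Semantic Similarity", "Toxicity") match the metric titles above:

from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

df = pd.DataFrame({
    "input": ["Where did the author grow up?"],
    "output": ["France"],
    "ground_truth": ["England"],
})

# Score the same rows against several built-in metrics in a single call.
result = AutoEval().evaluate_data(
    data=df,
    metrics=[
        Metric(name="Faithfulness"),
        Metric(name="Semantic Similarity"),  # assumed name string
        Metric(name="Toxicity"),             # assumed name string
    ],
)
print(result)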
Meet alBERTa 🍁
alBERTa is a family of small language models (SLMs) designed for evaluation. They are optimized to be:
- small -- 400M parameter entailment model
- fast -- can run inference on CPU in < 300ms
- customizable -- fine-tune for custom evaluation tasks
alBERTa-512 🍁>
2048-token context window, specialized for evaluation tasks (like faithfulness); returns a numeric score from 0 to 1.
alBERTa-LC-8k 🍁>
Long-context variant that can scale to 128k+ tokens using a scaled dot-product attention layer.
Explore our guides
Quickstart>
Start-to-finish overview of AutoEval, from running evals and labeling with LLM Judge to fine-tuning a custom metric.
Retrieval systems>
Evaluate a RAG application for hallucination, relevance and other out-of-the-box metrics available via AutoEval.
Real-time guardrails>
Build real-time guardrails in a RAG application using fine-tuned alBERTa 🍁 models.
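As a taste of the guardrails pattern, here is a minimal sketch that thresholds a faithfulness score to block likely hallucinations before a response is returned. The 0.5 cutoff, the use of retrieved context as the reference, and the assumption that evaluate_data returns a DataFrame whose last column is a 0-to-1 score are illustrative choices, not confirmed SDK behavior.

from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

client = AutoEval()

def is_faithful(question: str, answer: str, context: str, threshold: float = 0.5) -> bool:
    # Score the answer against the retrieved context for faithfulness.
    result = client.evaluate_data(
        data=pd.DataFrame({
            "input": [question],
            "output": [answer],
            "ground_truth": [context],  # retrieved context as the reference
        }),
        metrics=[Metric(name="Faithfulness")],
    )
    # Assumption: the result is a DataFrame whose last column is a 0-1 score.
    score = float(result.iloc[0, -1])
    return score >= threshold  # illustrative cutoff; tune for your application

# Block the response if it is not grounded in the retrieved context.
if not is_faithful(
    "Where did the author grow up?",
    "France",
    "The author grew up in England before moving abroad.",
):
    print("Response blocked: likely hallucination.")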