Introduction
LastMile is the full-stack developer platform for debugging, evaluating, and improving LLM applications. We make it easy to fine-tune custom evaluators, set up guardrails, and monitor app performance.
Python:
```python
from lastmile import LastMile

LastMile.eval("Hello world")
```
Node.js:
```javascript
const { LastMile } = require('lastmile');

LastMile.eval("Hello world");
```
Meet alBERTa 🍁
alBERTa is a family of small language models designed for evaluation. They are optimized to be:
- small -- a 400M-parameter entailment model
- fast -- runs inference on CPU in < 300ms
- customizable -- can be fine-tuned for custom evaluation tasks
alBERTa-512 🍁
A 2048-token-context model specialized for evaluation tasks (like faithfulness) that returns a numeric score between 0 and 1.
alBERTa-LC-8k 🍁
A long-context variant that scales to 128k+ tokens using a scaled dot-product attention layer.
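To make the mechanics concrete: an entailment-style evaluator scores whether a response is supported by its context, and that entailment probability is what surfaces as the 0-to-1 score. The sketch below illustrates the idea using a generic public NLI model from Hugging Face as a stand-in; it is not alBERTa itself, and the model name and label mapping are specific to that stand-in.

```python
# Illustration only: a generic NLI model standing in for an alBERTa-style evaluator.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

context = "The Eiffel Tower is located in Paris, France."  # premise
response = "The Eiffel Tower is located in Berlin."        # hypothesis

inputs = tokenizer(context, response, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# For roberta-large-mnli the labels are [contradiction, neutral, entailment];
# the entailment probability plays the role of a faithfulness-style 0-to-1 score.
print(f"faithfulness-like score: {probs[2].item():.3f}")
```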
Out-of-the-box metrics
Faithfulness
Measures how faithful an LLM response is to the provided context. Often used for hallucination detection.
Semantic Similarity
Measures the semantic similarity between two strings. Often used for context relevance, input/output relevance, or similarity between a response and ground truth.
Summarization Quality
Quantifies the quality of a summarization response.
Toxicity
Quantifies the toxicity level of an LLM response.
More
Explore other metrics available in AutoEval, or keep reading to design your own metric.
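Whichever metric you use, the output is a numeric score, which makes it straightforward to wire into a guardrail. Here is a minimal sketch of that pattern; `score_faithfulness` is a hypothetical placeholder for whatever call returns the metric's 0-to-1 score, not the actual AutoEval API.

```python
def score_faithfulness(response: str, context: str) -> float:
    # Hypothetical placeholder: swap in the real evaluator call here.
    # A fixed value keeps this sketch runnable on its own.
    return 0.42

def apply_guardrail(response: str, context: str, threshold: float = 0.7) -> str:
    """Return the response only if its faithfulness score clears the threshold."""
    if score_faithfulness(response, context) < threshold:
        return "I couldn't verify that answer against the provided context."
    return response

print(apply_guardrail("The Eiffel Tower is in Berlin.",
                      "The Eiffel Tower is located in Paris, France."))
```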
Design your own metric
1. Create Datasets
Upload and manage application data for running and training evals, and generate synthetic labels.
2. LLM Judge Active Labeling
Generate high-quality labels for your data using LLM Judge with a human in the loop.
3. Fine-tune Models
Use the AutoEval fine-tuning service to develop custom metrics for your application.
4. Run Evals
Compute metrics by running high-performance inference on a prebuilt or fine-tuned model.
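Taken together, the four steps form one loop: build a dataset, label it, fine-tune, then score new traffic with the resulting model. The outline below only sketches that flow; every function in it is a hypothetical stub standing in for the corresponding AutoEval operation, not the actual SDK surface.

```python
# Hypothetical stubs: each function stands in for the matching AutoEval step.

def create_dataset(rows):                           # 1. Create Datasets
    return {"id": "ds_demo", "rows": rows}

def label_with_llm_judge(dataset):                  # 2. LLM Judge Active Labeling
    for row in dataset["rows"]:
        row["label"] = row.get("label", 1)          # human-in-the-loop review would adjust these
    return dataset

def fine_tune(dataset, base_model="alBERTa-512"):   # 3. Fine-tune Models
    return {"model": f"{base_model}-custom", "trained_on": dataset["id"]}

def run_eval(model, row):                           # 4. Run Evals
    return 0.5                                      # stand-in for the model's 0-to-1 score

rows = [{"input": "...", "output": "..."}]
custom_model = fine_tune(label_with_llm_judge(create_dataset(rows)))
scores = [run_eval(custom_model, row) for row in rows]
print(custom_model["model"], scores)
```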