Introduction
LastMile is the full-stack developer platform for debugging, evaluating, and improving LLM applications. We make it easy to fine-tune custom evaluators, set up guardrails, and monitor app performance.
- python
from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

result = AutoEval().evaluate_data(
    data=pd.DataFrame({
        "input": ["Where did the author grow up?"],
        "output": ["France"],
        "ground_truth": ["England"]
    }),
    metrics=[Metric(name="Faithfulness")]
)
print("Evaluation result:", result)
- node.js
import { Lastmile, Metric } from 'lastmile';

const client = new Lastmile();
const response = await client.evaluation.evaluate({
  input: ["Where did the author grow up?"],
  output: ["France"],
  groundTruth: ["England"],
  metric: new Metric({ name: "Faithfulness" })
});
Design your own metric
Use the fine-tuning service to design your own evaluators that capture custom quality criteria for your application. The four steps below outline the workflow, and a code sketch follows the list.
1. Create Datasets>
Upload and manage application data for running and training evals, and generate synthetic labels.
2. LLM Judge Active Labeling>
Generate high-quality labels for your data using LLM Judge with human-in-the-loop review.
3. Fine-tune Models>
Use the AutoEval fine-tuning service to develop custom metrics for your application.
4. Run Evals>
Compute metrics by running high-performance inference on a prebuilt or fine-tuned model.
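The steps above map onto the AutoEval Python client. The sketch below is illustrative only: apart from evaluate_data (shown in the quickstart), the method names and parameters here (upload_dataset, label_dataset, fine_tune_model, evaluate_dataset) are assumptions made for this example, so check the SDK reference for the exact calls.

from lastmile.lib.auto_eval import AutoEval, Metric

client = AutoEval()

# 1. Create a dataset from application traces.
#    `upload_dataset` and its parameters are assumed here for illustration.
dataset_id = client.upload_dataset(
    file_path="rag_traces.csv",
    name="RAG traces",
)

# 2. Generate synthetic labels with LLM Judge (human-in-the-loop review
#    happens in the console). `label_dataset` is likewise an assumption.
client.label_dataset(
    dataset_id=dataset_id,
    metric=Metric(name="Faithfulness"),
)

# 3. Fine-tune a custom evaluator on the labeled data.
#    `fine_tune_model` is an assumed method name.
client.fine_tune_model(
    train_dataset_id=dataset_id,
    model_name="my-faithfulness-judge",
)

# 4. Run evals against the fine-tuned metric (`evaluate_dataset` is assumed;
#    `evaluate_data`, shown in the quickstart, works for in-memory DataFrames).
result = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=[Metric(name="my-faithfulness-judge")],
)
print(result)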
Out-of-the-box metrics
Batteries-included evaluation metrics covering common AI application types, such as RAG and multi-agent compound AI systems. A combined usage example follows the list below.
Faithfulness>
Measures how adherent or faithful an LLM response is to the provided context. Often used for hallucination detection.
Semantic Similarity>
Measures semantic similarity between two strings. Often used for context relevance, input/output relevance, or similarity between a response and the ground truth.
Summarization Quality>
Quantifies the quality of a summarization response.
Toxicity>
Quantifies the toxicity level of an LLM response.
More>
Explore other metrics available in AutoEval, or keep reading to design your own metric.
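All of these plug into the same evaluate_data call from the quickstart, so you can score the same rows against several built-in metrics at once. A minimal sketch, assuming the name strings below ("Semantic Similarity", "Toxicity") match the metric titles above:

from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

df = pd.DataFrame({
    "input": ["Where did the author grow up?"],
    "output": ["France"],
    "ground_truth": ["England"],
})

# Score the same rows against several built-in metrics in a single call.
result = AutoEval().evaluate_data(
    data=df,
    metrics=[
        Metric(name="Faithfulness"),
        Metric(name="Semantic Similarity"),  # assumed name string
        Metric(name="Toxicity"),             # assumed name string
    ],
)
print(result)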
Meet alBERTa 🍁
alBERTa is a family of small language models (SLMs) designed for evaluation. They are optimized to be:
- small -- 400M parameter entailment model
- fast -- can run inference on CPU in < 300ms
- customizable -- fine-tune for custom evaluation tasks
alBERTa-512 🍁>
2048-token context window, specialized for evaluation tasks (like faithfulness); returns a numeric score from 0 to 1.
alBERTa-LC-8k 🍁>
Long-context variant that can scale to 128k+ tokens using a scaled dot-product attention layer.
Explore our guides
Quickstart>
Start-to-finish overview of AutoEval, from running evals and labeling with LLM Judge to fine-tuning a custom metric.
Retrieval systems>
Evaluate a RAG application for hallucination, relevance and other out-of-the-box metrics available via AutoEval.
Real-time guardrails>
Build real-time guardrails in a RAG application using fine-tuned alBERTa 🍁 models.
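As a taste of the guardrails pattern, here is a minimal sketch that thresholds a faithfulness score to block likely hallucinations before a response is returned. The 0.5 cutoff, the use of retrieved context as the reference, and the assumption that evaluate_data returns a DataFrame whose last column is a 0-to-1 score are illustrative choices, not confirmed SDK behavior.

from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

client = AutoEval()

def is_faithful(question: str, answer: str, context: str, threshold: float = 0.5) -> bool:
    # Score the answer against the retrieved context for faithfulness.
    result = client.evaluate_data(
        data=pd.DataFrame({
            "input": [question],
            "output": [answer],
            "ground_truth": [context],  # retrieved context as the reference
        }),
        metrics=[Metric(name="Faithfulness")],
    )
    # Assumption: the result is a DataFrame whose last column is a 0-1 score.
    score = float(result.iloc[0, -1])
    return score >= threshold  # illustrative cutoff; tune for your application

# Block the response if it is not grounded in the retrieved context.
if not is_faithful(
    "Where did the author grow up?",
    "France",
    "The author grew up in England before moving abroad.",
):
    print("Response blocked: likely hallucination.")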