Evaluator Models

AutoEval models for evaluation tasks

AutoEval ships with small language models (SLMs) optimized for evaluation tasks. The Metrics section lists all out-of-the-box metrics available on the platform.

In addition, if you need to design your own metric, you can fine-tune LastMile's alBERTa 🍁 base model to build a custom evaluator.

alBERTa 🍁

alBERTa is a 400M-parameter BERT model trained for natural language inference (NLI) tasks and optimized for evaluation workloads. It works best when an evaluation question is framed in the premise/hypothesis/entailment formulation, and it returns a numeric probability score (0 to 1), which makes it well suited for computing metrics.
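
For intuition, here is a minimal sketch (plain Python, not the AutoEval API) of how a faithfulness check maps onto that premise/hypothesis formulation; the entailment probability itself comes from the model, as shown in the Usage Guide below.

# Illustrative only: framing a faithfulness check as an NLI pair.
# The context acts as the premise and the model's claim as the hypothesis;
# alBERTa scores how strongly the premise entails the hypothesis (0 to 1).

def build_nli_pair(context: str, claim: str) -> dict:
    """Frame a faithfulness check as premise (context) vs. hypothesis (claim)."""
    return {"premise": context, "hypothesis": claim}

pair = build_nli_pair(
    context="The author was born and raised in England.",
    claim="The author grew up in France.",
)
print(pair)
# A low entailment probability here would indicate the claim is not
# supported by the context, i.e. low faithfulness.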

Key value props:

  • small -- only 400M parameters
  • fast -- can run inference on CPU in < 300ms
  • customizable -- fine-tune for custom evaluation tasks
  • self-hostable -- available for VPC deployment

There are two variants in the alBERTa model family:

model          | context window | latency | fine-tune | description
alBERTa-512    | 512 tokens     | <10ms   | Yes       | Specialized for evaluation tasks (e.g. faithfulness); available for fine-tuning.
alBERTa-LC-8k  | 8192 tokens    | <400ms  |           | Long-context variant that can scale to 128k+ tokens using a scaled dot-product attention layer.


Usage Guide

You can run evals from both the Model Console dashboard and the API.

Console

You can run a one-off evaluation from the model playground. Click any model in the Model Console, then click Run Model (the play button) to compute a score on the data you provide.

(Screenshot: Model Playground)

info

Depending on which metric the model computes, it accepts different input fields. For example, Faithfulness measures how faithful the output is to the ground_truth (i.e. the provided context), given the input. Summarization, on the other hand, measures the quality of the output summary given the input (no ground_truth needed).
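
As a hedged illustration, the snippet below constructs the input fields each of these two metrics expects, using the evaluate_data call documented in the API section that follows (column names mirror the example there):

import pandas as pd
from lastmile.lib.auto_eval import AutoEval, Metric

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

# Faithfulness: needs input, output, and ground_truth (the context).
faithfulness_df = pd.DataFrame({
    "input": ["Where did the author grow up?"],
    "output": ["France"],
    "ground_truth": ["The author was born and raised in England."],
})
client.evaluate_data(data=faithfulness_df, metrics=[Metric(name="Faithfulness")])

# Summarization: needs only input and output; no ground_truth column.
summarization_df = pd.DataFrame({
    "input": ["<full document text>"],
    "output": ["<model-generated summary>"],
})
client.evaluate_data(data=summarization_df, metrics=[Metric(name="Summarization")])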

API

Reference models by their name or id (both are available from the console). For example, the Faithfulness model can be referenced by its name, or by its id cm2plr07q000ipkr4o8qhj4oe.

from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

query = "Where did the author grow up?"
expected_response = "England"
llm_response = "France"

# Evaluate data in a dataframe
data_result_df = client.evaluate_data(
    data=pd.DataFrame({
        "input": [query],
        "output": [llm_response],
        "ground_truth": [expected_response]
    }),
    metrics=[Metric(name="Faithfulness")]
)

# Evaluate data in a Dataset
# (dataset_id refers to an existing Dataset in AutoEval)
dataset_result_df = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=[Metric(id="cm2plr07q000ipkr4o8qhj4oe"), Metric(name="Summarization")]
)
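
Assuming, as the variable names suggest, that both calls return pandas DataFrames, a quick way to check the output is to print the frames; the exact per-metric score columns may vary, so inspect them directly:

# Inspect the evaluation results; the score column names depend on the metrics requested.
print(data_result_df.head())
print(dataset_result_df.head())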

You can reference any metric by its name as it appears in the Model Console. Accepted values include:

  • Metric(name="Faithfulness")
  • Metric(name="Relevance")
  • Metric(name="Summarization")
  • Metric(name="Toxicity")
  • Metric(name="Answer Correctness")

tip

You can use the same method to run inference on fine-tuned evaluator models. Simply refer to them by their name or id.
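
For example, reusing the client and variables from the API example above ("my-custom-evaluator" is a hypothetical name; substitute the name or id shown for your fine-tuned model in the Model Console):

# Run a fine-tuned evaluator exactly like a built-in metric.
custom_result_df = client.evaluate_data(
    data=pd.DataFrame({
        "input": [query],
        "output": [llm_response],
        "ground_truth": [expected_response]
    }),
    metrics=[Metric(name="my-custom-evaluator")]  # or Metric(id="<fine-tuned model id>")
)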

info

Since alBERTa 🍁 models are small and fast, you can run them online as guardrails. Learn more or follow this in-depth guide.
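
As a rough sketch of the guardrail pattern (assuming the same evaluate_data API as above; the "Faithfulness_score" column name and the 0.7 threshold are illustrative, so verify the actual result columns in your environment):

import pandas as pd
from lastmile.lib.auto_eval import AutoEval, Metric

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

def passes_faithfulness_guardrail(question: str, answer: str, context: str,
                                  threshold: float = 0.7) -> bool:
    """Score a single response online and block it if faithfulness is too low."""
    result_df = client.evaluate_data(
        data=pd.DataFrame({
            "input": [question],
            "output": [answer],
            "ground_truth": [context],
        }),
        metrics=[Metric(name="Faithfulness")],
    )
    # Assumes the result exposes a per-metric score column named "Faithfulness_score".
    return float(result_df["Faithfulness_score"].iloc[0]) >= threshold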