Developer quickstart

Let's get you started with AutoEval

The AutoEval platform is accessible through both a UI and an API, and provides LLM judge labeling, evaluator model fine-tuning, and inference for evals and guardrails. It is the only service that lets you design custom metrics by fine-tuning evaluator models.

Let's walk you through the basics of the platform.

Account Setup

First, you'll need a LastMile AI token to get started.

  1. Sign up for a free LastMile AI account
  2. Generate an API key in the dashboard

Set API Key

Export the API key you generated as an environment variable. Alternatively, you can pass it directly to the LastMile client.

export LASTMILE_API_TOKEN="your_api_key"
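
For example:

from lastmile.lib.auto_eval import AutoEval

# Pass the API key explicitly instead of relying on LASTMILE_API_TOKEN
client = AutoEval(api_token="your_api_key")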

Run your first Evals

Let's evaluate an AI application interaction now. All interactions have some combination of the following fields (see the sketch after this list):

  • input: Input to the application (e.g. a user question for a Q&A system)
  • output: The response generated by the application (e.g. LLM generation)
  • ground_truth: Factual data, either the ideal correct response, or context used to respond (e.g. data retrieved from a vector DB)
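
For example, a single Q&A interaction could be represented like this (illustrative values; the field names match the columns used in the API example further down):

interaction = {
    "input": "Where did the author grow up?",   # user question
    "output": "France",                         # response generated by the app
    "ground_truth": "England",                  # ideal correct response
}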

Model Console

Navigate to the Model Console in the AutoEval dashboard, and click on the Faithfulness model. Enter values for input, output and ground_truth, and click Run Model to calculate a score:

API

We provide a REST API surface as well as client SDKs in Python and Node.js. We recommend using the client SDKs for a streamlined experience. For more information, see the API section.

First, let's install the requisite packages:

pip install lastmile

Then run a single Faithfulness evaluation with the Python SDK:
from lastmile.lib.auto_eval import AutoEval, BuiltinMetrics
import pandas as pd

client = AutoEval(api_token="api_token_if_LASTMILE_API_TOKEN_not_set")

query = "Where did the author grow up?"
expected_response = "England"
llm_response = "France"

eval_result = client.evaluate_data(
    data=pd.DataFrame({
        "input": [query],
        "output": [llm_response],
        "ground_truth": [expected_response]
    }),
    metrics=[BuiltinMetrics.FAITHFULNESS]
)

print(eval_result)
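
Because evaluate_data takes a pandas DataFrame, you can score a whole batch of interactions in one call. A minimal sketch, assuming the single-row pattern above generalizes to multiple rows and that LASTMILE_API_TOKEN is set:

from lastmile.lib.auto_eval import AutoEval, BuiltinMetrics
import pandas as pd

client = AutoEval()

# One row per interaction; each metric is scored per row.
batch = pd.DataFrame({
    "input": ["Where did the author grow up?", "Who wrote the essay?"],
    "output": ["France", "Paul Graham"],
    "ground_truth": ["England", "Paul Graham"]
})

batch_results = client.evaluate_data(
    data=batch,
    metrics=[BuiltinMetrics.FAITHFULNESS]
)
print(batch_results)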

You can also reference any metric by its name as it appears in the Model Console (see the sketch after this list). Accepted values include:

  • Metric(name="Faithfulness")
  • Metric(name="Relevance")
  • Metric(name="Summarization")
  • Metric(name="Toxicity")
  • Metric(name="Answer Correctness")

Congrats on running your first evaluation! The platform can do a lot more, so head to the sections and guides below to continue your AutoEval journey.

tip

Don't have app data handy? No problem - check out our Example Datasets for synthetic data you can use to try out the platform.