Evaluation how-to guides
These guides answer “How do I…?” questions. They are goal-oriented and concrete, meant to help you complete a specific task. For conceptual explanations, see the Conceptual guide. For end-to-end walkthroughs, see the Tutorials. For comprehensive descriptions of every class and function, see the API reference.
Offline evaluation
Evaluate and improve your application before deploying it.
Run an evaluation
- Run an evaluation with the SDK
- Run an evaluation asynchronously
- Run an evaluation comparing two experiments
- Evaluate a `langchain` runnable
- Evaluate a `langgraph` graph
- Evaluate an existing experiment (Python only)
- Run an evaluation from the UI
- Run an evaluation via the REST API
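
To make the first item above concrete, here is a minimal sketch of running an evaluation with the Python SDK, assuming a recent `langsmith` version where `evaluate` is importable from the top-level package. `my_app`, the `correct` evaluator, and the dataset name are placeholders for your own application, evaluator, and dataset:

```python
from langsmith import evaluate

# Placeholder target: the application under test. It receives one dataset
# example's inputs dict and returns an outputs dict.
def my_app(inputs: dict) -> dict:
    question = inputs["question"]
    return {"answer": "Paris" if "France" in question else "I don't know"}

# Placeholder evaluator: compares a run's output to the example's reference output.
def correct(run, example) -> dict:
    return {
        "key": "correct",
        "score": int(run.outputs["answer"] == example.outputs["answer"]),
    }

results = evaluate(
    my_app,
    data="my-qa-dataset",          # name of an existing dataset (placeholder)
    evaluators=[correct],
    experiment_prefix="baseline",  # experiment names get a generated suffix
)
```

For coroutine targets, the SDK also exposes an async counterpart, `aevaluate`, which takes the same shape of arguments.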
Define an evaluator
- Define a custom evaluator
- Define an LLM-as-a-judge evaluator
- Define a pairwise evaluator
- Define a summary evaluator
- Use an off-the-shelf evaluator via the SDK (Python only)
- Evaluate an application's intermediate steps
- Return multiple metrics in one evaluator
- Return categorical vs numerical metrics
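
As a rough sketch of the shapes involved: the row-level evaluator below returns multiple metrics from one function, mixing a numerical score with a categorical one, and the summary evaluator aggregates across the whole experiment. The output and reference field names (`answer`) are assumptions about your dataset:

```python
from langsmith.schemas import Example, Run

# Row-level custom evaluator: scores one run against its reference example.
# Returning a "results" list lets one evaluator emit several metrics at once;
# a numeric "score" produces a numerical metric, a string "value" a categorical one.
def quality(run: Run, example: Example) -> dict:
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    match = predicted == expected
    return {
        "results": [
            {"key": "exact_match", "score": int(match)},
            {"key": "verdict", "value": "correct" if match else "incorrect"},
        ]
    }

# Summary evaluator: receives all runs and examples of the experiment
# and returns one experiment-level metric.
def pass_rate(runs: list[Run], examples: list[Example]) -> dict:
    passed = sum(
        run.outputs["answer"] == example.outputs["answer"]
        for run, example in zip(runs, examples)
    )
    return {"key": "pass_rate", "score": passed / len(runs)}
```

Row-level evaluators are passed via `evaluators=[...]` and summary evaluators via `summary_evaluators=[...]` when calling `evaluate`.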
Configure the evaluation data
Configure an evaluation job
- Evaluate with repetitions
- Handle model rate limits
- Print detailed logs (Python only)
- Run an evaluation locally (beta, Python only)
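
Repetitions and a concurrency cap are plain keyword arguments on `evaluate`; the sketch below reuses the placeholder `my_app` target and `correct` evaluator from the earlier example:

```python
from langsmith import evaluate

results = evaluate(
    my_app,
    data="my-qa-dataset",
    evaluators=[correct],
    num_repetitions=3,   # repeat each example 3 times to smooth out nondeterminism
    max_concurrency=4,   # limit parallel calls to respect model rate limits
)
```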
Unit testing
Unit test your system to identify bugs and regressions.
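
A minimal sketch of what such a test can look like, using plain pytest around the placeholder `my_app` target from the earlier sketch (LangSmith's testing integration can record such runs as experiments, but the assertions themselves are ordinary pytest):

```python
import pytest

# Assumes the placeholder my_app target from the earlier sketch is defined
# or importable in this module.
@pytest.mark.parametrize(
    ("question", "expected"),
    [("What is the capital of France?", "Paris")],
)
def test_my_app_answers_known_questions(question: str, expected: str) -> None:
    outputs = my_app({"question": question})
    assert expected in outputs["answer"]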
Online evaluation
Evaluate and monitor your system's live performance on production data.
Automatic evaluation
Set up evaluators that automatically run for all experiments against a dataset.
Analyzing experiment results
Use the UI and API to understand your experiment results.
- Compare experiments with the comparison view
- Filter experiments
- View pairwise experiments
- Fetch experiment results in the SDK
- Upload experiments run outside of LangSmith with the REST API
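
For example, experiment results can be pulled down with the SDK. Each experiment is stored as a tracing project named after the experiment; the experiment name below is a placeholder:

```python
from langsmith import Client

client = Client()

# is_root=True keeps only the top-level run per dataset example.
for run in client.list_runs(project_name="baseline-1a2b3c", is_root=True):
    print(run.id, run.inputs, run.outputs)
```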
Dataset management
Manage the LangSmith datasets used by your evaluations.
- Create a dataset from the UI
- Export a dataset from the UI
- Create a dataset split from the UI
- Filter examples from the UI
- Create a dataset with the SDK
- Fetch a dataset with the SDK
- Update a dataset with the SDK
- Version a dataset
- Share/unshare a dataset publicly
- Export filtered traces from an experiment to a dataset
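
As a minimal sketch of the SDK workflow, the snippet below creates a dataset, adds one example, and reads the examples back; the dataset name and example contents are placeholders:

```python
from langsmith import Client

client = Client()

# Create a dataset and add an example to it.
dataset = client.create_dataset(
    dataset_name="my-qa-dataset",
    description="Questions with reference answers for offline evaluation.",
)
client.create_examples(
    inputs=[{"question": "What is the capital of France?"}],
    outputs=[{"answer": "Paris"}],
    dataset_id=dataset.id,
)

# Fetch the examples back later, e.g. before passing the dataset name to evaluate().
examples = list(client.list_examples(dataset_name="my-qa-dataset"))
```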
Annotation queues and human feedback
Collect feedback from subject matter experts and users to improve your applications.
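
One programmatic piece of this workflow is attaching feedback to a traced run with the SDK; in this sketch the run ID and feedback key are placeholders:

```python
from langsmith import Client

client = Client()

# Record reviewer feedback against a traced run.
# Omit score for a purely free-form comment.
client.create_feedback(
    run_id="<run-id>",
    key="reviewer_correctness",
    score=1,
    comment="Subject matter expert confirmed the answer.",
)
```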