Open Source

Cherry Evals

Search, cherry-pick, and export examples from public AI evaluation datasets. Build custom eval suites for your models — from the terminal, the API, or your agent.

Stop writing one-off scripts for every dataset.

Building custom eval suites shouldn't mean writing one-off scripts for every dataset and format. Cherry Evals gives you a unified interface to search, curate, and export from any supported benchmark — so you can focus on what actually matters: evaluating your models.

Everything you need to build eval suites.

Search

Keyword, semantic, and hybrid search across multiple benchmark datasets — MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA, ARC, and more. Find exactly the examples you need.
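To make the hybrid mode concrete, here is a minimal illustrative sketch of blending keyword and semantic relevance. This is a toy scoring function, not Cherry Evals' actual implementation (which stores vectors in Qdrant); the function names, toy embeddings, and 50/50 weighting are assumptions for illustration only.

```python
# Illustrative hybrid-search scoring: blend keyword overlap with vector
# similarity. A sketch of the concept, not cherry-evals' code.
import math

def keyword_score(query, text):
    """Fraction of query terms that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, text, query_vec, text_vec, alpha=0.5):
    """Weighted blend: alpha * semantic + (1 - alpha) * keyword."""
    return alpha * cosine(query_vec, text_vec) + (1 - alpha) * keyword_score(query, text)
```

Real hybrid retrieval typically uses BM25 for the lexical side and learned embeddings for the semantic side, but the blending idea is the same.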

Cherry-pick

Curate custom collections by selecting individual examples from search results. Build targeted eval suites for specific skills, domains, or difficulty levels.

Export

Download collections as JSON, JSONL, or CSV. Push directly to Langfuse for tracing and evaluation. Your data, your format.
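As a sketch of what JSONL and CSV exports of a curated collection might look like, here is a small standalone example. The record fields shown are invented for illustration; the actual export schema is defined by the tool.

```python
# Illustrative export of a curated collection to JSONL and CSV.
# The example records and field names are assumptions, not the real schema.
import csv
import io
import json

collection = [
    {"dataset": "gsm8k", "id": "q-001", "question": "What is 7 * 8?", "answer": "56"},
    {"dataset": "mmlu", "id": "q-002", "question": "Capital of France?", "answer": "Paris"},
]

def to_jsonl(records):
    """One JSON object per line -- the usual format for eval pipelines."""
    return "\n".join(json.dumps(r) for r in records)

def to_csv(records):
    """CSV with a header row derived from the first record's keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

JSONL tends to be the most pipeline-friendly of the three, since each line is independently parseable and streams well.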

Use it however you work.

Cherry Evals is built for both humans and AI agents. Every capability is available through all four interfaces.

REST API

FastAPI-powered HTTP endpoints for programmatic access. Integrate search and export into any pipeline or workflow.
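As a sketch of how a client might construct a search request, here is a small example that only builds the query URL. The /search path and its parameters are assumptions based on the feature list, not the documented API; check the README for the real routes.

```python
# Sketch of building a search request for a hypothetical GET /search
# endpoint. Endpoint path and parameter names are assumptions.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"

def search_url(query, mode="hybrid", dataset=None, limit=10):
    """Build the query URL for a hypothetical search endpoint."""
    params = {"q": query, "mode": mode, "limit": limit}
    if dataset:
        params["dataset"] = dataset
    return f"{BASE_URL}/search?{urlencode(params)}"
```

FastAPI apps also serve interactive docs at /docs by default, which is the quickest way to discover the actual routes and parameters once the server is running.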

CLI

Full-featured command-line interface built with Click. Ingest datasets, run searches, and manage collections from your terminal.

MCP Server

Model Context Protocol server for AI agents. Let Claude, GPT, or any MCP-compatible agent search and curate evals autonomously.

Web UI

React-based interface for interactive search and curation. Browse results visually, build collections, and export with a click.

All major benchmarks, one interface.

Ingest any of the supported datasets and search across them uniformly. More datasets added regularly.

MMLU · HumanEval · GSM8K · HellaSwag · TruthfulQA · ARC (more coming soon)

Up and running in minutes.

Self-hosted, no account required. Requires uv and Docker Compose.

# Clone the repository
git clone https://github.com/marinone94/cherry-evals.git
cd cherry-evals

# Install dependencies (requires uv)
uv sync

# Start Postgres + Qdrant
docker compose up -d

# Run database migrations
uv run alembic upgrade head

# Ingest the MMLU benchmark dataset
uv run python -m cherry_evals.cli ingest mmlu

# Generate embeddings for semantic search
uv run python -m cherry_evals.cli embed mmlu

# Start the API server
uv run fastapi dev api/main.py

The API is now available at http://localhost:8000. See the README for full configuration options.