Open Source

Cherry Evals

Search, cherry-pick, and export examples from public AI evaluation datasets. Build custom eval suites for your models — from the terminal, the API, or your agent.

Stop writing one-off scripts for every dataset.

Building custom eval suites shouldn't mean writing one-off scripts for every dataset and format. Cherry Evals gives you a unified interface to search, curate, and export from any supported benchmark — so you can focus on what actually matters: evaluating your models.

Everything you need to build eval suites.

Search

Keyword, semantic, and hybrid search across multiple benchmark datasets — MMLU, HumanEval, GSM8K, HellaSwag, TruthfulQA, ARC, and more. Find exactly the examples you need.
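To make the hybrid mode concrete, here is a minimal illustrative sketch of blending keyword and semantic relevance. This is a toy scoring function, not Cherry Evals' actual implementation (which stores vectors in Qdrant); the function names, toy embeddings, and 50/50 weighting are assumptions for illustration only.

```python
# Illustrative hybrid-search scoring: blend keyword overlap with vector
# similarity. A sketch of the concept, not cherry-evals' code.
import math

def keyword_score(query, text):
    """Fraction of query terms that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, text, query_vec, text_vec, alpha=0.5):
    """Weighted blend: alpha * semantic + (1 - alpha) * keyword."""
    return alpha * cosine(query_vec, text_vec) + (1 - alpha) * keyword_score(query, text)
```

Real hybrid retrieval typically uses BM25 for the lexical side and learned embeddings for the semantic side, but the blending idea is the same.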

Cherry-pick

Curate custom collections by selecting individual examples from search results. Build targeted eval suites for specific skills, domains, or difficulty levels.

Export

Download collections as JSON, JSONL, or CSV. Push directly to Langfuse for tracing and evaluation. Your data, your format.
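As a sketch of what JSONL and CSV exports of a curated collection might look like, here is a small standalone example. The record fields shown are invented for illustration; the actual export schema is defined by the tool.

```python
# Illustrative export of a curated collection to JSONL and CSV.
# The example records and field names are assumptions, not the real schema.
import csv
import io
import json

collection = [
    {"dataset": "gsm8k", "id": "q-001", "question": "What is 7 * 8?", "answer": "56"},
    {"dataset": "mmlu", "id": "q-002", "question": "Capital of France?", "answer": "Paris"},
]

def to_jsonl(records):
    """One JSON object per line -- the usual format for eval pipelines."""
    return "\n".join(json.dumps(r) for r in records)

def to_csv(records):
    """CSV with a header row derived from the first record's keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

JSONL tends to be the most pipeline-friendly of the three, since each line is independently parseable and streams well.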

Use it however you work.

Cherry Evals is built for both humans and AI agents. Every capability is available through all four interfaces.

REST API

FastAPI-powered HTTP endpoints for programmatic access. Integrate search and export into any pipeline or workflow.
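As a sketch of how a client might construct a search request, here is a small example that only builds the query URL. The /search path and its parameters are assumptions based on the feature list, not the documented API; check the README for the real routes.

```python
# Sketch of building a search request for a hypothetical GET /search
# endpoint. Endpoint path and parameter names are assumptions.
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"

def search_url(query, mode="hybrid", dataset=None, limit=10):
    """Build the query URL for a hypothetical search endpoint."""
    params = {"q": query, "mode": mode, "limit": limit}
    if dataset:
        params["dataset"] = dataset
    return f"{BASE_URL}/search?{urlencode(params)}"
```

FastAPI apps also serve interactive docs at /docs by default, which is the quickest way to discover the actual routes and parameters once the server is running.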

CLI

Full-featured command-line interface built with Click. Ingest datasets, run searches, and manage collections from your terminal.

MCP Server

Model Context Protocol server for AI agents. Let Claude, GPT, or any MCP-compatible agent search and curate evals autonomously.

Web UI

React-based interface for interactive search and curation. Browse results visually, build collections, and export with a click.

All major benchmarks, one interface.

Ingest any of the supported datasets and search across them uniformly. More datasets added regularly.

MMLU · HumanEval · GSM8K · HellaSwag · TruthfulQA · ARC (more coming soon)

Up and running in minutes.

Self-hosted, no account required. Requires uv and Docker Compose.

# Clone the repository
git clone https://github.com/marinone94/cherry-evals.git
cd cherry-evals

# Install dependencies (requires uv)
uv sync

# Start Postgres + Qdrant
docker compose up -d

# Run database migrations
uv run alembic upgrade head

# Ingest the MMLU benchmark dataset
uv run python -m cherry_evals.cli ingest mmlu

# Generate embeddings for semantic search
uv run python -m cherry_evals.cli embed mmlu

# Start the API server
uv run fastapi dev api/main.py

The API is now available at http://localhost:8000. See the README for full configuration options.