# Evaluations

Evaluations execute a task over a dataset and emit structured events, samples, and a final result.
## Key concepts

- `Evaluation` orchestrates execution of a task against a dataset.
- `@dn.evaluation` wraps a task into an `Evaluation`.
- `EvalEvent` (`EvalStart`, `EvalSample`, `EvalEnd`) streams progress.
- `Sample` holds per-row input, output, and metrics.
- `EvalResult` aggregates metrics, pass/fail stats, and stop reasons.
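To make the event flow concrete, here is a self-contained sketch of the lifecycle those types imply (one `EvalStart`, one `EvalSample` per dataset row, one `EvalEnd`). The class shapes below are illustrative stand-ins, not the library's real definitions:

```python
from dataclasses import dataclass


@dataclass
class EvalStart:
    total: int  # number of rows to run


@dataclass
class EvalSample:
    index: int
    output: str  # per-row task output


@dataclass
class EvalEnd:
    pass_rate: float  # aggregate result


def lifecycle(dataset: list[str]) -> list:
    # One start event, one sample event per row, one end event.
    events: list = [EvalStart(total=len(dataset))]
    events += [EvalSample(index=i, output=row) for i, row in enumerate(dataset)]
    events.append(EvalEnd(pass_rate=1.0))
    return events


events = lifecycle(["row one", "row two"])
```

The real stream delivers these incrementally, which is what the streaming example later on this page consumes.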
## Create an evaluation with the decorator

```python
import dreadnode as dn
from dreadnode.scorers import contains


@dn.evaluation(
    dataset=[
        {"question": "What is Dreadnode?", "expected": "agent platform"},
        {"question": "What is an evaluation?", "expected": "dataset-driven"},
    ],
    scorers=[contains("agent platform")],
    assert_scores=["contains"],
    concurrency=4,
    max_errors=2,
)
async def answer(question: str, expected: str) -> str:
    return f"{question} -> {expected}"


result: dn.EvalResult = await answer.run()
print(result.pass_rate, len(result.samples))
```

## Load from a dataset file with preprocessing
`dataset_file` accepts JSONL, CSV, JSON, or YAML. Use `preprocessor` to normalize data before scoring, and `dataset_input_mapping` to align dataset keys with task parameters.
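A framework-free sketch of what these two options do together, assuming `dataset_input_mapping` maps dataset keys to task parameter names. The names `normalize`, `prompt`, and `question` mirror the example below; the manual renaming here is for illustration only, since the library applies the mapping for you:

```python
def normalize(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    # Preprocessor: clean raw rows before scoring.
    return [{"prompt": row["prompt"].strip()} for row in rows]


raw_rows = [{"prompt": "  What is Dreadnode?  "}]
clean_rows = normalize(raw_rows)

# dataset_input_mapping={"prompt": "question"}: rename the dataset's
# "prompt" key to the task's "question" parameter.
mapping = {"prompt": "question"}
task_kwargs = [{mapping[key]: value for key, value in row.items()} for row in clean_rows]
print(task_kwargs)  # [{'question': 'What is Dreadnode?'}]
```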
```python
from pathlib import Path

import dreadnode as dn


def normalize(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    return [{"prompt": row["prompt"].strip()} for row in rows]


evaluation = dn.Evaluation(
    task="my_project.tasks.generate_answer",
    dataset_file=Path("data/eval.jsonl"),
    dataset_input_mapping={"prompt": "question"},
    preprocessor=normalize,
    concurrency=8,
)

result = await evaluation.run()
```

## Stream events during execution
```python
import dreadnode as dn

async with evaluation.stream() as events:
    async for event in events:
        if isinstance(event, dn.EvalEvent):
            print(type(event).__name__)
```
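The same consumption pattern works against any async iterator. A self-contained sketch (stand-in event classes instead of the real `dreadnode` types, and a plain async generator instead of `evaluation.stream()`) that counts per-sample events:

```python
import asyncio


class EvalStart: ...
class EvalSample: ...
class EvalEnd: ...


async def fake_stream():
    # Stand-in for evaluation.stream(): one start, two samples, one end.
    for event in (EvalStart(), EvalSample(), EvalSample(), EvalEnd()):
        yield event


async def count_samples() -> int:
    count = 0
    async for event in fake_stream():
        if isinstance(event, EvalSample):
            count += 1
    return count


sample_count = asyncio.run(count_samples())
print(sample_count)  # 2
```

Filtering by event class like this is how you would surface per-sample progress while ignoring the start and end markers.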