Evaluating LLM Outputs at Scale with Python

Why Evaluation Matters

Vibe-checking model outputs doesn’t scale. When you’re running multi-agent pipelines that execute thousands of LLM calls per day, you need reproducible, automated evaluation, not just spot checks.

The Harness

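from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Placeholders: in a real pipeline these come from the agent run under test.
agent_response = "..."      # the agent's final answer (string)
retrieved_chunks = ["..."]  # context strings returned by the retriever

# Both metrics use an LLM judge; a score below the threshold fails the case.
# Reusing 0.7 for faithfulness is an assumption; tune it per metric.
metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    FaithfulnessMetric(threshold=0.7, model="gpt-4o"),
]

test_case = LLMTestCase(
    input="What is PgBouncer used for?",
    actual_output=agent_response,
    retrieval_context=retrieved_chunks,  # needed by FaithfulnessMetric
)

evaluate([test_case], metrics)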
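
Once per-case scores exist, they usually need to be rolled up into one pass/fail signal for CI. The sketch below shows one way that could look. It is deliberately independent of DeepEval: `CaseResult`, `pass_rates`, `gate`, the 0.9 default pass rate, and the demo scores are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One metric score for one test case (hypothetical record shape)."""
    case_id: str
    metric: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # A case passes when its score meets or exceeds the metric threshold.
        return self.score >= self.threshold


def pass_rates(results: list[CaseResult]) -> dict[str, float]:
    """Return the fraction of passing cases for each metric."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for r in results:
        totals[r.metric] = totals.get(r.metric, 0) + 1
        passes[r.metric] = passes.get(r.metric, 0) + int(r.passed)
    return {metric: passes[metric] / totals[metric] for metric in totals}


def gate(results: list[CaseResult], min_pass_rate: float = 0.9) -> list[str]:
    """Return the metrics whose pass rate falls below min_pass_rate."""
    rates = pass_rates(results)
    return sorted(metric for metric, rate in rates.items() if rate < min_pass_rate)


if __name__ == "__main__":
    # Made-up scores, only to show the shape of the data.
    demo = [
        CaseResult("q1", "answer_relevancy", 0.82, 0.7),
        CaseResult("q2", "answer_relevancy", 0.91, 0.7),
        CaseResult("q1", "faithfulness", 0.55, 0.7),
        CaseResult("q2", "faithfulness", 0.74, 0.7),
    ]
    failing = gate(demo)
    print("failing metrics:", failing)
```

A CI job could exit non-zero whenever `gate` returns a non-empty list, so a regression on any single metric blocks the merge.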

Metrics I Track

Metric	Tool	What It Catches
Answer Relevancy	DeepEval	Off-topic or non-responsive answers
Faithfulness	DeepEval	Claims not grounded in context
Latency P95	OpenLit	Slow model/node bottlenecks
Token cost	OpenLit	Budget drift across runs
Groundedness	Custom rubric	Domain-specific factual accuracy
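
For the two operational rows in the table, here is a minimal sketch of the arithmetic behind a P95 latency figure and a budget-drift check. It assumes the raw latencies and per-run cost totals have already been exported from the telemetry backend. The function names are hypothetical, and nearest-rank is only one of several percentile definitions.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def cost_drift(baseline_cost: float, current_cost: float) -> float:
    """Relative change in cost versus a baseline run (0.25 means +25%)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (current_cost - baseline_cost) / baseline_cost


if __name__ == "__main__":
    latencies_ms = [float(n) for n in range(1, 21)]  # synthetic samples
    print("p95 latency (ms):", percentile_nearest_rank(latencies_ms, 95))
    print("cost drift:", cost_drift(baseline_cost=10.0, current_cost=12.5))
```

Grafana and most metrics backends compute percentiles with their own interpolation rules, so small differences from this nearest-rank version are expected.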

OpenLit Integration

import openlit

openlit.init(otlp_endpoint="http://localhost:4318")
# All downstream LLM calls are now auto-instrumented

Every call in the agent graph gets a span with model name, prompt tokens, completion tokens, latency, and cost, all queryable from Grafana.
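
If the spans are also exported as plain records, the same fields can be rolled up per model offline. The sketch below assumes each span has already been flattened into a dict with the keys `model`, `prompt_tokens`, `completion_tokens`, and `cost`. Those key names are hypothetical and are not OpenLit's actual attribute names.

```python
from collections import defaultdict


def summarize_by_model(spans: list[dict]) -> dict[str, dict[str, float]]:
    """Aggregate call count, total tokens, and total cost per model."""
    summary: dict[str, dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "cost": 0.0}
    )
    for span in spans:
        row = summary[span["model"]]
        row["calls"] += 1
        row["tokens"] += span["prompt_tokens"] + span["completion_tokens"]
        row["cost"] += span["cost"]
    return dict(summary)


if __name__ == "__main__":
    # Synthetic spans, only to show the expected record shape.
    spans = [
        {"model": "gpt-4o", "prompt_tokens": 100, "completion_tokens": 50, "cost": 0.5},
        {"model": "gpt-4o", "prompt_tokens": 200, "completion_tokens": 25, "cost": 0.25},
        {"model": "gpt-4o-mini", "prompt_tokens": 80, "completion_tokens": 20, "cost": 0.125},
    ]
    for model, row in summarize_by_model(spans).items():
        print(model, row)
```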

Key Insight

Faithfulness degrades significantly when retrieved chunk size exceeds ~800 tokens with GPT-4o-mini. Chunking strategy has more impact on RAG quality than model choice in most real-world workloads.
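
One way to act on this is to cap chunk size at ingestion time. The sketch below splits text into windows of at most `max_tokens` tokens with optional overlap. As a simplifying assumption it counts whitespace-separated words as tokens, so a real pipeline would swap in the model's tokenizer to get accurate counts. The function name and the overlap parameter are hypothetical.

```python
def chunk_by_tokens(text: str, max_tokens: int = 800, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace tokens.

    Consecutive chunks share `overlap` tokens so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()  # assumption: whitespace words stand in for tokens
    step = max_tokens - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    for chunk in chunk_by_tokens(sample, max_tokens=4, overlap=1):
        print(repr(chunk))
```

Running the faithfulness metric against the same questions at a few different `max_tokens` settings is a simple way to check whether the threshold holds for a given corpus.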
