
Evaluating LLM Outputs at Scale with Python

A practical walkthrough of the evaluation harness I built to benchmark LLM response quality, latency, and cost across multiple models — using DeepEval, custom rubric scorers, and OpenLit for observability.

Why Evaluation Matters

Vibe-checking model outputs doesn’t scale. When you’re running multi-agent pipelines that execute thousands of LLM calls per day, you need reproducible, automated evaluation — not just spot checks.

The Harness

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")

# agent_response and retrieved_chunks come from the pipeline under test
test_case = LLMTestCase(
    input="What is PgBouncer used for?",
    actual_output=agent_response,
    retrieval_context=retrieved_chunks,
)

evaluate([test_case], [relevancy, faithfulness])

Metrics I Track

| Metric           | Tool          | What It Catches                  |
|------------------|---------------|----------------------------------|
| Answer Relevancy | DeepEval      | Off-topic hallucinations         |
| Faithfulness     | DeepEval      | Claims not grounded in context   |
| Latency P95      | OpenLit       | Slow model/node bottlenecks      |
| Token cost       | OpenLit       | Budget drift across runs         |
| Groundedness     | Custom rubric | Domain-specific factual accuracy |
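The groundedness rubric is the only scorer in the table not backed by a library. A minimal sketch of the idea, using a simple claim-overlap heuristic — the function name and the 50%-overlap scoring rule here are illustrative stand-ins, not the production rubric:

```python
import re


def groundedness_score(answer: str, context_chunks: list[str]) -> float:
    """Fraction of answer sentences that share enough vocabulary with
    at least one retrieved chunk. A crude stand-in for an LLM-graded rubric."""
    sentences = [s for s in re.split(r"[.!?]+\s*", answer) if s.strip()]
    if not sentences:
        return 0.0

    chunk_vocab = [set(c.lower().split()) for c in context_chunks]
    grounded = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        # A sentence counts as grounded if at least half its words
        # appear in some single retrieved chunk
        if any(len(words & vocab) >= max(1, len(words) // 2) for vocab in chunk_vocab):
            grounded += 1
    return grounded / len(sentences)


score = groundedness_score(
    "PgBouncer is a connection pooler for Postgres.",
    ["PgBouncer is a lightweight connection pooler for PostgreSQL."],
)
```

A lexical-overlap heuristic like this is cheap enough to run on every call; the real rubric can swap in an LLM judge behind the same function signature.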

OpenLit Integration

import openlit

openlit.init(otlp_endpoint="http://localhost:4318")
# All downstream LLM calls are now auto-instrumented

Every call in the agent graph gets a span with model name, prompt tokens, completion tokens, latency, and cost — queryable from Grafana.

Key Insight

Faithfulness degrades significantly when retrieved chunk size exceeds ~800 tokens with GPT-4o-mini. Chunking strategy has more impact on RAG quality than model choice in most real-world workloads.
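If chunk size drives faithfulness, it is worth capping it explicitly at ingestion time. A minimal sketch of an overlapping word-capped chunker — word count is a crude proxy for tokens here (a real pipeline would count with the model's tokenizer, e.g. tiktoken), and the 800 cap echoes the threshold above:

```python
def chunk_text(text: str, max_tokens: int = 800, overlap: int = 80) -> list[str]:
    """Split text into overlapping chunks of at most max_tokens words.
    Consecutive chunks share `overlap` words so claims spanning a
    boundary stay retrievable."""
    words = text.split()
    if not words:
        return []

    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks


chunks = chunk_text("word " * 2000, max_tokens=800, overlap=80)
```

Keeping the cap as a config parameter makes it easy to re-run the faithfulness benchmark across chunk sizes and find the knee in the curve for a given model.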
