CASE STUDY

Sentinel

AI observability platform that monitors agent reliability and catches drift

Pydantic AI, LangFuse, FastAPI, PostgreSQL
  • 6 weeks
  • Solo Architect & Builder
  • Production

The Problem

AI systems degrade silently. A model that performed at 95% accuracy last month might be at 78% today — and nobody notices until customers complain. Most teams monitor infrastructure (CPU, memory, uptime) but not AI behavior (response quality, hallucination rates, retrieval relevance).

The gap is observability. Not infrastructure monitoring — AI-specific observability that tracks what matters: is the system still giving good answers?


The Architecture

Sentinel is an observability platform purpose-built for AI agent pipelines. It connects to any LLM-based system via lightweight instrumentation and provides three layers of monitoring:

Layer 1 — Trace Collection

Every LLM call, tool invocation, and agent decision is captured as a structured trace through LangFuse integration. Traces include full input/output pairs, latency, token usage, and custom metadata. The instrumentation adds less than 5ms overhead per call.
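As a rough illustration of what capturing a call as a structured trace can look like, here is a minimal standard-library sketch. It does not use the LangFuse SDK; `TraceRecord`, `TRACES`, `traced`, and `answer_question` are hypothetical names invented for this example, and token usage is left out because it depends on the model provider's response format.

```python
import time
import uuid
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable


@dataclass
class TraceRecord:
    """One captured call: input/output pair, latency, and custom metadata."""
    trace_id: str
    name: str
    input: dict[str, Any]
    output: Any
    latency_ms: float
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical in-memory sink; a real deployment would ship records to a trace backend.
TRACES: list[TraceRecord] = []


def traced(name: str, **metadata: Any) -> Callable:
    """Wrap a function so that every successful call appends a TraceRecord to TRACES."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000
            TRACES.append(TraceRecord(
                trace_id=uuid.uuid4().hex,
                name=name,
                input={"args": args, "kwargs": kwargs},
                output=result,
                latency_ms=latency_ms,
                metadata=dict(metadata),
            ))
            return result
        return wrapper
    return decorator


@traced("answer_question", domain="finance")
def answer_question(question: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {question}"
```

Calling `answer_question("What is EBITDA?")` returns the answer unchanged and leaves one record in `TRACES`. The wrapper itself only reads a clock and appends to a list, which is one way instrumentation overhead can be kept small.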

Layer 2 — Quality Evaluation

Automated evaluators run on sampled traces using Pydantic AI for structured output parsing. Evaluators check: factual grounding, response relevance, instruction adherence, and format compliance. Each evaluation produces a typed score that feeds into trend analysis.
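The sampling and typed-score idea can be sketched with the standard library alone. This is an assumption-laden illustration, not Sentinel's evaluator code: a plain dataclass stands in for the Pydantic AI output schema, and `EvalScore`, `sample_traces`, `mean_by_criterion`, and the 0 to 1 score range are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# The four checks named above, used here as hypothetical criterion identifiers.
CRITERIA = (
    "factual_grounding",
    "response_relevance",
    "instruction_adherence",
    "format_compliance",
)


@dataclass(frozen=True)
class EvalScore:
    """A typed score for one trace and one criterion (assumed range: 0.0 to 1.0)."""
    trace_id: str
    criterion: str
    score: float

    def __post_init__(self) -> None:
        if self.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {self.criterion!r}")
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise TypeError("score must be a number, not free text")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")


def sample_traces(trace_ids: list[str], rate: float, seed: Optional[int] = None) -> list[str]:
    """Keep each trace independently with probability `rate`."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < rate]


def mean_by_criterion(scores: list[EvalScore]) -> dict[str, float]:
    """Aggregate typed scores into one mean per criterion for trend analysis."""
    grouped: dict[str, list[float]] = {}
    for s in scores:
        grouped.setdefault(s.criterion, []).append(s.score)
    return {criterion: mean(values) for criterion, values in grouped.items()}
```

Because every score is validated at construction time, a free-text verdict such as "7/10" is rejected before it can reach the aggregation step.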

Layer 3 — Alerting & Dashboards

When quality metrics cross configurable thresholds, Sentinel fires alerts. These are not generic "something is wrong" notices but specific alerts like "retrieval relevance for the finance domain dropped below 85% over the last 4 hours." Dashboards are built on FastAPI endpoints with real-time WebSocket updates.
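A threshold check of this kind might look like the sketch below. It assumes that scores arrive as timestamped observations tagged with a domain and a metric, and that a rule compares the window mean against its threshold. `Observation`, `AlertRule`, and `check_rule` are hypothetical names, and the real alerting logic may aggregate differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Observation:
    timestamp: datetime
    domain: str
    metric: str
    value: float  # assumed to be normalized to 0.0..1.0


@dataclass(frozen=True)
class AlertRule:
    metric: str
    domain: str
    threshold: float
    window: timedelta


def check_rule(rule: AlertRule, observations: list[Observation], now: datetime) -> Optional[str]:
    """Return an alert message if the window mean is below the threshold, else None."""
    cutoff = now - rule.window
    values = [
        o.value
        for o in observations
        if o.metric == rule.metric
        and o.domain == rule.domain
        and cutoff <= o.timestamp <= now
    ]
    if not values:
        return None  # no data in the window: nothing to alert on in this sketch
    window_mean = mean(values)
    if window_mean >= rule.threshold:
        return None
    hours = rule.window.total_seconds() / 3600
    return (
        f"{rule.metric} for the {rule.domain} domain dropped below "
        f"{rule.threshold:.0%} over the last {hours:g} hours "
        f"(window mean {window_mean:.1%})"
    )
```

Treating an empty window as "no alert" is a design choice made for brevity; a production system would likely flag missing data separately.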


Technical Decisions

Why Pydantic AI for evaluations?
Structured, typed outputs. When an evaluator assesses response quality, I need a typed score object, not a string that might say "7/10" or "seven out of ten" or "pretty good." Pydantic AI enforces output schemas, making downstream aggregation reliable.

Why LangFuse over LangSmith for this project?
Open source and self-hostable. Sentinel is designed for enterprises that cannot send trace data to third-party SaaS. LangFuse runs in the client's infrastructure with full data sovereignty.

Why PostgreSQL?
Trace data is relational at its core: spans belong to traces, traces belong to sessions, sessions belong to users. PostgreSQL handles this naturally with JSONB columns for flexible metadata and proper indexing for time-series queries.
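The span, trace, session, user hierarchy can be sketched as a small relational schema. To keep the example runnable without a database server, SQLite (via Python's built-in `sqlite3`) stands in for PostgreSQL here, so the metadata column is JSON stored as text rather than JSONB. All table and column names, the demo rows, and the helper functions are assumptions made for illustration, not Sentinel's actual schema.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       user_id INTEGER NOT NULL REFERENCES users(id));
CREATE TABLE traces   (id INTEGER PRIMARY KEY,
                       session_id INTEGER NOT NULL REFERENCES sessions(id),
                       started_at TEXT NOT NULL);
CREATE TABLE spans (
    id         INTEGER PRIMARY KEY,
    trace_id   INTEGER NOT NULL REFERENCES traces(id),
    name       TEXT NOT NULL,
    latency_ms REAL NOT NULL,
    metadata   TEXT NOT NULL DEFAULT '{}'  -- would be a JSONB column in PostgreSQL
);
-- Index supporting time-range queries over traces.
CREATE INDEX idx_traces_started_at ON traces(started_at);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one made-up user, session, trace, and span."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'alice')")
    conn.execute("INSERT INTO sessions VALUES (1, 1)")
    conn.execute("INSERT INTO traces VALUES (1, 1, '2024-01-01T10:00:00')")
    conn.execute(
        "INSERT INTO spans (trace_id, name, latency_ms, metadata) VALUES (?, ?, ?, ?)",
        (1, "llm_call", 412.5, json.dumps({"model": "example-model"})),
    )
    conn.commit()
    return conn


def spans_for_user(conn: sqlite3.Connection, user_name: str, since: str) -> list[tuple]:
    """Walk spans -> traces -> sessions -> users for traces started at or after `since`."""
    rows = conn.execute(
        """
        SELECT s.name, s.latency_ms, s.metadata
        FROM spans s
        JOIN traces   t  ON t.id  = s.trace_id
        JOIN sessions se ON se.id = t.session_id
        JOIN users    u  ON u.id  = se.user_id
        WHERE u.name = ? AND t.started_at >= ?
        ORDER BY t.started_at
        """,
        (user_name, since),
    ).fetchall()
    return [(name, latency, json.loads(meta)) for name, latency, meta in rows]
```

ISO 8601 timestamps stored as text compare correctly as strings, which is why the range filter works in this stand-in; in PostgreSQL a `timestamptz` column would be the more natural choice.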

Results

  • Instrumentation Overhead: <5ms
  • Trace Coverage: 100%
  • Monitoring Layers: 3
  • Self-Hostable