CASE STUDY

Athena

Multi-agent BI system that turns natural language into verified analytical insights

LangGraph GPT-4o Qdrant BM25 Streamlit LangSmith
  • 10 weeks
  • Solo Architect & Builder
  • Production

The Problem

Business intelligence is broken for most organizations. Analysts spend 60% of their time finding data, not analyzing it. Self-serve BI tools promised democratization but delivered complexity — dashboards that only their creators understand, SQL queries that take days to write, and insights that arrive too late to act on.

The client needed a system where any stakeholder could ask a business question in plain English and get a verified, sourced analytical answer — not a hallucinated guess from a chatbot.


The Architecture

Athena is a multi-agent system built on LangGraph with 4 coordinated AI agents, each with a distinct role:

Agent 1 — The Router

Classifies incoming queries by intent, complexity, and required data sources. Routes simple lookups to the fast path, complex analytical questions to the full pipeline. This alone cuts median response time by 40% for routine questions.
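Independent of LangGraph's actual graph API, the routing decision itself is a pure function over the classified query. A minimal sketch — the intent labels, complexity threshold, and field names here are illustrative assumptions, not the production values:

```python
from dataclasses import dataclass

# Illustrative fast-path intents; the real classifier and labels are assumptions.
FAST_PATH_INTENTS = {"lookup", "definition", "status"}

@dataclass
class ClassifiedQuery:
    text: str
    intent: str          # e.g. "lookup" or "analysis"
    complexity: float    # 0.0 (trivial) .. 1.0 (hard)
    sources: list[str]   # data sources the query needs

def route(q: ClassifiedQuery) -> str:
    """Return the pipeline branch a query should take."""
    if q.intent in FAST_PATH_INTENTS and q.complexity < 0.3 and len(q.sources) <= 1:
        return "fast_path"       # single retrieval + direct answer
    return "full_pipeline"       # Retriever -> Analyst -> Validator

print(route(ClassifiedQuery("What is SKU-4421?", "lookup", 0.1, ["catalog"])))
```

In LangGraph terms, a function like this would back a conditional edge out of the Router node, so the branch taken is always an explicit, inspectable decision.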

Agent 2 — The Retriever

Hybrid search across structured and unstructured enterprise data using Qdrant (vector) + BM25 (keyword). The dual approach ensures high recall for exact-match queries (product IDs, dates) while maintaining semantic understanding for fuzzy natural language questions.

Agent 3 — The Analyst

Takes retrieved context and generates analytical insights using GPT-4o. This is not raw summarization — the agent applies analytical frameworks: trend detection, anomaly flagging, comparative analysis. Every claim is grounded in specific data points with citations.
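As one illustration of what "anomaly flagging" means here — a stand-in for the analytical framing the Analyst's prompt encodes, not the actual implementation — a z-score flag over a metric series looks like:

```python
from statistics import mean, stdev

def flag_anomalies(series: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # constant series: nothing to flag
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]

revenue = [102, 98, 105, 101, 97, 180, 103]  # illustrative weekly values
print(flag_anomalies(revenue))
```

The spike at index 5 gets flagged; the agent then has a concrete data point to cite rather than a vague "revenue looks unusual."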

Agent 4 — The Validator

Reviews the Analyst's output against source data. Checks numerical accuracy, flags unsupported claims, and adds confidence scores. This agent is the reason Athena's outputs are trustworthy — it catches the hallucinations before users see them.
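The core of the numerical-accuracy check can be sketched as grounding every numeral in the claim against the source data. This is a heavily simplified version — the real Validator reasons over structure and units — but the idea reduces to:

```python
import re

def unsupported_numbers(claim: str, source_text: str) -> list[str]:
    """Numbers asserted in a claim that never appear in the source data."""
    number = re.compile(r"\d+(?:\.\d+)?")
    source_numbers = set(number.findall(source_text))
    return [n for n in number.findall(claim) if n not in source_numbers]

source = "Q3 revenue: 4.2M; Q2 revenue: 3.9M"
claim = "Revenue grew from 3.9M to 4.5M quarter over quarter"
print(unsupported_numbers(claim, source))
```

Any number the source cannot account for ("4.5" above) becomes an unsupported claim to flag, lowering the answer's confidence score or sending it back for revision.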


Technical Decisions

Why LangGraph over CrewAI or AutoGen?

Explicit state management. LangGraph gives me a directed graph with typed state, conditional edges, and checkpoint persistence. When Agent 3 fails, I need deterministic retry logic — not an autonomous agent deciding what to do next. Production systems need predictable failure modes.
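"Deterministic retry" here means a fixed, inspectable policy rather than an agent improvising. A minimal sketch, independent of LangGraph's checkpointing API (the step and state shapes are illustrative assumptions):

```python
import time

def run_with_retry(step, state, max_attempts: int = 3, base_delay: float = 0.0):
    """Run a pipeline step under a fixed, predictable retry policy."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step(state)
        except Exception as exc:              # broad catch: sketch only
            last_error = exc
            time.sleep(base_delay * attempt)  # linear backoff between attempts
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error

calls = {"n": 0}
def flaky(state):
    """Simulated step that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return state | {"done": True}

print(run_with_retry(flaky, {"query": "q"}))
```

The point is that the failure mode is written down: three attempts, linear backoff, then a hard error — never an open-ended "figure it out" loop.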
Why Qdrant + BM25 hybrid, not just vector search?

Vector search alone misses exact matches. When a user asks "Q3 2024 revenue for product SKU-4421," pure semantic search might return similar products. BM25 catches the exact SKU. The hybrid retriever uses reciprocal rank fusion to combine both result sets.
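Reciprocal rank fusion is a small, standard algorithm: each document scores the sum of 1/(k + rank) across the lists it appears in, so documents ranked well by both retrievers rise to the top. A self-contained sketch (the document IDs are illustrative):

```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked result lists; each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_sim_1", "SKU-4421", "doc_sim_2"]  # semantic neighbors
bm25_hits   = ["SKU-4421", "doc_kw_1"]                # exact keyword matches
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

The exact-match SKU document appears in both lists, so it outranks documents that only one retriever surfaced — which is precisely the behavior the hybrid approach is after.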
Why Streamlit for the frontend?

Speed to production. The client needed a working interface in weeks, not months. Streamlit let me ship a functional, polished UI while focusing engineering time on the agent pipeline — which is where the actual complexity lives.

Observability

Every agent interaction is traced end-to-end through LangSmith. I can see:

  • Token usage per agent per query
  • Retrieval relevance scores
  • Validator rejection rates and reasons
  • Latency breakdown by pipeline stage
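The kind of aggregation those traces enable can be sketched over plain trace records — the record schema below is an assumption for illustration, not LangSmith's actual export format:

```python
from collections import defaultdict

traces = [  # illustrative trace records, not LangSmith's schema
    {"agent": "retriever", "latency_ms": 420,  "rejected": False},
    {"agent": "analyst",   "latency_ms": 3100, "rejected": False},
    {"agent": "validator", "latency_ms": 650,  "rejected": True},
    {"agent": "validator", "latency_ms": 610,  "rejected": False},
]

# Latency breakdown by pipeline stage (mean ms per agent).
latency = defaultdict(list)
for t in traces:
    latency[t["agent"]].append(t["latency_ms"])
breakdown = {agent: sum(ms) / len(ms) for agent, ms in latency.items()}

# Validator rejection rate.
validator_runs = [t for t in traces if t["agent"] == "validator"]
rejection_rate = sum(t["rejected"] for t in validator_runs) / len(validator_runs)

print(breakdown)
print(f"validator rejection rate: {rejection_rate:.0%}")
```

Rolling these aggregates per data domain and per day is what turns raw traces into the "know within hours" feedback loop described above.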

This is not logging — it is structured observability that feeds back into system optimization. When retrieval quality drops for a specific data domain, I know within hours, not weeks.


Results

  • 4 coordinated AI agents
  • 94% retrieval relevance
  • 97% of hallucinations caught by the Validator
  • <8s median response time