You type a message in Slack at 2 AM: "prod is down, anyone know why?"
You're staring at five browser tabs — Datadog, Kibana, PagerDuty, Grafana, and a Slack thread that's moving faster than you can read. What if the first hour of investigation could happen automatically, while you were still pouring your coffee?
That's the problem I set out to solve. I built an AI Incident Response Platform — a system that ingests incident data, invokes a multi-agent AI workflow, and automatically produces structured Root Cause Analysis: what broke, how to fix it, and how to prevent it next time.
In this four-part series, I'm going to walk you through every architectural decision — from the first API call to the final confidence score in the database.
But first: let's understand exactly why the current state of incident response is broken, and what a smarter system should look like.
01 —The State of Modern Incident Response
Ask any on-call engineer what happens when an alert fires, and you'll hear some version of the same story.
Step 1: PagerDuty wakes you up. The alert says something vague like HighErrorRate or P99LatencyExceeded.
Step 2: You open Datadog and filter logs for the last 30 minutes. You're looking for red lines in a sea of noise.
Step 3: You hop into Grafana to check CPU and memory. Nothing obvious.
Step 4: You check recent deployments. A new version went out 2 hours ago. Maybe that?
Step 5: You write a message in Slack: "I think it might be the new cart-service deployment. Investigating."
Step 6: 45 minutes later, you've confirmed it. You roll back. The alert clears. You write a postmortem at 4 AM.
Sound familiar?
The problem isn't that the tools are bad. Datadog, Grafana, and Kibana are excellent at what they do. The problem is context fragmentation. The signal that points to the root cause is almost always there, scattered across multiple observability platforms. But a human has to manually pull it all together, in real time, under pressure, often running on zero sleep.
This is precisely the kind of work that structured AI reasoning excels at.
02 —Why LLMs Are a Natural Fit for Incident Diagnosis
At their core, LLMs are pattern-matching machines. They've been trained on enormous corpora of technical documentation, Stack Overflow posts, GitHub issues, and engineering postmortems. When you hand an LLM a set of error logs, deployment metadata, and a severity label — and ask it to reason about root cause — it can connect dots across multiple signals in a way that would take a human significantly longer.
But here's the critical insight that most people miss: you can't just dump logs into ChatGPT and call it done.
There are four hard problems that a production-ready AI incident system has to solve:
| Problem | Why It Matters |
|---|---|
| Latency | LLM inference can take 30–60 seconds. Your incident API cannot block that long. |
| Privacy | Production logs often contain infrastructure secrets, PII, or proprietary data. Sending them to a public API is a non-starter. |
| Structure | Engineers need structured output — root cause, mitigation, prevention — not a chatbot paragraph. |
| Reliability | LLMs sometimes return malformed JSON. The system has to handle this gracefully without crashing. |
Each of these problems needs a deliberate architectural decision. And together, they define the shape of the entire system.
03 —The Architecture: A Bird's Eye View
Here's the high-level picture of what I built.
The key insight in this design is the strict separation between the synchronous API and the asynchronous AI pipeline. When an incident is created, the FastAPI endpoint does exactly two things:
- Saves the incident to PostgreSQL.
- Fires a Celery task with the incident ID.
That's it. The API responds in milliseconds. The AI investigation happens in the background, in a Celery worker process, completely decoupled from the HTTP lifecycle.
This means the platform stays available and responsive even if the local LLM is slow, under load, or temporarily down.
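To make the two-step handler concrete, here is a minimal sketch of the pattern using only the Python standard library. It is not the platform's code: a `queue.Queue` and a background thread stand in for Redis and the Celery worker, a dict stands in for PostgreSQL, and the names `create_incident`, `investigate`, `INCIDENTS`, and `TASKS` are hypothetical.

```python
import queue
import threading
import time
import uuid

# Stand-ins (assumptions): a dict for PostgreSQL, a Queue for the Celery/Redis broker.
INCIDENTS = {}
TASKS = queue.Queue()


def create_incident(title: str, severity: str) -> dict:
    """Synchronous path: persist the incident, enqueue its ID, return at once."""
    incident_id = str(uuid.uuid4())
    INCIDENTS[incident_id] = {"title": title, "severity": severity, "status": "OPEN"}
    TASKS.put(incident_id)  # the only thing handed to the AI pipeline is the ID
    return {"id": incident_id, "status": "OPEN"}


def investigate(incident_id: str) -> None:
    """Asynchronous path: the slow LLM call would happen here."""
    time.sleep(0.2)  # placeholder for model inference
    INCIDENTS[incident_id]["status"] = "ANALYZED"


def worker() -> None:
    """Background loop that drains the queue, playing the role of a Celery worker."""
    while True:
        incident_id = TASKS.get()
        try:
            investigate(incident_id)
        finally:
            TASKS.task_done()


if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    started = time.perf_counter()
    response = create_incident("Checkout 5xx spike", "SEV1")
    print(f"API path took {time.perf_counter() - started:.4f}s: {response}")
    TASKS.join()  # wait for the background investigation to finish
    print(INCIDENTS[response["id"]]["status"])
```

The point of the sketch is the shape, not the tooling: the request path never waits on `investigate`, so a slow or unavailable model only delays the analysis, not the API response.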
04 —What "Structured Analysis" Actually Means
Before we go deeper into the architecture, it's worth being specific about what the AI is actually supposed to produce.
When the investigation completes, a structured IncidentAnalysis record gets written to the database with these exact fields:
| Field | Type | Description |
|---|---|---|
| root_cause | Text | The suspected cause of the incident |
| reasoning | Text | The AI's chain of thought: how it reached the conclusion |
| mitigation | Text | Immediate steps to resolve the incident right now |
| prevention | Text | Architectural changes to prevent it from happening again |
| recommendation | Text | Longer-term strategic recommendations |
| confidence_score | Float | How confident the AI is in its analysis (0–100%) |
| execution_status | String | COMPLETED, RECOVERED_FALLBACK, or FAILED |
Notice the execution_status field. This isn't just metadata; it's a signal of trust. If the AI completed its analysis cleanly, the status reads COMPLETED. If the LLM's output was malformed and the system fell back to safe defaults, the status shows RECOVERED_FALLBACK, and the engineer knows to review it more carefully. FAILED means the automated analysis did not produce a usable result and the incident is marked for manual review.
This is what I mean by structured analysis. Not a paragraph of free text from a chatbot — a typed database record that an engineer can scan at a glance and a frontend can render as a structured dashboard.
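As a rough illustration of that record, here is a plain-Python sketch of the fields from the table. The real platform presumably uses a database model; the dataclass form, the range check, and the `needs_review` helper are assumptions added for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionStatus(str, Enum):
    COMPLETED = "COMPLETED"
    RECOVERED_FALLBACK = "RECOVERED_FALLBACK"
    FAILED = "FAILED"


@dataclass(frozen=True)
class IncidentAnalysis:
    root_cause: str
    reasoning: str
    mitigation: str
    prevention: str
    recommendation: str
    confidence_score: float  # percentage, 0 to 100
    execution_status: ExecutionStatus

    def __post_init__(self) -> None:
        # Assumed guard: reject scores outside the documented 0-100 range.
        if not 0.0 <= self.confidence_score <= 100.0:
            raise ValueError("confidence_score must be between 0 and 100")

    @property
    def needs_review(self) -> bool:
        """Hypothetical helper: anything but a clean run deserves a closer look."""
        return self.execution_status is not ExecutionStatus.COMPLETED


if __name__ == "__main__":
    analysis = IncidentAnalysis(
        root_cause="Connection pool exhaustion in cart-service",
        reasoning="Errors began shortly after the latest deployment.",
        mitigation="Roll back cart-service to the previous version.",
        prevention="Add pool saturation alerts and load tests.",
        recommendation="Adopt canary deployments.",
        confidence_score=87.0,
        execution_status=ExecutionStatus.COMPLETED,
    )
    print(analysis.needs_review)
```

Because every field is typed, a frontend can render the record directly, and a status other than COMPLETED can be surfaced as a visible "review me" flag.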
05 —The LLM Layer: Privacy Without Sacrificing Scale
This was one of the most important architectural decisions in the entire project, and it's worth being honest about the trade-offs.
Production incident logs contain things you should never send to a public API:
- Database connection strings that show up in tracebacks
- Internal IP addresses and hostnames
- Employee usernames in authentication logs
- API keys that occasionally leak into error messages
- Customer data in request/response logs
So the constraint is clear: incident data cannot leave your infrastructure. But how you enforce that constraint matters a lot depending on where you are in the lifecycle of the product.
For the initial build, I used Ollama — a local LLM runtime that runs models directly on your machine. For prototyping and proof-of-concept work, it's excellent. You get fast iteration, zero API costs, and complete data isolation. I could test prompts, tune schemas, and validate the entire multi-agent pipeline without worrying about rate limits or cloud bills.
But let's be real: Ollama is not a production-grade serving solution. It runs on a single machine. It has no built-in horizontal scaling, no request batching, no auto-scaling, and limited concurrency. If your platform handles dozens or hundreds of concurrent incidents — which is exactly what an enterprise incident system needs to do during a large-scale outage — a single Ollama instance becomes a bottleneck very quickly.
For production at scale, there are two paths that maintain the same data privacy guarantee:
| Approach | When to Use | Key Advantage |
|---|---|---|
| Self-hosted model serving (vLLM, TGI, Triton) | You want full control, have GPU infrastructure, and need maximum throughput | Continuous batching, tensor parallelism, horizontal scaling across GPU nodes |
| Private cloud LLM APIs (Azure OpenAI, AWS Bedrock, Vertex AI) | You need elastic scaling without managing GPU hardware | Data residency guarantees, BAAs, auto-scaling, access to frontier models |
With vLLM, for example, you get continuous batching that can handle hundreds of concurrent requests, tensor parallelism across multiple GPUs, and an OpenAI-compatible API, so your existing code barely changes. With Azure OpenAI or AWS Bedrock, your data stays within your cloud region under the provider's data-processing terms (plus a Business Associate Agreement if you handle HIPAA-regulated data), you get elastic auto-scaling, and you're running frontier-quality models without managing a single GPU driver.
The key architectural decision isn't which LLM provider you pick — it's making sure your system doesn't care. The LLMGateway class in the codebase is designed exactly for this:
    # app/agents/llm/llm_gateway.py (simplified)
    class LLMGateway:
        def __init__(self, llm):
            # Any client exposing invoke(prompt), chosen per environment:
            #   Dev:     Ollama on localhost
            #   Staging: vLLM cluster on internal k8s
            #   Prod:    Azure OpenAI (private endpoint)
            self.llm = llm

        def invoke(self, prompt: str) -> str:
            # Delegate to whichever backend was injected.
            return self.llm.invoke(prompt)
One interface. Three environments. Zero changes to the agents, the prompts, or the business logic. Ollama for your laptop, vLLM for your staging cluster, Azure OpenAI for production. The gateway absorbs the complexity so the rest of the system doesn't have to.
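One way to wire that up is to pick the backend once, at startup, from the environment name. The sketch below is an assumption about how this could look, not the project's actual wiring: `LLMBackend`, `EchoBackend`, `BACKENDS`, `gateway_for`, and the `APP_ENV` variable are all hypothetical, and the stub backend simply echoes the prompt instead of calling a model.

```python
import os
from typing import Protocol


class LLMBackend(Protocol):
    """Anything with invoke(prompt) -> str can sit behind the gateway."""

    def invoke(self, prompt: str) -> str: ...


class EchoBackend:
    """Stub for illustration; a real backend would call Ollama, vLLM, or Azure OpenAI."""

    def __init__(self, name: str) -> None:
        self.name = name

    def invoke(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Hypothetical mapping from environment name to a backend factory.
BACKENDS = {
    "dev": lambda: EchoBackend("ollama-local"),
    "staging": lambda: EchoBackend("vllm-cluster"),
    "prod": lambda: EchoBackend("azure-openai-private"),
}


class LLMGateway:
    def __init__(self, llm: LLMBackend) -> None:
        self.llm = llm

    def invoke(self, prompt: str) -> str:
        return self.llm.invoke(prompt)


def gateway_for(env: str) -> LLMGateway:
    """Build a gateway for the given environment, failing loudly on typos."""
    if env not in BACKENDS:
        raise ValueError(f"unknown environment: {env!r}")
    return LLMGateway(BACKENDS[env]())


if __name__ == "__main__":
    gateway = gateway_for(os.environ.get("APP_ENV", "dev"))
    print(gateway.invoke("Summarize the incident."))
```

Agents only ever see `LLMGateway.invoke`, so swapping a backend is a configuration change rather than a code change.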
This is what designing for scale actually looks like: not picking the perfect tool on day one, but building an abstraction that lets you graduate from POC to production without rewriting the layer above it.
06 —The Audit Trail: Making AI Accountable
Here's something that most AI tutorials completely skip over: if you automate a decision, you need to be able to explain it.
This is especially true for incident response. If the AI says "the root cause is a memory leak in the payment-service," an engineer has to be able to verify that claim. They need to see:
- What data the AI actually looked at
- When the analysis kicked off
- Whether the analysis succeeded or fell back to a default state
- What the raw reasoning chain was
To solve this, the platform maintains a chronological IncidentEvent timeline for every incident. Every significant step in the AI workflow appends an event:
AI_ANALYSIS_STARTED → "AI Orchestrator has initiated the diagnostic workflow."
ROOT_CAUSE_GENERATED → "Root cause determined with 87% confidence."
AI_ANALYSIS_FAILED → "Automated RCA failed to structure. Marked for Manual Review."
These events are stored in the database and exposed via a GET /api/v1/incidents/{id}/timeline endpoint. Engineers can see the full chronological story of what happened — including what the AI did, when it did it, and whether it succeeded.
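A minimal in-memory sketch of such a timeline follows. The event type names come from the examples above; the `Timeline` class, its method names, and the response shape are assumptions standing in for the real database table and endpoint.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IncidentEvent:
    incident_id: str
    event_type: str
    message: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Timeline:
    """Append-only, in-memory stand-in for the IncidentEvent table."""

    def __init__(self) -> None:
        self._events: list[IncidentEvent] = []

    def append(self, incident_id: str, event_type: str, message: str) -> IncidentEvent:
        event = IncidentEvent(incident_id, event_type, message)
        self._events.append(event)
        return event

    def for_incident(self, incident_id: str) -> list[dict]:
        """Roughly what a timeline endpoint might return: oldest event first."""
        events = sorted(
            (e for e in self._events if e.incident_id == incident_id),
            key=lambda e: e.created_at,  # stable sort keeps insertion order on ties
        )
        return [
            {"type": e.event_type, "message": e.message, "at": e.created_at.isoformat()}
            for e in events
        ]


if __name__ == "__main__":
    timeline = Timeline()
    timeline.append("inc-1", "AI_ANALYSIS_STARTED",
                    "AI Orchestrator has initiated the diagnostic workflow.")
    timeline.append("inc-1", "ROOT_CAUSE_GENERATED",
                    "Root cause determined with 87% confidence.")
    for entry in timeline.for_incident("inc-1"):
        print(entry["at"], entry["type"], entry["message"])
```

Events are only ever appended, never edited, which is what lets the timeline serve as an audit trail.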
This timeline is the difference between an AI system that engineers trust and one that they quietly ignore.
07 —What's Coming in This Series
We've covered the why and the what. Over the next three weeks, we're going deep into the how.
Week 2 — Building the Backbone: We'll implement the FastAPI + Celery + Redis pipeline. You'll see exactly why LLM calls should never live inside an HTTP handler, and how to set up a task queue with exponential backoff retries.
Week 3 — The Self-Healing AI Agent: We'll build the ReasoningAgent — the core of the system. The interesting part: the agent has a self-healing retry loop. When the LLM returns malformed JSON, the agent injects the schema violation error back into the next prompt, forcing the model to correct itself. This is one of the most practical prompt engineering patterns I've used in production.
Week 4 — Trust and Production Hardening: We'll look at confidence scoring, the execution_status fallback system, role-based access control, and everything you need to deploy an AI system that engineers will actually trust when they're on call at 2 AM.
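As a preview of the Week 3 idea, here is a small sketch of a self-correcting retry loop: when the model's reply fails validation, the error is appended to the next prompt. This is not the ReasoningAgent itself. The function names, the reduced field list, the retry wording, and the fallback values are all assumptions, and the LLM is any callable that takes a prompt and returns a string.

```python
import json
from typing import Callable

REQUIRED_FIELDS = ("root_cause", "mitigation", "prevention")  # subset, for brevity


def validate(raw: str) -> dict:
    """Parse the model output and check required keys; raise ValueError otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return data


def analyze(llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, feeding validation errors back until the output parses."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = llm(current_prompt)
        try:
            result = validate(raw)
        except ValueError as error:
            # Inject the violation into the next prompt so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {error}. "
                "Reply with a single JSON object containing: "
                + ", ".join(REQUIRED_FIELDS)
            )
            continue
        return {**result, "execution_status": "COMPLETED"}
    # Assumed safe defaults once every attempt has failed.
    return {
        "root_cause": "Undetermined",
        "mitigation": "Manual review required",
        "prevention": "Manual review required",
        "execution_status": "RECOVERED_FALLBACK",
    }
```

The loop is bounded by `max_attempts`, so a model that never produces valid JSON ends in the fallback record rather than an exception or an endless retry.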
The next time you're staring at five dashboards at 2 AM, remember: the information needed to diagnose the problem is almost always already there. The bottleneck is the time it takes a human to pull it all together.
That bottleneck is solvable. Let's build.
Let's build smart. Let's build together.
— Gopal