AI & LLMs · Advanced · June 27, 2026 · 14 min read

The Problem with Modern Incident Response (And How AI Fixes It)

You're staring at five dashboards at 2 AM, trying to figure out why production is down. What if the first hour of investigation could happen automatically? I built an AI Incident Response Platform — here's why, and the architecture behind it.

Author: Gopal
Published: Jun 27, 2026
Read time: 14 min
Difficulty: Advanced
Fig. 0 — The Problem with Modern Incident Response (And How AI Fixes It)

You type a message in Slack at 2 AM: "prod is down, anyone know why?"

You're staring at five browser tabs — Datadog, Kibana, PagerDuty, Grafana, and a Slack thread that's moving faster than you can read. What if the first hour of investigation could happen automatically, while you were still pouring your coffee?

That's the problem I set out to solve. I built an AI Incident Response Platform — a system that ingests incident data, invokes a multi-agent AI workflow, and automatically produces structured Root Cause Analysis: what broke, how to fix it, and how to prevent it next time.

In this four-part series, I'm going to walk you through every architectural decision — from the first API call to the final confidence score in the database.

But first: let's understand exactly why the current state of incident response is broken, and what a smarter system should look like.

01 — The State of Modern Incident Response

Ask any on-call engineer what happens when an alert fires, and you'll hear some version of the same story.

Step 1: PagerDuty wakes you up. The alert says something vague like HighErrorRate or P99LatencyExceeded.

Step 2: You open Datadog and filter logs for the last 30 minutes. You're looking for red lines in a sea of noise.

Step 3: You hop into Grafana to check CPU and memory. Nothing obvious.

Step 4: You check recent deployments. A new version went out 2 hours ago. Maybe that?

Step 5: You write a message in Slack: "I think it might be the new cart-service deployment. Investigating."

Step 6: 45 minutes later, you've confirmed it. You roll back. The alert clears. You write a postmortem at 4 AM.

Sound familiar?

The problem isn't that the tools are bad. Datadog, Grafana, Kibana — they're excellent at what they do. The problem is context fragmentation. The signal that points to the root cause is always there, scattered across multiple observability platforms. But a human has to manually pull it all together, in real time, under pressure, often running on zero sleep.

This is precisely the kind of work that structured AI reasoning excels at.

02 — Why LLMs Are a Natural Fit for Incident Diagnosis

At their core, LLMs are pattern-matching machines. They've been trained on enormous corpora of technical documentation, Stack Overflow posts, GitHub issues, and engineering postmortems. When you hand an LLM a set of error logs, deployment metadata, and a severity label — and ask it to reason about root cause — it can connect dots across multiple signals in a way that would take a human significantly longer.

But here's the critical insight that most people miss: you can't just dump logs into ChatGPT and call it done.

There are four hard problems that a production-ready AI incident system has to solve:

| Problem | Why It Matters |
| --- | --- |
| Latency | LLM inference can take 30–60 seconds. Your incident API cannot block that long. |
| Privacy | Production logs often contain infrastructure secrets, PII, or proprietary data. Sending them to a public API is a non-starter. |
| Structure | Engineers need structured output (root cause, mitigation, prevention), not a chatbot paragraph. |
| Reliability | LLMs sometimes return malformed JSON. The system has to handle this gracefully without crashing. |

Each of these problems needs a deliberate architectural decision. And together, they define the shape of the entire system.

03 — The Architecture: A Bird's Eye View

Here's the high-level picture of what I built.

Fig. 1 — AI Incident Response Platform Architecture
[Diagram: POST /incidents (FastAPI · Auth · Save) → fire & forget → Celery Worker (Redis Queue · Retries · Async processing) → AI Investigation Pipeline: MonitoringAgent (gather signals) and ReasoningAgent (LLM · RCA), sharing IncidentState (logs · metrics · traces · deploys) → PostgreSQL + Audit (root_cause · mitigation · confidence). 🔒 Private LLM (vLLM · Azure · Bedrock): data stays in your infra.]
The API responds in milliseconds; all AI work runs asynchronously in a Celery worker with a local LLM

The key insight in this design is the strict separation between the synchronous API and the asynchronous AI pipeline. When an incident is created, the FastAPI endpoint does exactly two things:

  1. Saves the incident to PostgreSQL.
  2. Fires a Celery task with the incident ID.

That's it. The API responds in milliseconds. The AI investigation happens in the background, in a Celery worker process, completely decoupled from the HTTP lifecycle.

This means the platform stays available and responsive even if the local LLM is slow, under load, or temporarily down.
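The save-then-enqueue flow can be sketched with the standard library alone. This is a minimal stand-in under stated assumptions, not the platform's actual code: a `queue.Queue` plays the role of the Redis queue, a background thread plays the Celery worker, a dict plays PostgreSQL, and `create_incident` and `slow_investigation` are hypothetical names.

```python
import queue
import threading
import time
import uuid

# In-memory stand-ins (assumptions for illustration only):
# INCIDENTS plays the role of PostgreSQL, TASKS the role of the Redis queue.
INCIDENTS = {}
TASKS = queue.Queue()


def slow_investigation(incident_id):
    """Hypothetical placeholder for the multi-agent AI pipeline."""
    time.sleep(0.2)  # pretend the LLM takes a while
    INCIDENTS[incident_id]["status"] = "ANALYZED"


def worker():
    """Plays the Celery worker: pulls incident IDs and investigates them."""
    while True:
        incident_id = TASKS.get()
        try:
            slow_investigation(incident_id)
        finally:
            TASKS.task_done()


def create_incident(title):
    """Plays the POST /incidents handler: save, enqueue, return."""
    incident_id = str(uuid.uuid4())
    INCIDENTS[incident_id] = {"title": title, "status": "OPEN"}  # 1. save
    TASKS.put(incident_id)  # 2. enqueue only the ID, not the payload
    return {"id": incident_id, "status": "OPEN"}  # respond without waiting


threading.Thread(target=worker, daemon=True).start()

started = time.perf_counter()
response = create_incident("HighErrorRate on cart-service")
handler_seconds = time.perf_counter() - started

TASKS.join()  # demo only: wait for the background work before inspecting it
final_status = INCIDENTS[response["id"]]["status"]
```

The handler returns before the investigation finishes, so its latency does not depend on how slow the model is. Passing only the incident ID keeps the queued message small and lets the worker read the latest state from the database.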

04 — What "Structured Analysis" Actually Means

Before we go deeper into the architecture, it's worth being specific about what the AI is actually supposed to produce.

When the investigation completes, a structured IncidentAnalysis record gets written to the database with these exact fields:

| Field | Type | Description |
| --- | --- | --- |
| root_cause | Text | The suspected cause of the incident |
| reasoning | Text | The AI's chain of thought: how it reached the conclusion |
| mitigation | Text | Immediate steps to resolve the incident right now |
| prevention | Text | Architectural changes to prevent it from happening again |
| recommendation | Text | Longer-term strategic recommendations |
| confidence_score | Float | How confident the AI is in its analysis (0–100%) |
| execution_status | String | COMPLETED, RECOVERED_FALLBACK, or FAILED |

Notice the execution_status field. This isn't just metadata — it's a signal of trust. If the AI completed its analysis cleanly, the status reads COMPLETED. If the LLM's output was malformed and the system fell back to safe defaults, the status shows RECOVERED_FALLBACK, and the engineer knows to review it more carefully.

This is what I mean by structured analysis. Not a paragraph of free text from a chatbot — a typed database record that an engineer can scan at a glance and a frontend can render as a structured dashboard.
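As a rough illustration of the record and the fallback status, here is a sketch using a dataclass. The field names and status values come from the table above; the `parse_analysis` helper, the validation rules, and the placeholder defaults are assumptions, and the real system may store and validate this differently.

```python
import json
from dataclasses import dataclass
from enum import Enum


class ExecutionStatus(str, Enum):
    COMPLETED = "COMPLETED"
    RECOVERED_FALLBACK = "RECOVERED_FALLBACK"
    FAILED = "FAILED"  # not exercised in this sketch


@dataclass
class IncidentAnalysis:
    root_cause: str
    reasoning: str
    mitigation: str
    prevention: str
    recommendation: str
    confidence_score: float  # 0-100
    execution_status: ExecutionStatus


TEXT_FIELDS = ("root_cause", "reasoning", "mitigation", "prevention", "recommendation")


def parse_analysis(raw_llm_output):
    """Build a typed record; fall back to safe defaults if the output is unusable."""
    try:
        data = json.loads(raw_llm_output)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        texts = {name: str(data[name]) for name in TEXT_FIELDS}
        score = float(data["confidence_score"])
        if not 0.0 <= score <= 100.0:
            raise ValueError("confidence_score out of range")
        return IncidentAnalysis(**texts, confidence_score=score,
                                execution_status=ExecutionStatus.COMPLETED)
    except (ValueError, KeyError, TypeError):
        # Assumed fallback policy: placeholder text and zero confidence.
        defaults = {name: "Unavailable: manual review required" for name in TEXT_FIELDS}
        return IncidentAnalysis(**defaults, confidence_score=0.0,
                                execution_status=ExecutionStatus.RECOVERED_FALLBACK)


good = parse_analysis(json.dumps({
    "root_cause": "Memory leak in payment-service",
    "reasoning": "Heap usage climbs steadily after the last deploy",
    "mitigation": "Roll back the deploy",
    "prevention": "Add a memory alert and a soak test",
    "recommendation": "Review the connection pooling strategy",
    "confidence_score": 87,
}))
bad = parse_analysis("Sure! Here is my analysis: the deploy broke it.")
```

Because the status travels with the record, a dashboard can flag the second result for review instead of presenting placeholder text as a finding.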

05 — The LLM Layer: Privacy Without Sacrificing Scale

This was one of the most important architectural decisions in the entire project, and it's worth being honest about the trade-offs.

Production incident logs contain things you should never send to a public API:

  • Database connection strings that show up in tracebacks
  • Internal IP addresses and hostnames
  • Employee usernames in authentication logs
  • API keys that occasionally leak into error messages
  • Customer data in request/response logs

So the constraint is clear: incident data cannot leave your infrastructure. But how you enforce that constraint matters a lot depending on where you are in the lifecycle of the product.

For the initial build, I used Ollama — a local LLM runtime that runs models directly on your machine. For prototyping and proof-of-concept work, it's excellent. You get fast iteration, zero API costs, and complete data isolation. I could test prompts, tune schemas, and validate the entire multi-agent pipeline without worrying about rate limits or cloud bills.

But let's be real: Ollama is not a production-grade serving solution. It runs on a single machine. It has no built-in horizontal scaling, no request batching, no auto-scaling, and limited concurrency. If your platform handles dozens or hundreds of concurrent incidents — which is exactly what an enterprise incident system needs to do during a large-scale outage — a single Ollama instance becomes a bottleneck very quickly.

For production at scale, there are two paths that maintain the same data privacy guarantee:

| Approach | When to Use | Key Advantage |
| --- | --- | --- |
| Self-hosted model serving (vLLM, TGI, Triton) | You want full control, have GPU infrastructure, and need maximum throughput | Continuous batching, tensor parallelism, horizontal scaling across GPU nodes |
| Private cloud LLM APIs (Azure OpenAI, AWS Bedrock, Vertex AI) | You need elastic scaling without managing GPU hardware | Data residency guarantees, BAAs, auto-scaling, access to frontier models |

With vLLM, for example, you get continuous batching that can handle hundreds of concurrent requests, tensor parallelism across multiple GPUs, and an OpenAI-compatible API, so your existing code barely changes. With Azure OpenAI or AWS Bedrock, your data stays within your cloud region under the provider's data-residency terms (plus a Business Associate Agreement if you handle HIPAA-regulated data), you get elastic auto-scaling, and you're running frontier-quality models without managing a single GPU driver.

The key architectural decision isn't which LLM provider you pick — it's making sure your system doesn't care. The LLMGateway class in the codebase is designed exactly for this:

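# app/agents/llm/llm_gateway.py (simplified)
class LLMGateway:
    def __init__(self, llm):
        # Assumed constructor: llm is any client exposing invoke(prompt),
        # chosen per environment.
        self.llm = llm

    def invoke(self, prompt: str) -> str:
        # Dev: Ollama on localhost
        # Staging: vLLM cluster on internal k8s
        # Prod: Azure OpenAI (private endpoint)
        response = self.llm.invoke(prompt)
        return response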

One interface. Three environments. Zero changes to the agents, the prompts, or the business logic. Ollama for your laptop, vLLM for your staging cluster, Azure OpenAI for production. The gateway absorbs the complexity so the rest of the system doesn't have to.

This is what designing for scale actually looks like: not picking the perfect tool on day one, but building an abstraction that lets you graduate from POC to production without rewriting the layer above it.
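One way the gateway could pick its backend per environment is sketched below. Everything beyond the `LLMGateway.invoke` interface is an assumption for illustration: the `LLMBackend` protocol, the `EchoBackend` stand-in (used in place of real Ollama, vLLM, or Azure OpenAI clients), the `BACKENDS` registry, and the `APP_ENV` variable name.

```python
import os
from typing import Callable, Dict, Optional, Protocol


class LLMBackend(Protocol):
    def invoke(self, prompt: str) -> str: ...


class EchoBackend:
    """Hypothetical stand-in for a real client (Ollama, vLLM, Azure OpenAI)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def invoke(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Assumed registry: environment name -> factory for that environment's backend.
BACKENDS: Dict[str, Callable[[], LLMBackend]] = {
    "dev": lambda: EchoBackend("ollama-localhost"),
    "staging": lambda: EchoBackend("vllm-internal"),
    "prod": lambda: EchoBackend("azure-openai-private"),
}


class LLMGateway:
    def __init__(self, llm: LLMBackend) -> None:
        self.llm = llm

    @classmethod
    def from_env(cls, env: Optional[str] = None) -> "LLMGateway":
        # APP_ENV is a hypothetical variable name for this sketch.
        env = env or os.environ.get("APP_ENV", "dev")
        try:
            factory = BACKENDS[env]
        except KeyError:
            raise ValueError(f"unknown environment: {env!r}") from None
        return cls(factory())

    def invoke(self, prompt: str) -> str:
        # Agents call this method and never see which backend answered.
        return self.llm.invoke(prompt)


staging_reply = LLMGateway.from_env("staging").invoke("ping")
```

Swapping providers then means changing one registry entry, while agents and prompts keep calling `invoke` unchanged.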

06 — The Audit Trail: Making AI Accountable

Here's something that most AI tutorials completely skip over: if you automate a decision, you need to be able to explain it.

This is especially true for incident response. If the AI says "the root cause is a memory leak in the payment-service," an engineer has to be able to verify that claim. They need to see:

  • What data the AI actually looked at
  • When the analysis kicked off
  • Whether the analysis succeeded or fell back to a default state
  • What the raw reasoning chain was

To solve this, the platform maintains a chronological IncidentEvent timeline for every incident. Every significant step in the AI workflow appends an event:

AI_ANALYSIS_STARTED    → "AI Orchestrator has initiated the diagnostic workflow."
ROOT_CAUSE_GENERATED   → "Root cause determined with 87% confidence."
AI_ANALYSIS_FAILED     → "Automated RCA failed to structure. Marked for Manual Review."

These events are stored in the database and exposed via a GET /api/v1/incidents/{id}/timeline endpoint. Engineers can see the full chronological story of what happened — including what the AI did, when it did it, and whether it succeeded.
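A minimal sketch of such an append-only timeline follows. The event type names and messages are taken from the examples above; the `Timeline` class, its in-memory storage, and the field names on `IncidentEvent` are assumptions standing in for the real table and endpoint.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(str, Enum):
    AI_ANALYSIS_STARTED = "AI_ANALYSIS_STARTED"
    ROOT_CAUSE_GENERATED = "ROOT_CAUSE_GENERATED"
    AI_ANALYSIS_FAILED = "AI_ANALYSIS_FAILED"


@dataclass(frozen=True)
class IncidentEvent:
    incident_id: str
    event_type: EventType
    message: str
    created_at: datetime


@dataclass
class Timeline:
    """In-memory stand-in for the events table (an assumption for this sketch)."""

    _events: list = field(default_factory=list)

    def append(self, incident_id, event_type, message):
        # Events are only ever added, never edited, so the history stays auditable.
        event = IncidentEvent(incident_id, event_type, message,
                              datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def for_incident(self, incident_id):
        """What a timeline endpoint might return: oldest event first."""
        events = [e for e in self._events if e.incident_id == incident_id]
        return sorted(events, key=lambda e: e.created_at)  # stable on ties


timeline = Timeline()
timeline.append("inc-1", EventType.AI_ANALYSIS_STARTED,
                "AI Orchestrator has initiated the diagnostic workflow.")
timeline.append("inc-1", EventType.ROOT_CAUSE_GENERATED,
                "Root cause determined with 87% confidence.")
timeline.append("inc-2", EventType.AI_ANALYSIS_FAILED,
                "Automated RCA failed to structure. Marked for Manual Review.")

history = [e.event_type.value for e in timeline.for_incident("inc-1")]
```

Freezing each event and exposing only `append` keeps the sketch honest about the audit property: the record of what the AI did cannot be rewritten after the fact.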

This timeline is the difference between an AI system that engineers trust and one that they quietly ignore.

07 — What's Coming in This Series

We've covered the why and the what. Over the next three weeks, we're going deep into the how.

Week 2 — Building the Backbone: We'll implement the FastAPI + Celery + Redis pipeline. You'll see exactly why LLM calls should never live inside an HTTP handler, and how to set up a task queue with exponential backoff retries.

Week 3 — The Self-Healing AI Agent: We'll build the ReasoningAgent — the core of the system. The interesting part: the agent has a self-healing retry loop. When the LLM returns malformed JSON, the agent injects the schema violation error back into the next prompt, forcing the model to correct itself. This is one of the most practical prompt engineering patterns I've used in production.

Week 4 — Trust and Production Hardening: We'll look at confidence scoring, the execution_status fallback system, role-based access control, and everything you need to deploy an AI system that engineers will actually trust when they're on call at 2 AM.
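As a preview of the Week 3 idea, the self-healing retry loop might look roughly like this. It is a sketch under assumptions, not the ReasoningAgent itself: the required keys, the wording of the corrective prompt, the attempt limit, and the `FlakyLLM` fake are all hypothetical.

```python
import json

REQUIRED_KEYS = ("root_cause", "mitigation", "prevention")  # assumed subset of the schema


def validate(raw):
    """Return the parsed object or raise ValueError describing the violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output was not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("Output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Missing required keys: {', '.join(missing)}")
    return data


def analyze_with_self_healing(llm, base_prompt, max_attempts=3):
    prompt = base_prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = str(exc)
            # Feed the violation back so the next attempt can correct it.
            prompt = (
                f"{base_prompt}\n\nYour previous reply was rejected: {last_error}\n"
                "Reply again with only a JSON object that satisfies the schema."
            )
    raise RuntimeError(f"LLM output still invalid after {max_attempts} attempts: {last_error}")


class FlakyLLM:
    """Hypothetical fake: fails once, then complies. Records the prompts it saw."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        if len(self.prompts) == 1:
            return "Sure! The root cause is probably the deploy."
        return json.dumps({"root_cause": "bad deploy",
                           "mitigation": "roll back",
                           "prevention": "canary releases"})


fake = FlakyLLM()
result = analyze_with_self_healing(fake, "Analyze incident 42 and reply in JSON.")
```

The second prompt carries the exact validation error, which gives the model something concrete to fix instead of a blind retry. When every attempt fails, the raised error is the point where a caller could fall back to defaults or mark the incident for manual review.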

The next time you're staring at five dashboards at 2 AM, remember: the information needed to diagnose the problem is almost always already there. The bottleneck is the time it takes a human to pull it all together.

That bottleneck is solvable. Let's build.

Let's build smart. Let's build together.

— Gopal

© 2026 Ai TechSavvy. All rights reserved. Crafted by Gopal Kumar