We've all used ChatGPT, Gemini, or similar AI applications. You type a question, hit Enter, and a few seconds later the answer starts streaming in, word by word. It feels like magic.
But have you ever wondered what actually happens in the backend when you put a query in that chatbox? How does the system decide what to do with your question? How does it find the right information? How does that answer stream back to you in real-time?
I built an Enterprise AI Assistant from scratch, and in this post, I'm going to walk you through every single step — from the moment you press Enter to the final rendered response on your screen.
1. The User Presses Enter — It All Begins Here
It starts on the frontend. The user types their query into a ChatInput component — an auto-resizing textarea that listens for the Enter key. The moment they press Enter (without Shift), the message bubbles up to the ChatWindow orchestrator, which does two things immediately:
- Renders the user's message as a bubble on the right side of the chat
- Fires a `POST /chat/stream` request to the FastAPI backend
The important detail here is that this isn't a normal REST call. It's a streaming request. The frontend opens a persistent connection and reads the response line by line in real-time.
The request carries the query in the body and the user's authentication token as an HttpOnly cookie — meaning JavaScript can't access it, so the token can't be stolen through an XSS attack. A small but critical security decision.
2. Authentication — Who Are You?
Before the backend processes anything, it needs to know who is asking. A FastAPI dependency called get_current_user() intercepts the request and:
- Extracts the JWT from the `access_token` cookie (or the `Authorization` header as a fallback)
- Decodes and verifies it using `python-jose`
- Extracts the username and role (e.g., `employee` or `admin`)
If the token is missing or expired — instant 401 Unauthorized. No exceptions.
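Here's a minimal sketch of what that dependency looks like. The cookie and header names match the description above; the secret, algorithm, and error messages are placeholders rather than the production values.

```python
from fastapi import Cookie, Header, HTTPException
from jose import JWTError, jwt

SECRET_KEY = "change-me"   # placeholder: the real key lives in config/env
ALGORITHM = "HS256"        # assumption: symmetric signing

async def get_current_user(
    access_token: str | None = Cookie(default=None),
    authorization: str | None = Header(default=None),
) -> dict:
    # Prefer the HttpOnly cookie; fall back to an "Authorization: Bearer ..." header
    token = access_token or (
        authorization.removeprefix("Bearer ").strip() if authorization else None
    )
    if not token:
        raise HTTPException(status_code=401, detail="Not authenticated")
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    # The role is what drives RBAC filtering later in the retrieval layer
    return {"username": payload.get("sub"), "role": payload.get("role", "employee")}
```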
Why does the role matter? Because later, during document retrieval, the system filters results based on the user's access level. An employee can only see employee-level documents. An admin sees everything. RBAC baked directly into the retrieval layer.
3. Security Guardrails — The Two-Layer Shield
This is where things get interesting. Before the query even reaches the AI, it passes through a two-layer security pipeline.
Layer 1: PII Scanner
A regex-based scanner checks the query for Personally Identifiable Information — emails, phone numbers, SSNs, credit card numbers, Aadhaar numbers, PAN cards, IP addresses, and passport numbers. If anything is found, it's replaced with [REDACTED] and the event is logged.
Input: "My SSN is 123-45-6789, what is the remote work policy?"
Output: "My SSN is [REDACTED], what is the remote work policy?"
The critical design decision here: PII detection is non-blocking. It redacts and warns, but lets the query through. You don't want to block a legitimate question just because the user accidentally included personal data.
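A stripped-down sketch of the scanner, with only a few illustrative patterns (the real list covers all the categories above):

```python
import re

# Illustrative subset of the PII patterns; the full scanner covers many more
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_pii(query: str) -> tuple[str, list[str]]:
    """Replace detected PII with [REDACTED] and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(query):
            found.append(label)                  # logged, but never blocks
            query = pattern.sub("[REDACTED]", query)
    return query, found

# redact_pii("My SSN is 123-45-6789, what is the remote work policy?")
# -> ("My SSN is [REDACTED], what is the remote work policy?", ["ssn"])
```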
Layer 2: Prompt Injection Filter
This is the blocking layer. It defends against jailbreak and injection attacks in two stages:
- Stage 1 — Pattern Matching (~0ms): 15 regex patterns check for known jailbreak phrases like "ignore all previous instructions", "bypass your safety filters", "developer mode", etc. If there's a hard match, the request is blocked instantly.
- Stage 2 — LLM Classifier (~500ms): If Stage 1 doesn't find a hard match but detects ≥2 suspicious terms, it invokes the LLM itself with a binary classification prompt: `SAFE` or `UNSAFE`. This catches the creative attacks that regex misses.
If blocked, the backend immediately streams a ⛔ Request blocked status event and closes the connection. The user sees the block reason in the chat. Clean and immediate.
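A condensed sketch of the two stages. The pattern list is a small sample, and `llm` is assumed to be an already-initialised chat model whose reply can be read as text:

```python
import re

BLOCK_PATTERNS = [                      # sample of the ~15 hard-block patterns
    r"ignore (all )?previous instructions",
    r"developer mode",
    r"bypass .{0,20}safety",
]
SUSPICIOUS_TERMS = ["system prompt", "jailbreak", "pretend you are", "no restrictions"]

def is_blocked(query: str, llm) -> bool:
    lowered = query.lower()
    # Stage 1: hard regex match -> block instantly (~0ms)
    if any(re.search(p, lowered) for p in BLOCK_PATTERNS):
        return True
    # Stage 2: only pay the ~500ms LLM cost when >= 2 suspicious terms appear
    if sum(term in lowered for term in SUSPICIOUS_TERMS) >= 2:
        verdict = llm.invoke(
            f"Classify this message as SAFE or UNSAFE. Reply with one word.\n\n{query}"
        )
        return "UNSAFE" in str(getattr(verdict, "content", verdict)).upper()
    return False
```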
4. Semantic Routing — The Brain Router
Now comes the first real AI decision. The safe, PII-redacted query is sent to a Semantic Router — an LLM-powered classifier built as a LangGraph StateGraph.
The LLM reads the query and classifies it into exactly one of three categories:
| Route | When | Example |
|---|---|---|
| `greeting` | Casual salutation | "Hi!", "Good morning" |
| `rag` | Knowledge question | "What is our leave policy?" |
| `agent` | Action request | "Show me revenue data" |
Think of this as the traffic cop of the entire system. A greeting doesn't need to invoke a multi-billion-parameter model — it just returns a hardcoded "Hello! How can I help you?" instantly. This saves compute and keeps the UX snappy.
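As a rough sketch, the router graph is tiny: one classification node and an end state. The prompt wording and the `llm` object are assumptions, and the real graph carries more state than shown here.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RouterState(TypedDict):
    query: str
    route: str   # "greeting" | "rag" | "agent"

def build_router(llm):
    def classify(state: RouterState) -> dict:
        label = llm.invoke(
            "Classify the user message into exactly one of: greeting, rag, agent.\n"
            f"Message: {state['query']}\nAnswer with the label only."
        )
        return {"route": str(getattr(label, "content", label)).strip().lower()}

    graph = StateGraph(RouterState)
    graph.add_node("classify", classify)
    graph.set_entry_point("classify")
    graph.add_edge("classify", END)
    return graph.compile()

# router = build_router(llm)
# router.invoke({"query": "What is our leave policy?", "route": ""})  # -> route: "rag"
```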
5. The RAG Pipeline — Finding the Right Answer
This is the most complex path, and it's where I invested the most engineering effort. When a knowledge question comes in, it goes through a 6-step retrieval and generation pipeline.
Step 1: Semantic Cache
Before doing any heavy lifting, the system checks if a semantically similar question has already been answered. It embeds the query with all-MiniLM-L6-v2, then compares it against all cached embeddings using cosine similarity.
If the similarity is ≥ 0.92 — it's a cache HIT. The cached answer is streamed back instantly, and the frontend shows a ⚡ badge to indicate it came from cache. A full pipeline run takes 5-12 seconds. A cache hit? ~50 milliseconds.
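In-memory pseudocode for the idea (the real cache lives in Redis); the model name and the 0.92 threshold are the ones mentioned above:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
_cache: list[tuple[object, str]] = []        # (query embedding, cached answer)

def cache_lookup(query: str, threshold: float = 0.92) -> str | None:
    q_emb = embedder.encode(query, convert_to_tensor=True)
    for emb, answer in _cache:
        if util.cos_sim(q_emb, emb).item() >= threshold:
            return answer                    # cache HIT: skip the whole pipeline
    return None

def cache_store(query: str, answer: str) -> None:
    _cache.append((embedder.encode(query, convert_to_tensor=True), answer))
```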
Step 2: HyDE (Hypothetical Document Embedding)
This is one of my favourite techniques. The problem with raw user queries is that they're questions — "What is the remote work policy?" — but documents are statements. They don't match well in embedding space.
HyDE fixes this by asking the LLM to generate a hypothetical answer first. The system then searches using both the original query and the hypothetical document combined. This dramatically improves retrieval quality because the search query now looks like the target document.
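The whole step fits in a few lines. Here `llm` and `retriever` are assumed to be the chat model and the retriever described in the next step:

```python
def hyde_search(query: str, llm, retriever):
    # Ask the model to write a plausible answer first...
    hypothetical = llm.invoke(
        f"Write a short passage that plausibly answers this question:\n{query}"
    )
    hypothetical = str(getattr(hypothetical, "content", hypothetical))
    # ...then search with query + hypothetical combined, so the search text
    # "looks like" the documents we want to find in embedding space.
    return retriever.invoke(f"{query}\n\n{hypothetical}")
```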
Step 3: Vector Retrieval (ParentDocumentRetriever)
This is where the actual document search happens. I use a ParentDocumentRetriever — a two-tier chunking strategy:
- Child chunks (400 chars) — small, precise pieces that are embedded in ChromaDB using `BAAI/bge-small-en`
- Parent chunks (2000 chars) — larger context blocks stored in a local file store
The system searches the small chunks for precision, but returns the larger parent chunks so the LLM gets richer context. It retrieves the top 6 documents, filtered by the user's access_role.
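Wiring this up with LangChain looks roughly like this; import paths vary between LangChain versions, and the storage path is a placeholder:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore, create_kv_docstore
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="docs", embedding_function=embeddings),
    docstore=create_kv_docstore(LocalFileStore("./parent_docs")),    # parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),   # indexed chunks
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
    search_kwargs={"k": 6},   # role-based metadata filtering would be added here
)
```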
Step 4: CrossEncoder Reranking
Vector similarity is a good first pass, but it's not always accurate enough. The CrossEncoder (ms-marco-MiniLM-L-6-v2) takes each (query, document) pair and produces a true relevance score. This is more accurate because it sees the query and document together, not as separate embeddings.
The 6 documents are re-scored, sorted, and only the top 3 move forward.
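The rerank step itself is short; `docs` are assumed to be LangChain documents with a `page_content` attribute:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 3) -> list:
    # Score each (query, document) pair jointly, which is what makes the
    # CrossEncoder more accurate than comparing separate embeddings
    scores = reranker.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```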
Step 5: Knowledge Graph Enrichment (GraphRAG)
Vector search excels at finding semantically similar text. But it struggles with relational questions like "Which department owns the remote work policy?" — facts that are scattered across different documents.
To solve this, I integrated Neo4j as a knowledge graph. The LLM converts the user query into a Cypher query, executes it against the graph, and appends the results as additional context. If Neo4j is unavailable, this step is silently skipped — graceful degradation, not a hard failure.
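A sketch of that step, with the connection details and Cypher-generation prompt as placeholders. The important part is the try/except: any failure simply returns empty context instead of killing the request.

```python
from neo4j import GraphDatabase

def graph_context(query: str, llm, uri="bolt://localhost:7687",
                  auth=("neo4j", "password")) -> str:
    try:
        cypher = llm.invoke(
            f"Translate this question into a single read-only Cypher query:\n{query}"
        )
        cypher = str(getattr(cypher, "content", cypher))
        with GraphDatabase.driver(uri, auth=auth) as driver:
            records, _, _ = driver.execute_query(cypher)
        return "\n".join(str(record.data()) for record in records)
    except Exception:
        return ""   # Neo4j down or bad Cypher: skip enrichment, don't fail
```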
Step 6: LLM Generates the Final Answer
Now the LLM has everything it needs: 3 reranked documents from vector search, entity relationships from the knowledge graph, and a carefully structured prompt that requires citations and markdown formatting.
The LLM streams the answer token by token using .stream(). As tokens arrive, they're wrapped as NDJSON events and sent to the frontend in real-time. Once the stream completes, the full answer is cached in Redis for future queries.
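The generation step is a generator: every token becomes one NDJSON line. Prompt wording and event fields here are simplified stand-ins for the real ones.

```python
import json

def stream_answer(query: str, docs: list, graph_ctx: str, llm):
    context = "\n\n".join(d.page_content for d in docs) + "\n\n" + graph_ctx
    prompt = (
        "Answer in markdown with citations, using only this context:\n"
        f"{context}\n\nQuestion: {query}"
    )
    parts = []
    for chunk in llm.stream(prompt):                       # token-by-token
        parts.append(chunk.content)
        yield json.dumps({"type": "token", "content": chunk.content}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"
    full_answer = "".join(parts)
    # full_answer is then written to the semantic cache for future queries
```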
6. The Multi-Agent Supervisor — When Actions Are Needed
Not every query is a knowledge question. When a user asks "Show me Q3 revenue data" or "Send an email to the team", the Semantic Router sends it to the Agent path instead.
Here, a second LangGraph classifier — the Supervisor — analyses the query and delegates it to one of four specialized worker agents:
- 📊 Data Analyzer — `query_database`, `generate_chart`, `calc_stats`
- 🎧 Support Agent — `search_kb`, `create_ticket`, `check_sla`
- 🔍 Code Review — `analyze_quality`, `check_compliance`
- ⚡ General Agent — `send_email`, `generate_summary`, `start_workflow`
Each worker is a LangGraph ReAct agent — it reasons about the query, calls its tools, observes the results, and iterates until it has a complete answer. The Supervisor also extracts structured data (chart JSON, table data) from tool outputs and packages them as ui_component events so the frontend can render interactive charts and tables.
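A heavily condensed sketch of the delegation step: the worker set is trimmed to two, the tools are stubs, and the real Supervisor is itself a LangGraph graph rather than a plain function.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def query_database(sql: str) -> str:
    """Run a read-only query and return rows (stub)."""
    return "rows..."

@tool
def create_ticket(summary: str) -> str:
    """Open a support ticket and return its id (stub)."""
    return "TICKET-123"

def build_supervisor(llm):
    workers = {
        "data_analyzer": create_react_agent(llm, tools=[query_database]),
        "support_agent": create_react_agent(llm, tools=[create_ticket]),
    }

    def supervise(query: str):
        # The supervisor LLM picks exactly one worker by name
        choice = llm.invoke(
            f"Which worker should handle this: {list(workers)}? Reply with the name only.\n{query}"
        )
        choice = str(getattr(choice, "content", choice)).strip()
        worker = workers.get(choice, workers["data_analyzer"])
        return worker.invoke({"messages": [("user", query)]})

    return supervise
```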
7. The Streaming Protocol — How the Response Gets Back
This is the glue that connects backend to frontend. The /chat/stream endpoint uses Newline-Delimited JSON (NDJSON) — each line is a self-contained JSON object with a type field.
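On the backend, this is just a FastAPI `StreamingResponse` over a generator, where each yielded line is one event. A bare-bones sketch (the real pipeline yields many more event types, and `get_current_user` is the dependency from earlier):

```python
from fastapi import Depends, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(body: dict, user: dict = Depends(get_current_user)):
    def events():
        yield '{"type": "status", "content": "Routing query..."}\n'
        # ...guardrails, routing, and the RAG/agent pipeline yield their events here...
        yield '{"type": "done"}\n'
    return StreamingResponse(events(), media_type="application/x-ndjson")
```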
The frontend's ChatWindow reads each line and routes it to the right component:
- `status` → AgentTimeline (step-by-step thinking display with spinners)
- `token` → MessageBubble (real-time text with a blinking cursor)
- `sources` → SourcesPanel (collapsible cited documents)
- `ui_component` → DynamicRenderer (interactive charts and tables)
- `artifact` → ArtifactPanel (slide-in side panel with Copy/Download)
- `done` → Stops the loading state
One particularly smart heuristic: if the final response is ≥600 characters and contains markdown headings, it's automatically promoted to an artifact and displayed in the side panel instead of the chat — preventing the chat from being cluttered by long, structured documents.
8. Memory — The System Remembers
After the stream completes, both the user's query and the AI's full response are persisted to the conversation memory. The system uses Redis as the primary store (a list per user, capped at 100 messages) with an automatic SQLite fallback if Redis is unavailable.
On the next query, this history is retrieved and the LLM compresses it into a short summary. This summary is prepended to the new query, giving the AI conversational context without overwhelming the context window.
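The write path is small; the per-user list and the 100-message cap are the interesting details (the key format here is an assumption):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_turn(username: str, role: str, content: str) -> None:
    key = f"chat_history:{username}"                       # one list per user
    try:
        r.rpush(key, json.dumps({"role": role, "content": content}))
        r.ltrim(key, -100, -1)                             # cap at 100 messages
    except redis.exceptions.ConnectionError:
        pass  # in the real system this is where the SQLite fallback kicks in
```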
9. Observability — Everything is Logged
Throughout this entire flow, every step emits structured JSON logs and OpenTelemetry spans. Queries, routes, cache hits, response previews, latencies, PII redactions, guardrail blocks, LLM calls, tool failures — all captured.
This isn't just for debugging. These logs feed into monitoring dashboards and help me understand where the bottlenecks are, what users are asking, and whether the system is performing well.
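In code, this is mostly the span-per-step pattern; the tracer name and attribute keys here are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("enterprise-assistant")

def run_retrieval(query: str, retriever):
    # One span per pipeline step, with the facts we want on the dashboard
    with tracer.start_as_current_span("rag.retrieval") as span:
        span.set_attribute("query.length", len(query))
        docs = retriever.invoke(query)
        span.set_attribute("retrieval.doc_count", len(docs))
        return docs
```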
Putting It All Together
Here's the complete timeline for a typical RAG query:
| Step | What Happens | Time |
|---|---|---|
| User presses Enter | Frontend sends request | ~5ms |
| JWT auth check | Decode + verify token | ~1ms |
| PII scan | Regex pattern matching | ~1ms |
| Injection check | Pattern match + optional LLM | ~1-500ms |
| Semantic routing | LLM classifies query | ~800ms |
| Semantic cache | Embedding + cosine similarity | ~50ms |
| HyDE generation | LLM writes hypothetical doc | ~1.5s |
| Vector retrieval | ChromaDB search | ~200ms |
| Reranking | CrossEncoder scoring | ~100ms |
| Graph enrichment | LLM → Cypher → Neo4j | ~1.5s |
| Final answer | LLM streams token by token | ~3-8s |
| Memory persistence | Save to Redis/SQLite | ~5ms |
Total: ~5-12 seconds for a full pipeline run. On a cache hit: ~50ms.
The next time you type something into a chatbox and watch the answer stream in, remember — there's an entire orchestra playing behind that blinking cursor. Semantic routers deciding where to send your query. Security guardrails scanning for threats. Vector databases searching through millions of embeddings. Knowledge graphs connecting the dots. And a large language model carefully crafting every word of the response.
It's not magic. It's engineering.
Let's build smart. Let's build together.
— Gopal