We've all used ChatGPT, Gemini, or similar AI applications. You type a question, hit Enter, and a few seconds later the answer starts streaming in, word by word. It feels like magic.
But have you ever wondered what actually happens in the backend when you put a query in that chatbox? How does the system decide what to do with your question? How does it find the right information? How does that answer stream back to you in real-time?
I built an Enterprise AI Assistant from scratch, and in this post, I'm going to walk you through every single step — from the moment you press Enter to the final rendered response on your screen.
1. The User Presses Enter — It All Begins Here
It starts on the frontend. The user types their query into a ChatInput component — an auto-resizing textarea that listens for the Enter key. The moment they press Enter (without Shift), the message bubbles up to the ChatWindow orchestrator, which does two things immediately:
- Renders the user's message as a bubble on the right side of the chat
- Fires a `POST /chat/stream` request to the FastAPI backend
The important detail here is that this isn't a normal REST call. It's a streaming request. The frontend opens a persistent connection and reads the response line by line in real-time.
The request carries the query in the body and the user's authentication token as an HttpOnly cookie — meaning JavaScript can't access it, so the token can't be stolen through an XSS attack. A small but critical security decision.
2. Authentication — Who Are You?
Before the backend processes anything, it needs to know who is asking. A FastAPI dependency called get_current_user() intercepts the request and:
- Extracts the JWT from the `access_token` cookie (or the `Authorization` header as a fallback)
- Decodes and verifies it using `python-jose`
- Extracts the username and role (e.g., `employee` or `admin`)
If the token is missing or expired — instant 401 Unauthorized. No exceptions.
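Here's a minimal sketch of what that dependency looks like. The cookie and header names match the description above; the secret, algorithm, and error messages are placeholders rather than the production values.

```python
from fastapi import Cookie, Header, HTTPException
from jose import JWTError, jwt

SECRET_KEY = "change-me"   # placeholder: the real key lives in config/env
ALGORITHM = "HS256"        # assumption: symmetric signing

async def get_current_user(
    access_token: str | None = Cookie(default=None),
    authorization: str | None = Header(default=None),
) -> dict:
    # Prefer the HttpOnly cookie; fall back to an "Authorization: Bearer ..." header
    token = access_token or (
        authorization.removeprefix("Bearer ").strip() if authorization else None
    )
    if not token:
        raise HTTPException(status_code=401, detail="Not authenticated")
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    # The role is what drives RBAC filtering later in the retrieval layer
    return {"username": payload.get("sub"), "role": payload.get("role", "employee")}
```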
Why does the role matter? Because later, during document retrieval, the system filters results based on the user's access level. An employee can only see employee-level documents. An admin sees everything. RBAC baked directly into the retrieval layer.
3. Security Guardrails — The Two-Layer Shield
This is where things get interesting. Before the query even reaches the AI, it passes through a two-layer security pipeline.
Layer 1: PII Scanner
A regex-based scanner checks the query for Personally Identifiable Information — emails, phone numbers, SSNs, credit card numbers, Aadhaar numbers, PAN cards, IP addresses, and passport numbers. If anything is found, it's replaced with [REDACTED] and the event is logged.
Input: "My SSN is 123-45-6789, what is the remote work policy?"
Output: "My SSN is [REDACTED], what is the remote work policy?"
The critical design decision here: PII detection is non-blocking. It redacts and warns, but lets the query through. You don't want to block a legitimate question just because the user accidentally included personal data.
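A stripped-down sketch of the scanner, with only a few illustrative patterns (the real list covers all the categories above):

```python
import re

# Illustrative subset of the PII patterns; the full scanner covers many more
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact_pii(query: str) -> tuple[str, list[str]]:
    """Replace detected PII with [REDACTED] and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(query):
            found.append(label)                  # logged, but never blocks
            query = pattern.sub("[REDACTED]", query)
    return query, found

# redact_pii("My SSN is 123-45-6789, what is the remote work policy?")
# -> ("My SSN is [REDACTED], what is the remote work policy?", ["ssn"])
```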
Layer 2: Prompt Injection Filter
This is the blocking layer. It defends against jailbreak and injection attacks in two stages:
- Stage 1 — Pattern Matching (~0ms): 15 regex patterns check for known jailbreak phrases like "ignore all previous instructions", "bypass your safety filters", "developer mode", etc. If there's a hard match, the request is blocked instantly.
- Stage 2 — LLM Classifier (~500ms): If Stage 1 doesn't find a hard match but detects ≥2 suspicious terms, it invokes the LLM itself with a binary classification prompt: `SAFE` or `UNSAFE`. This catches the creative attacks that regex misses.
If blocked, the backend immediately streams a ⛔ Request blocked status event and closes the connection. The user sees the block reason in the chat. Clean and immediate.
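A condensed sketch of the two stages. The pattern list is a small sample, and `llm` is assumed to be an already-initialised chat model whose reply can be read as text:

```python
import re

BLOCK_PATTERNS = [                      # sample of the ~15 hard-block patterns
    r"ignore (all )?previous instructions",
    r"developer mode",
    r"bypass .{0,20}safety",
]
SUSPICIOUS_TERMS = ["system prompt", "jailbreak", "pretend you are", "no restrictions"]

def is_blocked(query: str, llm) -> bool:
    lowered = query.lower()
    # Stage 1: hard regex match -> block instantly (~0ms)
    if any(re.search(p, lowered) for p in BLOCK_PATTERNS):
        return True
    # Stage 2: only pay the ~500ms LLM cost when >= 2 suspicious terms appear
    if sum(term in lowered for term in SUSPICIOUS_TERMS) >= 2:
        verdict = llm.invoke(
            f"Classify this message as SAFE or UNSAFE. Reply with one word.\n\n{query}"
        )
        return "UNSAFE" in str(getattr(verdict, "content", verdict)).upper()
    return False
```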
4. Semantic Routing — The Brain Router
Now comes the first real AI decision. The safe, PII-redacted query is sent to a Semantic Router — an LLM-powered classifier built as a LangGraph StateGraph.
The LLM reads the query and classifies it into exactly one of three categories:
| Route | When | Example |
|---|---|---|
| `greeting` | Casual salutation | "Hi!", "Good morning" |
| `rag` | Knowledge question | "What is our leave policy?" |
| `agent` | Action request | "Show me revenue data" |
Think of this as the traffic cop of the entire system. A greeting doesn't need to invoke a multi-billion-parameter model — it just returns a hardcoded "Hello! How can I help you?" instantly. This saves compute and keeps the UX snappy.
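As a rough sketch, the router graph is tiny: one classification node and an end state. The prompt wording and the `llm` object are assumptions, and the real graph carries more state than shown here.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RouterState(TypedDict):
    query: str
    route: str   # "greeting" | "rag" | "agent"

def build_router(llm):
    def classify(state: RouterState) -> dict:
        label = llm.invoke(
            "Classify the user message into exactly one of: greeting, rag, agent.\n"
            f"Message: {state['query']}\nAnswer with the label only."
        )
        return {"route": str(getattr(label, "content", label)).strip().lower()}

    graph = StateGraph(RouterState)
    graph.add_node("classify", classify)
    graph.set_entry_point("classify")
    graph.add_edge("classify", END)
    return graph.compile()

# router = build_router(llm)
# router.invoke({"query": "What is our leave policy?", "route": ""})  # -> route: "rag"
```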
5. The RAG Pipeline — Finding the Right Answer
This is the most complex path, and it's where I invested the most engineering effort. When a knowledge question comes in, it goes through a 6-step retrieval and generation pipeline.
Step 1: Semantic Cache
Before doing any heavy lifting, the system checks if a semantically similar question has already been answered. It embeds the query with all-MiniLM-L6-v2, then compares it against all cached embeddings using cosine similarity.
If the similarity is ≥ 0.92 — it's a cache HIT. The cached answer is streamed back instantly, and the frontend shows a ⚡ badge to indicate it came from cache. A full pipeline run takes 5-12 seconds. A cache hit? ~50 milliseconds.
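In-memory pseudocode for the idea (the real cache lives in Redis); the model name and the 0.92 threshold are the ones mentioned above:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
_cache: list[tuple[object, str]] = []        # (query embedding, cached answer)

def cache_lookup(query: str, threshold: float = 0.92) -> str | None:
    q_emb = embedder.encode(query, convert_to_tensor=True)
    for emb, answer in _cache:
        if util.cos_sim(q_emb, emb).item() >= threshold:
            return answer                    # cache HIT: skip the whole pipeline
    return None

def cache_store(query: str, answer: str) -> None:
    _cache.append((embedder.encode(query, convert_to_tensor=True), answer))
```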
Step 2: HyDE (Hypothetical Document Embedding)
This is one of my favourite techniques. The problem with raw user queries is that they're questions — "What is the remote work policy?" — but documents are statements. They don't match well in embedding space.
HyDE fixes this by asking the LLM to generate a hypothetical answer first. The system then searches using both the original query and the hypothetical document combined. This dramatically improves retrieval quality because the search query now looks like the target document.
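The whole step fits in a few lines. Here `llm` and `retriever` are assumed to be the chat model and the retriever described in the next step:

```python
def hyde_search(query: str, llm, retriever):
    # Ask the model to write a plausible answer first...
    hypothetical = llm.invoke(
        f"Write a short passage that plausibly answers this question:\n{query}"
    )
    hypothetical = str(getattr(hypothetical, "content", hypothetical))
    # ...then search with query + hypothetical combined, so the search text
    # "looks like" the documents we want to find in embedding space.
    return retriever.invoke(f"{query}\n\n{hypothetical}")
```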
Step 3: Vector Retrieval (ParentDocumentRetriever)
This is where the actual document search happens. I use a ParentDocumentRetriever — a two-tier chunking strategy:
- Child chunks (400 chars) — small, precise pieces that are embedded in ChromaDB using `BAAI/bge-small-en`
- Parent chunks (2000 chars) — larger context blocks stored in a local file store
The system searches the small chunks for precision, but returns the larger parent chunks so the LLM gets richer context. It retrieves the top 6 documents, filtered by the user's access_role.
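Wiring this up with LangChain looks roughly like this; import paths vary between LangChain versions, and the storage path is a placeholder:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore, create_kv_docstore
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="docs", embedding_function=embeddings),
    docstore=create_kv_docstore(LocalFileStore("./parent_docs")),    # parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),   # indexed chunks
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
    search_kwargs={"k": 6},   # role-based metadata filtering would be added here
)
```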
Step 4: CrossEncoder Reranking
Vector similarity is a good first pass, but it's not always accurate enough. The CrossEncoder (ms-marco-MiniLM-L-6-v2) takes each (query, document) pair and produces a true relevance score. This is more accurate because it sees the query and document together, not as separate embeddings.
The 6 documents are re-scored, sorted, and only the top 3 move forward.
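The rerank step itself is short; `docs` are assumed to be LangChain documents with a `page_content` attribute:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list, top_k: int = 3) -> list:
    # Score each (query, document) pair jointly, which is what makes the
    # CrossEncoder more accurate than comparing separate embeddings
    scores = reranker.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```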
Step 5: Knowledge Graph Enrichment (GraphRAG)
Vector search excels at finding semantically similar text. But it struggles with relational questions like "Which department owns the remote work policy?" — facts that are scattered across different documents.
To solve this, I integrated Neo4j as a knowledge graph. The LLM converts the user query into a Cypher query, executes it against the graph, and appends the results as additional context. If Neo4j is unavailable, this step is silently skipped — graceful degradation, not a hard failure.
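A sketch of that step, with the connection details and Cypher-generation prompt as placeholders. The important part is the try/except: any failure simply returns empty context instead of killing the request.

```python
from neo4j import GraphDatabase

def graph_context(query: str, llm, uri="bolt://localhost:7687",
                  auth=("neo4j", "password")) -> str:
    try:
        cypher = llm.invoke(
            f"Translate this question into a single read-only Cypher query:\n{query}"
        )
        cypher = str(getattr(cypher, "content", cypher))
        with GraphDatabase.driver(uri, auth=auth) as driver:
            records, _, _ = driver.execute_query(cypher)
        return "\n".join(str(record.data()) for record in records)
    except Exception:
        return ""   # Neo4j down or bad Cypher: skip enrichment, don't fail
```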
Step 6: LLM Generates the Final Answer
Now the LLM has everything it needs: 3 reranked documents from vector search, entity relationships from the knowledge graph, and a carefully structured prompt that requires citations and markdown formatting.
The LLM streams the answer token by token using .stream(). As tokens arrive, they're wrapped as NDJSON events and sent to the frontend in real-time. Once the stream completes, the full answer is cached in Redis for future queries.
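The generation step is a generator: every token becomes one NDJSON line. Prompt wording and event fields here are simplified stand-ins for the real ones.

```python
import json

def stream_answer(query: str, docs: list, graph_ctx: str, llm):
    context = "\n\n".join(d.page_content for d in docs) + "\n\n" + graph_ctx
    prompt = (
        "Answer in markdown with citations, using only this context:\n"
        f"{context}\n\nQuestion: {query}"
    )
    parts = []
    for chunk in llm.stream(prompt):                       # token-by-token
        parts.append(chunk.content)
        yield json.dumps({"type": "token", "content": chunk.content}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"
    full_answer = "".join(parts)
    # full_answer is then written to the semantic cache for future queries
```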
6. The Multi-Agent Supervisor — When Actions Are Needed
Not every query is a knowledge question. When a user asks "Show me Q3 revenue data" or "Send an email to the team", the Semantic Router sends it to the Agent path instead.
Here, a second LangGraph classifier — the Supervisor — analyses the query and delegates it to one of four specialized worker agents:
- 📊 Data Analyzer — `query_database`, `generate_chart`, `calc_stats`
- 🎧 Support Agent — `search_kb`, `create_ticket`, `check_sla`
- 🔍 Code Review — `analyze_quality`, `check_compliance`
- ⚡ General Agent — `send_email`, `generate_summary`, `start_workflow`
Each worker is a LangGraph ReAct agent — it reasons about the query, calls its tools, observes the results, and iterates until it has a complete answer. The Supervisor also extracts structured data (chart JSON, table data) from tool outputs and packages them as ui_component events so the frontend can render interactive charts and tables.
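A heavily condensed sketch of the delegation step: the worker set is trimmed to two, the tools are stubs, and the real Supervisor is itself a LangGraph graph rather than a plain function.

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def query_database(sql: str) -> str:
    """Run a read-only query and return rows (stub)."""
    return "rows..."

@tool
def create_ticket(summary: str) -> str:
    """Open a support ticket and return its id (stub)."""
    return "TICKET-123"

def build_supervisor(llm):
    workers = {
        "data_analyzer": create_react_agent(llm, tools=[query_database]),
        "support_agent": create_react_agent(llm, tools=[create_ticket]),
    }

    def supervise(query: str):
        # The supervisor LLM picks exactly one worker by name
        choice = llm.invoke(
            f"Which worker should handle this: {list(workers)}? Reply with the name only.\n{query}"
        )
        choice = str(getattr(choice, "content", choice)).strip()
        worker = workers.get(choice, workers["data_analyzer"])
        return worker.invoke({"messages": [("user", query)]})

    return supervise
```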
7. The Streaming Protocol — How the Response Gets Back
This is the glue that connects backend to frontend. The /chat/stream endpoint uses Newline-Delimited JSON (NDJSON) — each line is a self-contained JSON object with a type field.
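On the backend, this is just a FastAPI `StreamingResponse` over a generator, where each yielded line is one event. A bare-bones sketch (the real pipeline yields many more event types, and `get_current_user` is the dependency from earlier):

```python
from fastapi import Depends, FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat/stream")
async def chat_stream(body: dict, user: dict = Depends(get_current_user)):
    def events():
        yield '{"type": "status", "content": "Routing query..."}\n'
        # ...guardrails, routing, and the RAG/agent pipeline yield their events here...
        yield '{"type": "done"}\n'
    return StreamingResponse(events(), media_type="application/x-ndjson")
```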
The frontend's ChatWindow reads each line and routes it to the right component:
- `status` → AgentTimeline (step-by-step thinking display with spinners)
- `token` → MessageBubble (real-time text with a blinking cursor)
- `sources` → SourcesPanel (collapsible cited documents)
- `ui_component` → DynamicRenderer (interactive charts and tables)
- `artifact` → ArtifactPanel (slide-in side panel with Copy/Download)
- `done` → Stops the loading state
One particularly smart heuristic: if the final response is ≥600 characters and contains markdown headings, it's automatically promoted to an artifact and displayed in the side panel instead of the chat — preventing the chat from being cluttered by long, structured documents.
8. Memory — The System Remembers
After the stream completes, both the user's query and the AI's full response are persisted to the conversation memory. The system uses Redis as the primary store (a list per user, capped at 100 messages) with an automatic SQLite fallback if Redis is unavailable.
On the next query, this history is retrieved and the LLM compresses it into a short summary. This summary is prepended to the new query, giving the AI conversational context without overwhelming the context window.
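The write path is small; the per-user list and the 100-message cap are the interesting details (the key format here is an assumption):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_turn(username: str, role: str, content: str) -> None:
    key = f"chat_history:{username}"                       # one list per user
    try:
        r.rpush(key, json.dumps({"role": role, "content": content}))
        r.ltrim(key, -100, -1)                             # cap at 100 messages
    except redis.exceptions.ConnectionError:
        pass  # in the real system this is where the SQLite fallback kicks in
```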
9. Observability — Everything is Logged
Throughout this entire flow, every step emits structured JSON logs and OpenTelemetry spans. Queries, routes, cache hits, response previews, latencies, PII redactions, guardrail blocks, LLM calls, tool failures — all captured.
This isn't just for debugging. These logs feed into monitoring dashboards and help me understand where the bottlenecks are, what users are asking, and whether the system is performing well.
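In code, this is mostly the span-per-step pattern; the tracer name and attribute keys here are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("enterprise-assistant")

def run_retrieval(query: str, retriever):
    # One span per pipeline step, with the facts we want on the dashboard
    with tracer.start_as_current_span("rag.retrieval") as span:
        span.set_attribute("query.length", len(query))
        docs = retriever.invoke(query)
        span.set_attribute("retrieval.doc_count", len(docs))
        return docs
```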
Putting It All Together
Here's the complete timeline for a typical RAG query:
| Step | What Happens | Time |
|---|---|---|
| User presses Enter | Frontend sends request | ~5ms |
| JWT auth check | Decode + verify token | ~1ms |
| PII scan | Regex pattern matching | ~1ms |
| Injection check | Pattern match + optional LLM | ~1-500ms |
| Semantic routing | LLM classifies query | ~800ms |
| Semantic cache | Embedding + cosine similarity | ~50ms |
| HyDE generation | LLM writes hypothetical doc | ~1.5s |
| Vector retrieval | ChromaDB search | ~200ms |
| Reranking | CrossEncoder scoring | ~100ms |
| Graph enrichment | LLM → Cypher → Neo4j | ~1.5s |
| Final answer | LLM streams token by token | ~3-8s |
| Memory persistence | Save to Redis/SQLite | ~5ms |
Total: ~5-12 seconds for a full pipeline run. On a cache hit: ~50ms.
The next time you type something into a chatbox and watch the answer stream in, remember — there's an entire orchestra playing behind that blinking cursor. Semantic routers deciding where to send your query. Security guardrails scanning for threats. Vector databases searching through millions of embeddings. Knowledge graphs connecting the dots. And a large language model carefully crafting every word of the response.
It's not magic. It's engineering.
Let's build smart. Let's build together.
— Gopal