AI & LLMs · Advanced · May 7, 2026 · 18 min read

What Happens When a User Writes Something in the Chatbox?

You type a question, hit Enter, and the answer streams in word by word. But what actually happens behind the scenes? I built an Enterprise AI Assistant from scratch — here is every single step, from keypress to rendered response.

Author: Gopal
Published: May 7, 2026
Read time: 18 min
Difficulty: Advanced

We've all used ChatGPT, Gemini, or similar AI applications. You type a question, hit Enter, and a few seconds later the answer starts streaming in, word by word. It feels like magic.

But have you ever wondered what actually happens in the backend when you put a query in that chatbox? How does the system decide what to do with your question? How does it find the right information? How does that answer stream back to you in real-time?

I built an Enterprise AI Assistant from scratch, and in this post, I'm going to walk you through every single step — from the moment you press Enter to the final rendered response on your screen.

1. The User Presses Enter — It All Begins Here

It starts on the frontend. The user types their query into a ChatInput component — an auto-resizing textarea that listens for the Enter key. The moment they press Enter (without Shift), the message bubbles up to the ChatWindow orchestrator, which does two things immediately:

  • Renders the user's message as a bubble on the right side of the chat
  • Fires a POST /chat/stream request to the FastAPI backend

The important detail here is that this isn't a normal REST call. It's a streaming request. The frontend opens a persistent connection and reads the response line by line in real-time.

The request carries the query in the body and the user's authentication token as an HttpOnly cookie — meaning JavaScript can't read it, which protects the token from being stolen via XSS. A small but critical security decision.

2. Authentication — Who Are You?

Before the backend processes anything, it needs to know who is asking. A FastAPI dependency called get_current_user() intercepts the request and:

  1. Extracts the JWT from the access_token cookie (or the Authorization header as a fallback)
  2. Decodes and verifies it using python-jose
  3. Extracts the username and role (e.g., employee or admin)

If the token is missing or expired — instant 401 Unauthorized. No exceptions.

Why does the role matter? Because later, during document retrieval, the system filters results based on the user's access level. An employee can only see employee-level documents. An admin sees everything. RBAC baked directly into the retrieval layer.
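The decode-and-verify step can be sketched with the standard library alone. The real system uses python-jose; this minimal HS256 check (with a hypothetical secret and helper names) only illustrates what verification means:

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(part: str) -> bytes:
    # JWTs use unpadded base64url; restore the padding before decoding
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def verify_jwt_hs256(token: str, secret: str) -> dict:
    """Decode and verify an HS256 JWT and return its claims.

    Raises ValueError on a bad signature or an expired token, which
    the endpoint would translate into a 401 Unauthorized.
    """
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret.encode(),
                        f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if "exp" in claims and claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims  # e.g. {"sub": "...", "role": "employee", "exp": ...}
```

The returned claims carry the username and role that the retrieval layer later uses for RBAC filtering.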

3. Security Guardrails — The Two-Layer Shield

This is where things get interesting. Before the query even reaches the AI, it passes through a two-layer security pipeline.

Layer 1: PII Scanner

A regex-based scanner checks the query for Personally Identifiable Information — emails, phone numbers, SSNs, credit card numbers, Aadhaar numbers, PAN cards, IP addresses, and passport numbers. If anything is found, it's replaced with [REDACTED] and the event is logged.

Input:  "My SSN is 123-45-6789, what is the remote work policy?"
Output: "My SSN is [REDACTED], what is the remote work policy?"

The critical design decision here: PII detection is non-blocking. It redacts and warns, but lets the query through. You don't want to block a legitimate question just because the user accidentally included personal data.
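A minimal sketch of the redact-and-continue behaviour (illustrative patterns only; the real scanner covers more PII types such as Aadhaar, PAN, passports, and IPs, with stricter regexes):

```python
import re

# Illustrative subset of the scanner's patterns
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def redact_pii(query: str) -> tuple[str, list[str]]:
    """Replace detected PII with [REDACTED] and return the clean query
    plus the list of categories found (for logging). Non-blocking:
    the caller always continues with the redacted text."""
    found = []
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(query):
            found.append(name)
            query = pattern.sub("[REDACTED]", query)
    return query, found
```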

Layer 2: Prompt Injection Filter

This is the blocking layer. It defends against jailbreak and injection attacks in two stages:

  • Stage 1 — Pattern Matching (~0ms): 15 regex patterns check for known jailbreak phrases like "ignore all previous instructions", "bypass your safety filters", "developer mode", etc. If there's a hard match, the request is blocked instantly.
  • Stage 2 — LLM Classifier (~500ms): If Stage 1 doesn't find a hard match but detects ≥2 suspicious terms, it invokes the LLM itself with a binary classification prompt: SAFE or UNSAFE. This catches the creative attacks that regex misses.

If blocked, the backend immediately streams a ⛔ Request blocked status event and closes the connection. The user sees the block reason in the chat. Clean and immediate.
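The two-stage escalation can be sketched as below, with the LLM classifier stubbed out as a callable (only a few of the 15 hard-block patterns are shown, and the suspicious-term list is illustrative):

```python
import re

# A few of the hard-block patterns; the real filter has 15
BLOCK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"bypass your safety filters", re.I),
    re.compile(r"developer mode", re.I),
]
SUSPICIOUS_TERMS = {"ignore", "bypass", "jailbreak", "system prompt", "pretend"}


def check_injection(query: str, llm_classify=None) -> bool:
    """Return True if the query should be blocked.

    Stage 1: hard regex match -> block instantly (~0ms).
    Stage 2: if >= 2 suspicious terms appear, escalate to an LLM
    classifier (any callable returning 'SAFE' or 'UNSAFE').
    """
    if any(p.search(query) for p in BLOCK_PATTERNS):
        return True
    hits = sum(term in query.lower() for term in SUSPICIOUS_TERMS)
    if hits >= 2 and llm_classify is not None:
        return llm_classify(query) == "UNSAFE"
    return False
```

The design keeps the common case free: a benign query never pays the ~500ms classifier cost.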

Fig. 1 — Two-Layer Security Guardrails: PII is redacted non-blockingly; injection attacks are caught by regex, then escalated to an LLM classifier.

4. Semantic Routing — The Brain Router

Now comes the first real AI decision. The safe, PII-redacted query is sent to a Semantic Router — an LLM-powered classifier built as a LangGraph StateGraph.

The LLM reads the query and classifies it into exactly one of three categories:

| Route | When | Example |
| --- | --- | --- |
| greeting | Casual salutation | "Hi!", "Good morning" |
| rag | Knowledge question | "What is our leave policy?" |
| agent | Action request | "Show me revenue data" |

Think of this as the traffic cop of the entire system. A greeting doesn't need to invoke a multi-billion-parameter model — it just returns a hardcoded "Hello! How can I help you?" instantly. This saves compute and keeps the UX snappy.

Fig. 3 — Semantic Router: Query Classification. The Semantic Router acts as the system's traffic cop, saving compute by short-circuiting greetings.

5. The RAG Pipeline — Finding the Right Answer

This is the most complex path, and it's where I invested the most engineering effort. When a knowledge question comes in, it goes through a 6-step retrieval and generation pipeline.

Fig. 2 — 6-Step RAG Retrieval & Generation Pipeline. The semantic cache short-circuits the pipeline; HyDE, CrossEncoder reranking, and GraphRAG maximise retrieval quality.

Step 1: Semantic Cache

Before doing any heavy lifting, the system checks if a semantically similar question has already been answered. It embeds the query with all-MiniLM-L6-v2, then compares it against all cached embeddings using cosine similarity.

If the similarity is ≥ 0.92 — it's a cache HIT. The cached answer is streamed back instantly, and the frontend shows a ⚡ badge to indicate it came from cache. A full pipeline run takes 5-12 seconds. A cache hit? ~50 milliseconds.
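The cache lookup boils down to a cosine-similarity scan over stored embeddings. A self-contained sketch, with the embedding model abstracted as any text-to-vector callable (the real one is all-MiniLM-L6-v2):

```python
import math

CACHE_THRESHOLD = 0.92


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class SemanticCache:
    def __init__(self, embed):
        self.embed = embed       # text -> fixed-length vector
        self.entries = []        # list of (embedding, answer)

    def lookup(self, query):
        """Return the cached answer if any entry is similar enough."""
        qv = self.embed(query)
        best = max(((cosine(qv, ev), ans) for ev, ans in self.entries),
                   default=(0.0, None))
        return best[1] if best[0] >= CACHE_THRESHOLD else None

    def store(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

The linear scan is fine for a sketch; at scale the comparison would sit behind a vector index.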

Step 2: HyDE (Hypothetical Document Embedding)

This is one of my favourite techniques. The problem with raw user queries is that they're questions — "What is the remote work policy?" — but documents are statements. They don't match well in embedding space.

HyDE fixes this by asking the LLM to generate a hypothetical answer first. The system then searches using both the original query and the hypothetical document combined. This dramatically improves retrieval quality because the search query now looks like the target document.
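The core of HyDE is just two steps: generate, then concatenate. A sketch with the chat model abstracted as a prompt-to-text callable (the prompt wording is illustrative):

```python
def hyde_search_text(query: str, llm) -> str:
    """Build the retrieval text for HyDE.

    The hypothetical answer is generated first, then concatenated
    with the original query so the embedding lands closer to
    statement-style documents in vector space.
    """
    prompt = (
        "Write a short, plausible passage that answers the question. "
        "Do not say you are unsure.\n\nQuestion: " + query
    )
    hypothetical = llm(prompt)
    return f"{query}\n{hypothetical}"
```

The combined text, not the raw question, is what gets embedded for the vector search in the next step.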

Step 3: Vector Retrieval (ParentDocumentRetriever)

This is where the actual document search happens. I use a ParentDocumentRetriever — a two-tier chunking strategy:

  • Child chunks (400 chars) — small, precise pieces that are embedded in ChromaDB using BAAI/bge-small-en
  • Parent chunks (2000 chars) — larger context blocks stored in a local file store

The system searches the small chunks for precision, but returns the larger parent chunks so the LLM gets richer context. It retrieves the top 6 documents, filtered by the user's access_role.
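The two-tier split can be sketched with plain character windows. The real system uses LangChain's ParentDocumentRetriever backed by ChromaDB and a file store; this only illustrates the child-to-parent mapping:

```python
def build_parent_child_chunks(text: str, parent_size=2000, child_size=400):
    """Two-tier chunking: children are embedded for precise search,
    each carrying the id of the parent chunk that is actually
    returned to the LLM as context."""
    parents, children = [], []
    for p_id, start in enumerate(range(0, len(text), parent_size)):
        parent = text[start:start + parent_size]
        parents.append(parent)
        for c_start in range(0, len(parent), child_size):
            children.append({
                "parent_id": p_id,
                "text": parent[c_start:c_start + child_size],
            })
    return parents, children
```

At query time the search hits a child chunk, but the parent with the matching `parent_id` is what the LLM sees.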

Step 4: CrossEncoder Reranking

Vector similarity is a good first pass, but it's not always accurate enough. The CrossEncoder (ms-marco-MiniLM-L-6-v2) takes each (query, document) pair and produces a true relevance score. This is more accurate because it sees the query and document together, not as separate embeddings.

The 6 documents are re-scored, sorted, and only the top 3 move forward.
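The rerank step reduces to score-sort-truncate. A sketch with the scorer abstracted as a callable over (query, document) pairs; the real scorer is sentence-transformers' ms-marco-MiniLM-L-6-v2 CrossEncoder:

```python
def rerank(query: str, docs: list[str], score_pair, top_k: int = 3) -> list[str]:
    """CrossEncoder-style reranking: `score_pair(query, doc)` sees
    both texts together and returns a relevance score; only the
    top_k highest-scoring documents move forward."""
    return sorted(docs, key=lambda d: score_pair(query, d), reverse=True)[:top_k]
```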

Step 5: Knowledge Graph Enrichment (GraphRAG)

Vector search excels at finding semantically similar text. But it struggles with relational questions like "Which department owns the remote work policy?" — facts that are scattered across different documents.

To solve this, I integrated Neo4j as a knowledge graph. The LLM converts the user query into a Cypher query, executes it against the graph, and appends the results as additional context. If Neo4j is unavailable, this step is silently skipped — graceful degradation, not a hard failure.
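The graceful-degradation wrapper can be sketched as below, with the text-to-Cypher LLM step and the Neo4j call both abstracted as callables (names are illustrative):

```python
def graph_enrich(query: str, text_to_cypher, run_cypher) -> str:
    """Knowledge-graph enrichment with graceful degradation.

    `text_to_cypher` is the LLM step that turns the question into a
    Cypher query; `run_cypher` executes it against the graph. If
    either fails (e.g. Neo4j is down), return an empty context
    instead of failing the whole pipeline.
    """
    try:
        cypher = text_to_cypher(query)
        rows = run_cypher(cypher)
        return "\n".join(str(r) for r in rows)
    except Exception:
        return ""   # silently skipped; vector context still answers
```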

Step 6: LLM Generates the Final Answer

Now the LLM has everything it needs: 3 reranked documents from vector search, entity relationships from the knowledge graph, and a carefully structured prompt that requires citations and markdown formatting.

The LLM streams the answer token by token using .stream(). As tokens arrive, they're wrapped as NDJSON events and sent to the frontend in real-time. Once the stream completes, the full answer is cached in Redis for future queries.
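The token-wrapping step is easy to show concretely. A sketch of how each token becomes one self-contained NDJSON line (function names are illustrative; the event shapes mirror the protocol described later):

```python
import json


def ndjson_event(event_type: str, payload: dict) -> str:
    """Wrap one streaming event as a single newline-terminated JSON line."""
    return json.dumps({"type": event_type, **payload}) + "\n"


def stream_answer(token_iter):
    """Yield NDJSON lines for each token, then a final done event,
    the shape a line-by-line frontend reader expects."""
    for tok in token_iter:
        yield ndjson_event("token", {"text": tok})
    yield ndjson_event("done", {})
```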

6. The Multi-Agent Supervisor — When Actions Are Needed

Not every query is a knowledge question. When a user asks "Show me Q3 revenue data" or "Send an email to the team", the Semantic Router sends it to the Agent path instead.

Here, a second LangGraph classifier — the Supervisor — analyses the query and delegates it to one of four specialized worker agents:

  • 📊 Data Analyzer — query_database, generate_chart, calc_stats
  • 🎧 Support Agent — search_kb, create_ticket, check_sla
  • 🔍 Code Review — analyze_quality, check_compliance
  • ⚡ General Agent — send_email, generate_summary, start_workflow

Each worker is a LangGraph ReAct agent — it reasons about the query, calls its tools, observes the results, and iterates until it has a complete answer. The Supervisor also extracts structured data (chart JSON, table data) from tool outputs and packages them as ui_component events so the frontend can render interactive charts and tables.
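The reason-act-observe loop can be sketched without LangGraph at all. Here the LLM is stubbed as a `decide` callable and the tools are plain functions; this is only the loop's shape, not the framework's API:

```python
def react_agent(query: str, decide, tools: dict, max_steps: int = 5) -> str:
    """Minimal ReAct loop: `decide` maps (query, observations) to
    either ('call', tool_name, arg) or ('answer', text). The agent
    calls tools, accumulates observations, and iterates until it
    can answer or hits the step limit."""
    observations = []
    for _ in range(max_steps):
        action = decide(query, observations)
        if action[0] == "answer":
            return action[1]
        _, name, arg = action
        observations.append(tools[name](arg))
    return "Step limit reached."
```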

7. The Streaming Protocol — How the Response Gets Back

This is the glue that connects backend to frontend. The /chat/stream endpoint uses Newline-Delimited JSON (NDJSON) — each line is a self-contained JSON object with a type field.

The frontend's ChatWindow reads each line and routes it to the right component:

  • status → AgentTimeline (step-by-step thinking display with spinners)
  • token → MessageBubble (real-time text with a blinking cursor)
  • sources → SourcesPanel (collapsible cited documents)
  • ui_component → DynamicRenderer (interactive charts and tables)
  • artifact → ArtifactPanel (slide-in side panel with Copy/Download)
  • done → Stops the loading state

One particularly smart heuristic: if the final response is ≥600 characters and contains markdown headings, it's automatically promoted to an artifact and displayed in the side panel instead of the chat — preventing the chat from being cluttered by long, structured documents.
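The heuristic itself fits in a few lines (the 600-character threshold comes from the text above; the function name is illustrative):

```python
import re


def should_promote_to_artifact(text: str, min_chars: int = 600) -> bool:
    """Promote long AND structured responses to the side panel.

    'Structured' here means at least one markdown heading: a line
    starting with one to six '#' characters followed by a space."""
    has_heading = re.search(r"^#{1,6} ", text, re.MULTILINE) is not None
    return len(text) >= min_chars and has_heading
```

Both conditions must hold: a long plain paragraph stays in the chat, and a short headed snippet does too.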

8. Memory — The System Remembers

After the stream completes, both the user's query and the AI's full response are persisted to the conversation memory. The system uses Redis as the primary store (a list per user, capped at 100 messages) with an automatic SQLite fallback if Redis is unavailable.

On the next query, this history is retrieved and the LLM compresses it into a short summary. This summary is prepended to the new query, giving the AI conversational context without overwhelming the context window.
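The capped-history behaviour can be sketched with an in-memory dict playing the part of Redis (the real store uses a Redis list with trim-on-write and a SQLite fallback; class and method names here are illustrative):

```python
import json


class ConversationMemory:
    """Capped per-user history, mirroring a Redis list that is
    trimmed after every push."""
    MAX_MESSAGES = 100

    def __init__(self):
        self.store = {}   # user -> list of JSON-encoded messages

    def append(self, user: str, role: str, content: str) -> None:
        history = self.store.setdefault(user, [])
        history.append(json.dumps({"role": role, "content": content}))
        # keep only the newest MAX_MESSAGES entries, like LTRIM
        del history[:-self.MAX_MESSAGES]

    def history(self, user: str) -> list[dict]:
        return [json.loads(m) for m in self.store.get(user, [])]
```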

9. Observability — Everything Is Logged

Throughout this entire flow, every step emits structured JSON logs and OpenTelemetry spans. Queries, routes, cache hits, response previews, latencies, PII redactions, guardrail blocks, LLM calls, tool failures — all captured.

This isn't just for debugging. These logs feed into monitoring dashboards and help me understand where the bottlenecks are, what users are asking, and whether the system is performing well.

10. Putting It All Together

Here's the complete timeline for a typical RAG query:

| Step | What Happens | Time |
| --- | --- | --- |
| User presses Enter | Frontend sends request | ~5ms |
| JWT auth check | Decode + verify token | ~1ms |
| PII scan | Regex pattern matching | ~1ms |
| Injection check | Pattern match + optional LLM | ~1-500ms |
| Semantic routing | LLM classifies query | ~800ms |
| Semantic cache | Embedding + cosine similarity | ~50ms |
| HyDE generation | LLM writes hypothetical doc | ~1.5s |
| Vector retrieval | ChromaDB search | ~200ms |
| Reranking | CrossEncoder scoring | ~100ms |
| Graph enrichment | LLM → Cypher → Neo4j | ~1.5s |
| Final answer | LLM streams token by token | ~3-8s |
| Memory persistence | Save to Redis/SQLite | ~5ms |

Total: ~5-12 seconds for a full pipeline run. On a cache hit: ~50ms.

The next time you type something into a chatbox and watch the answer stream in, remember — there's an entire orchestra playing behind that blinking cursor. Semantic routers deciding where to send your query. Security guardrails scanning for threats. Vector databases searching through millions of embeddings. Knowledge graphs connecting the dots. And a large language model carefully crafting every word of the response.

It's not magic. It's engineering.

Let's build smart. Let's build together.

— Gopal

© 2026 Ai TechSavvy. All rights reserved. Crafted by Gopal Kumar.