Free Monthly Masterclasses for Paid Subscribers
I am launching a monthly Masterclass Series beginning in October.
These sessions will feature senior engineers and industry leaders, offering exclusive insights and lessons. Access will be free for all paid subscribers.
If you have been considering an upgrade to paid, now is the ideal time to join and secure access to this exclusive series along with all other paid benefits.
In Part 1, we explored how LLMs operate internally.
In Part 2, we treated prompts as API contracts, chaining them into reliable mini-systems.
If you haven’t gone through Part 1 and Part 2, I highly recommend reading them first so this part builds on them sequentially.
But here’s the limitation:
LLMs are frozen at their training cut-off.
They don’t know your company’s internal docs, fresh GitHub issues, or yesterday’s incident logs.
To make GenAI useful beyond toy demos, you need to ground it in your own data.
That’s where Retrieval-Augmented Generation (RAG) comes in.
1. What is RAG (and why it matters)?
At a high level:
Instead of asking the model directly:
“What does error ERR-504 mean?”
…you:
Search your documentation for that error.
Pull the matching snippet.
Inject it into the LLM’s prompt.
Get a grounded answer.
The LLM isn’t “remembering” ERR-504.
It’s just reading the retrieved context and summarizing.
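Here is a minimal sketch of that flow in Python. The document store and keyword "search" are toy stand-ins, and call_llm is a hypothetical placeholder for whatever LLM client you actually use.

```python
# A minimal sketch of the flow above, with a toy in-memory corpus and
# keyword "search". `call_llm` is a hypothetical stand-in for your LLM client.

DOCS = [
    "ERR-504: upstream service timed out before responding.",
    "ERR-401: request was rejected because the auth token expired.",
]

def search_docs(query: str, top_k: int = 1) -> list[str]:
    # Toy keyword match; Section 2 replaces this with embedding search.
    return [doc for doc in DOCS if any(word in doc for word in query.split())][:top_k]

def answer_with_rag(question: str) -> str:
    # Inject the retrieved snippet into the prompt, then let the LLM summarize it.
    context = "\n".join(search_docs(question))
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)  # hypothetical: your OpenAI/Anthropic/etc. client
```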
Why engineers should care:
Freshness → No retraining needed for every data update
Scalability → You can index TBs of docs, but only fetch what matters
Reliability → Hallucinations drop when answers are anchored in context
Cost control → You inject 2–3 chunks instead of dumping the whole wiki into the prompt
RAG turns an LLM from a clever guesser into a grounded answering system.
2. How retrieval works (the technical flow)
RAG pipelines have a standard structure:
Chunking
Split docs into smaller units (e.g., ~500–1000 tokens).
Embeddings are length-limited, and smaller chunks improve recall.
Trade-off: Too small = context loss; too large = wasted tokens.
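A simple chunker might look like the sketch below. It counts characters rather than tokens to keep the idea visible (roughly 4 characters per token), and the sizes are illustrative, not recommendations; the source file name is hypothetical.

```python
# Sketch of a fixed-size chunker with overlap. ~2000 characters is roughly
# 500 tokens; the overlap keeps sentences from being cut off at boundaries.
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text(open("runbook.md").read())  # hypothetical source document
```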
Embedding
Convert text chunks into high-dimensional vectors.
Example: “Database connection failed” →
[0.23, -0.11, 0.56, …]
Embedding models capture semantic similarity, not just keywords.
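As a sketch, here is what that step looks like with the open-source sentence-transformers library (assumed to be installed); hosted embedding APIs such as OpenAI's work the same way conceptually.

```python
# Sketch: turn text into a vector with a local embedding model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

vector = model.encode("Database connection failed")
print(vector.shape)  # (384,)
print(vector[:3])    # first few dimensions of the embedding
```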
Indexing
Store vectors in a vector database: Pinecone, Weaviate, FAISS, Milvus, or pgvector.
These DBs optimize similarity search in high dimensions.
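For a local sketch, FAISS can index the chunk vectors in-process; a hosted vector DB exposes the same add/search operations behind an API. This assumes the model and chunks from the previous steps.

```python
# Sketch: build an exact-search FAISS index over the chunk embeddings.
import numpy as np
import faiss

dim = 384  # must match the embedding model's output dimension
index = faiss.IndexFlatL2(dim)  # exact L2 search; fine for modest corpora

chunk_vectors = model.encode(chunks)  # `model` and `chunks` from the sketches above
index.add(np.asarray(chunk_vectors, dtype="float32"))
```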
Retrieval
For a query, embed it → search the vector DB → return top-k similar chunks.
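Continuing the FAISS sketch, retrieval is just embedding the query with the same model and asking the index for its nearest neighbors:

```python
# Sketch: embed the query, then fetch the top-k most similar chunks.
query = "What does error ERR-504 mean?"
query_vector = model.encode([query])

distances, ids = index.search(np.asarray(query_vector, dtype="float32"), 3)  # top-3
top_chunks = [chunks[i] for i in ids[0]]  # these get injected into the prompt
```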
Augmentation
Insert retrieved chunks into the LLM prompt.
Example system prompt: