The Hustling Engineer

GenAI for Engineers (Part 3: Real-World Applications with RAG)

Hemant Pandey
Sep 17, 2025
From: https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag

Free Monthly Masterclasses for Paid Subscribers

I am launching a monthly Masterclass Series beginning in October.

These sessions will feature senior engineers and industry leaders, offering exclusive insights and learnings. Access will be free for all paid subscribers.

If you have been considering an upgrade to paid, now is the ideal time to join and secure access to this exclusive series and the other paid benefits.



In Part 1, we explored how LLMs operate internally.
In Part 2, we treated prompts as API contracts, chaining them into reliable mini-systems.

If you haven’t gone through Part 1 and Part 2, I highly recommend reading them first for a sequential learning path.

But here’s the limitation:

  • LLMs are frozen at their training cut-off.

  • They don’t know your company’s internal docs, fresh GitHub issues, or yesterday’s incident logs.

To make GenAI useful beyond toy demos, you need to ground it in your own data.

That’s where Retrieval-Augmented Generation (RAG) comes in.


1. What is RAG (and why it matters)?

At a high level:

From: https://gpt-index.readthedocs.io/en/latest/getting_started/concepts.html

Instead of asking the model directly:

“What does error ERR-504 mean?”

…you:

  1. Search your documentation for that error.

  2. Pull the matching snippet.

  3. Inject it into the LLM’s prompt.

  4. Get a grounded answer.

The LLM isn’t “remembering” ERR-504.

It’s just reading the retrieved context and summarizing.
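
To make the flow concrete, here is a toy sketch of steps 1–3 in Python. The docs dictionary, the search_docs helper, and the ERR-504 description are all made up for illustration, and the keyword search stands in for the embedding-based retrieval covered in the next section.

    # Toy retrieve-then-inject: a tiny in-memory "doc store", a naive keyword
    # search, and the grounded prompt that actually gets sent to the LLM.
    # The ERR-504 text below is invented for the example.

    docs = {
        "errors.md#err-504": "ERR-504: upstream service timed out before responding; "
                             "retry with backoff or raise the gateway timeout.",
        "errors.md#err-401": "ERR-401: request is missing valid authentication credentials.",
    }

    def search_docs(query: str, top_k: int = 1) -> list[str]:
        # Naive keyword overlap; a real pipeline would use embeddings (see section 2).
        scored = sorted(
            docs.values(),
            key=lambda text: sum(w.lower() in text.lower() for w in query.split()),
            reverse=True,
        )
        return scored[:top_k]

    question = "What does error ERR-504 mean?"
    context = "\n".join(search_docs(question))

    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    print(prompt)  # this grounded prompt, not the bare question, is what the LLM sees

The model answers from that injected context; swap the keyword search for embedding search and you have the standard RAG loop.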

Why engineers should care:

  • Freshness → No retraining needed for every data update

  • Scalability → You can index TBs of docs, but only fetch what matters

  • Reliability → Hallucinations drop when answers are anchored in context

  • Cost control → You inject 2–3 chunks instead of dumping the whole wiki into the prompt

RAG turns an LLM from a clever guesser into a grounded retrieval system.


2. How retrieval works (the technical flow)

RAG pipelines have a standard structure (a minimal end-to-end code sketch follows this list):

  1. Chunking

    • Split docs into smaller units (e.g., ~500–1000 tokens).

    • Embeddings are length-limited, and smaller chunks improve recall.

    • Trade-off: Too small = context loss; too large = wasted tokens.

  2. Embedding

    • Convert text chunks into high-dimensional vectors.

    • Example: “Database connection failed” → [0.23, -0.11, 0.56, …].

    • Embedding models capture semantic similarity, not just keywords.

  3. Indexing

    • Store vectors in a vector database: Pinecone, Weaviate, FAISS, Milvus, or pgvector.

    • These DBs optimize similarity search in high dimensions.

  4. Retrieval

    • For a query, embed it → search the vector DB → return top-k similar chunks.

  5. Augmentation

    • Insert retrieved chunks into the LLM prompt.

    • Example system prompt:
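
One illustrative template (the wording is a sketch, not a fixed format; {retrieved_chunks} and {user_question} are placeholders you fill at runtime):

    You are a documentation assistant. Answer the user's question using ONLY
    the context below. If the context does not contain the answer, say you
    don't know.

    Context:
    {retrieved_chunks}

    Question: {user_question}

Putting the five steps together, here is a minimal end-to-end sketch in Python. It assumes the sentence-transformers and faiss packages are installed; the model name, chunk size, and the runbook.md source file are illustrative choices, not requirements.

    from sentence_transformers import SentenceTransformer
    import faiss
    import numpy as np

    # 1. Chunking: naive fixed-size word splits as a rough proxy for a token
    #    budget (real pipelines usually split on headings/sentences and overlap).
    def chunk(text: str, size: int = 200) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    chunks = chunk(open("runbook.md").read())  # hypothetical source document

    # 2. Embedding: convert each chunk into a dense vector.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(chunks, normalize_embeddings=True)

    # 3. Indexing: store vectors in FAISS (inner product equals cosine
    #    similarity once vectors are normalized).
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype="float32"))

    # 4. Retrieval: embed the query and fetch the top-k most similar chunks.
    query = "What does error ERR-504 mean?"
    q_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), 3)
    retrieved = [chunks[i] for i in ids[0] if i >= 0]

    # 5. Augmentation: inject the retrieved chunks into the system prompt.
    system_prompt = (
        "You are a documentation assistant. Answer ONLY from the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        "Context:\n" + "\n---\n".join(retrieved)
    )
    # Send system_prompt plus the user's question to whichever LLM API you use.

The same flow works with Pinecone, Weaviate, Milvus, or pgvector in place of FAISS; only the indexing and search calls change.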
