The Hustling Engineer

GenAI for Engineers (Part 1: The Foundations)

Hemant Pandey
Sep 10, 2025

Before we start, I would just like to signal boost my new digital product, “LinkedIn Playbook for Engineers & Founders.” You get a 10% discount for being a newsletter subscriber, which you can redeem using this link.

This is the first part of a 4-part guide on “GenAI for Engineers.”

Intro

Everyone’s talking about GenAI.
Most demos look magical.
But once you peek under the hood, you realize: it’s not magic, it’s just statistics, probability, and a ton of GPUs.

I am writing a 4-part guide that will take you from knowing how to call an API → designing GenAI-powered systems you’d trust in production.

To be honest, this is not for non-technical readers; there are plenty of resources on the internet for writing better prompts, building agents, and so on. It is for software engineers who are interested in the foundations of LLMs.

I am not going to cover the very basics of what LLMs and GenAI are. If anything feels too complex, pause and ask ChatGPT to explain it to you.

Let’s start with Part 1: the foundations.


1. How LLMs actually “think”

At the core of modern GenAI are Large Language Models (LLMs).

They don’t have knowledge graphs in their heads.
They don’t “reason” like humans.
They are giant next-token predictors trained on massive datasets.

When you prompt a model, it’s basically asking:

“Given all the text I’ve seen during training, what is the most likely next token?”

Example:
Input: “I’m going to make scrambled …”
Possible outputs:

  • eggs (highest probability)

  • tofu (lower, but still likely)

  • rats (technically possible if the dataset included weird jokes, but with very little probability).

That’s it. That’s the entire mechanism.
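To make this concrete, here is a toy sketch of that sampling step in Python. The probabilities are made up for illustration; a real model scores every token in a vocabulary of tens of thousands, but the picking step looks like this:

```python
import random

# Made-up probabilities the model might assign to the next token after
# the prompt "I'm going to make scrambled ..."
next_token_probs = {
    "eggs": 0.92,
    "tofu": 0.07,
    "rats": 0.01,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))  # almost always prints "eggs"
```

Generation is just this step in a loop: append the sampled token to the input, predict the next one, and repeat until the model emits a stop token or hits a length limit.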

The power comes from scale:

Billions of parameters + trillions of tokens = surprisingly “intelligent” behavior

👉 For a more visual and detailed intuition, I highly recommend reading The Illustrated Transformer.

Key takeaway:

LLMs don’t “know” things. They recognize patterns. Which means:

  • They’ll surprise you with smart answers.

  • They’ll also confidently make stuff up.


2. The components of an LLM request

When you send a request to an API (like OpenAI, Anthropic, or Gemini), here’s what’s happening (a minimal code sketch putting the pieces together follows after the list):

  1. Prompt → your input.

    • Example: “Summarize this error log in plain English.”

  2. Context window → the text the model can “see” at once.

    • GPT-3.5: ~16k tokens.

    • GPT-4: up to 128k tokens.

    • Once you hit the limit, older input has to be dropped or the request fails.

  3. Completion → the model’s generated output.

  4. Parameters you control:

    1. temperature

    • What it does: Controls the randomness of the model’s output.

    • How it works: The model assigns probabilities to possible next tokens. Temperature scales these probabilities before sampling.

    • Ranges & Effects:

      • 0 → deterministic (always picks the highest probability token). Good for math, logic, structured Q&A.

      • 0.3–0.5 → low randomness. Slight variation, still mostly predictable. Good for coding, factual answers.

      • 0.7 → moderate randomness. Balanced between creativity and coherence. Good for brainstorming or storytelling.

      • 1.0+ → very random. Can produce surprising or creative outputs, but is less reliable.

    Rule of thumb: Use lower values for accuracy, higher for creativity.


    2. max_tokens

    • What it does: Sets the maximum length of the model’s response, measured in tokens (≈ 4 characters in English on average).

    • Why it matters:

      • Prevents overly long or runaway responses.

      • Helps control costs and latency (since tokens = compute).

    • Tips:

      • Set high enough so the model can complete an idea.

      • For chat-like answers: 256–512 tokens.

      • For long essays, code, or deep analysis: 1,000–2,000+.


    3. top_p (nucleus sampling)

    • What it does: Another randomness control. Instead of temperature scaling, it restricts sampling to the smallest set of tokens whose cumulative probability ≥ top_p.

    • Ranges & Effects:

      • 1.0 → no restriction (default).

      • 0.9 → trims off unlikely options, keeps output coherent.

      • 0.7 → narrows even more, reducing diversity.

      Key note: Don’t over-adjust both temperature and top_p. Usually set one and leave the other at the default (the toy sketch below shows how each reshapes the distribution).
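To build intuition for what temperature and top_p do under the hood, here is a toy sketch in plain Python. The scores are invented for illustration; the point is how each knob reshapes or trims the distribution before sampling:

```python
import math

# Made-up raw scores (logits) for three candidate next tokens
logits = {"eggs": 4.0, "tofu": 1.5, "rats": -2.0}

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax with temperature: lower values sharpen the distribution,
    higher values flatten it (temperature must be > 0 in this sketch)."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / z for tok, v in scaled.items()}

def apply_top_p(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Nucleus sampling: keep the smallest set of most-likely tokens whose
    cumulative probability reaches top_p, then renormalize."""
    kept, total = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    return {tok: p / total for tok, p in kept.items()}

print(apply_temperature(logits, 0.3))  # nearly all probability mass on "eggs"
print(apply_temperature(logits, 1.5))  # flatter distribution, more randomness
print(apply_top_p(apply_temperature(logits, 1.0), 0.9))  # only "eggs" survives the 0.9 cutoff
```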
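For a concrete request that sets these parameters, here is a minimal sketch using the OpenAI Python SDK. The model name and prompt are placeholders, and the same shape carries over to the Anthropic and Gemini SDKs:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: use whatever model you have access to
    messages=[
        {
            "role": "user",
            "content": "Summarize this error log in plain English: <paste log here>",
        }
    ],
    temperature=0.3,  # low randomness: good for factual summaries
    max_tokens=512,   # cap on the completion length, measured in tokens
    top_p=1.0,        # left at the default; tune temperature OR top_p, not both
)

print(response.choices[0].message.content)
```

If you want to know how much of the context window a prompt will consume before sending it, OpenAI’s tiktoken library can tokenize the text locally and give you a token count.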
