Before we start, I would just like to signal boost my new digital product, “LinkedIn Playbook for Engineers & Founders.” You get a 10% discount as a newsletter subscriber, which you can redeem using this link
This is the first part of a 4-part guide on “GenAI for Engineers”
Intro
Everyone’s talking about GenAI
Most demos look magical
But once you peek under the hood, you realize: it’s not magic, it’s just statistics, probability, and a ton of GPUs.
I am writing a 4-part guide that will take you from knowing how to call an API → designing GenAI-powered systems you’d trust in production.
To be honest, this is not for non-technical readers; there are plenty of resources on the internet for writing better prompts, building agents, and so on. This guide is for software engineers who are interested in the foundations of LLMs.
I am not going to cover the very basics of what LLMs and GenAI are. If anything feels too complex, pause and ask ChatGPT to explain it to you.
Let’s start with Part 1, the foundations, in this newsletter.
1. How LLMs actually “think”
At the core of modern GenAI are Large Language Models (LLMs).
They don’t have knowledge graphs in their heads.
They don’t “reason” like humans.
They are giant next-token predictors trained on massive datasets.
When you prompt a model, it’s basically asking:
“Given all the text I’ve seen during training, what is the most likely next token?”
Example:
Input: “I’m going to make scrambled …”
Possible outputs:
eggs (highest probability)
tofu (lower, but still likely)
rats (technically possible if the dataset included weird jokes, but with very little probability).
That’s it. That’s the entire mechanism.
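If you want to see the idea in code, here’s a toy sketch of next-token sampling with made-up probabilities (nothing like a real model’s vocabulary, just the mechanism):

```python
import random

# Toy next-token distribution for the prompt "I'm going to make scrambled ..."
# The probabilities are invented purely for illustration.
next_token_probs = {
    "eggs": 0.92,
    "tofu": 0.07,
    "rats": 0.01,
}

def sample_next_token(probs):
    # Sample one token, weighted by its probability.
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))  # almost always "eggs"
```

A real model does this over a vocabulary of tens of thousands of tokens, one token at a time, feeding each choice back in as input.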
The power comes from scale:
Billions of parameters + trillions of tokens = surprisingly “intelligent” behavior
👉 For a visual and a more detailed intuition, I highly recommend reading The Illustrated Transformer.
Key takeaway:
LLMs don’t “know” things. They recognize patterns. Which means:
They’ll surprise you with smart answers.
They’ll also confidently make stuff up.
2. The components of an LLM request
When you send a request to an API (like OpenAI, Anthropic, or Gemini), here’s what’s happening:
Prompt → your input.
Example: “Summarize this error log in plain English.”
Context window → the text the model can “see” at once.
GPT-3.5: ~16k tokens.
GPT-4: up to 128k tokens.
Once you hit the limit, old input gets chopped.
Completion → the model’s generated output.
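Putting those three pieces together, here’s a minimal sketch of a request, assuming the OpenAI Python SDK (v1.x); the model name is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt: your input
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model works
    messages=[
        {"role": "user", "content": "Summarize this error log in plain English: ..."}
    ],
)

# Completion: the model's generated output
print(response.choices[0].message.content)
```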
Parameters you control:
1. temperature
What it does: Controls the randomness of the model’s output.
How it works: The model assigns probabilities to possible next tokens. Temperature scales these probabilities before sampling.
Ranges & Effects:
0 → deterministic (always picks the highest-probability token). Good for math, logic, structured Q&A.
0.3–0.5 → low randomness. Slight variation, still mostly predictable. Good for coding, factual answers.
0.7 → moderate randomness. Balanced between creativity and coherence. Good for brainstorming or storytelling.
1.0+ → very random. Can produce surprising or creative outputs, but is less reliable.
Rule of thumb: Use lower values for accuracy, higher for creativity.
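To make “scales these probabilities” concrete, here’s a toy sketch of temperature applied to a softmax over invented logits (not any provider’s actual implementation, just the intuition):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution (favors the top token);
    # higher temperature flattens it (spreads probability around).
    scaled = [logit / temperature for logit in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for the "scrambled ..." example
tokens = ["eggs", "tofu", "rats"]
logits = [4.0, 1.5, -2.0]

for t in (0.2, 0.7, 1.5):  # temperature 0 is effectively argmax, so we skip it
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

Notice how a low temperature pushes nearly all the probability onto “eggs”, while a higher temperature spreads it across the alternatives.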
2. max_tokens
What it does: Sets the maximum length of the model’s response, measured in tokens (≈ 4 characters in English on average).
Why it matters:
Prevents overly long or runaway responses.
Helps control costs and latency (since tokens = compute).
Tips:
Set high enough so the model can complete an idea.
For chat-like answers: 256–512 tokens.
For long essays, code, or deep analysis: 1,000–2,000+.
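To make the sizing concrete, here’s a rough sketch of the ≈ 4-characters-per-token heuristic (the helper is mine; use a real tokenizer such as tiktoken for exact counts):

```python
def rough_token_estimate(text: str) -> int:
    # Very rough heuristic: ~4 characters per token for English prose.
    # Use a real tokenizer (e.g. tiktoken) when you need exact numbers.
    return max(1, len(text) // 4)

prompt = "Summarize this error log in plain English: " + "ERROR timeout ... " * 50
print(rough_token_estimate(prompt))  # ballpark prompt size in tokens

# Then cap the response length in the request, e.g.:
# client.chat.completions.create(..., max_tokens=512)
```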
3. top_p (nucleus sampling)
What it does: Another randomness control. Instead of temperature scaling, it restricts sampling to the smallest set of tokens whose cumulative probability ≥ top_p.
Ranges & Effects:
1.0 → no restriction (default).
0.9 → trims off unlikely options, keeps output coherent.
0.7 → narrows even more, reducing diversity.
Key note: Don’t over-adjust both temperature and top_p. Usually set one and leave the other at its default.
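Here’s a toy sketch of the nucleus cutoff itself, with invented probabilities (real implementations do this over the full vocabulary at every step):

```python
def nucleus_filter(token_probs, top_p):
    # Keep the smallest set of highest-probability tokens whose cumulative
    # probability is >= top_p, then renormalize and sample from that set.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"eggs": 0.80, "tofu": 0.15, "toast": 0.04, "rats": 0.01}
print(nucleus_filter(probs, top_p=0.9))  # "toast" and "rats" get cut
```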