10 papers · 10 units
Pradhya Practice 00 · First Principles of Modern AI Free · Primer

The first principles of modern AI.

Every system you use today — chatbots, copilots, image models — stands on ten load-bearing papers published between 2017 and 2022. Strip away the hype and each one is a single, almost simple principle. This is the theory primer behind every other practice in the catalog.

Each unit is one paper: the principle, why it mattered, and an animated deep-dive diagram of the actual mechanism — real matrices, real exponents, real pipelines. Diagrams draw themselves as you scroll. Where a great live explainer exists, it’s linked under the figure.

For whom: Anyone who wants to know why any of this works; no math required
Length: 1 session · ~60 min
You'll walk away with: A mental map of the ten ideas every AI product is built on
Prereq: None; read this before everything else
What you’ll be able to do by the end
  • Name the one principle each landmark paper contributed — in a sentence each
  • Read an architecture diagram (attention, RAG, diffusion, RLHF) and know what every box does
  • Explain why scale — not architectural cleverness — drove the last five years
  • Follow any paper-trail callout in the other practices back to its source
§ 00.01 · Paper 01 — 2017

Global parallelism beats sequential processing.

Attention Is All You Need · Vaswani et al. · arXiv 1706.03762 ↗

Before the Transformer, models read text the way you do, one word at a time, each step waiting on the last. Self-attention removed the queue: every token attends to every other token simultaneously, no matter how far apart. The path length between any two positions becomes constant, the computation within a layer becomes parallel, and suddenly scale is an engineering problem instead of an architectural one.

Diagram: one attention head in five stages. ① Tokens (the, cat, sat) → ② Embed, x ∈ ℝ⁵¹² (5 dims shown) → ③ Project to Q, K, V (× Wq, × Wk, × Wv) → ④ Score + softmax, A = softmax(Q·Kᵀ / √dₖ) → ⑤ Mix values, Z = A · V (context-mixed vectors). Example attention matrix A, rows and columns ordered the / cat / sat: [.62 .27 .11], [.15 .70 .15], [.08 .35 .57]. Row “cat” is where cat looks: 70% itself, 15% “the”, 15% “sat”; every row sums to 1. × 8 heads, in parallel. Three matrix multiplies: every token is updated by every other token, all at once, with no recurrence anywhere. Sequential ops per layer: O(1).
Fig 01 — Anatomy of one attention head: embed → project → score → softmax → mix. The 3×3 matrix is the whole trick — each row is one token deciding where to look.
Go deeper: play with the live Transformer Explainer (Polo Club) ↗
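The five stages in Fig 01 fit in a few lines of code. What follows is a minimal sketch of a single head in plain Python, not the paper's implementation: the 3×4 input and the projection matrices are made-up toy numbers (the real model uses 512-dimensional embeddings and learned weights), and multi-head concatenation is left out.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, wq, wk, wv):
    """Project, score, softmax, mix: A = softmax(Q·Kᵀ / √d_k), Z = A·V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    keys_t = [list(col) for col in zip(*k)]          # Kᵀ
    scores = matmul(q, keys_t)                       # one row per query token
    a = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    z = matmul(a, v)                                 # context-mixed vectors
    return a, z

# Toy embeddings for "the", "cat", "sat" (hypothetical values, 4 dims).
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
# Hypothetical 4×2 projection matrices; in a trained model these are learned.
wq = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
wk = [[0.3, 0.1], [0.1, 0.3], [0.2, 0.0], [0.0, 0.2]]
wv = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

a, z = attention_head(x, wq, wk, wv)
for token, row in zip(["the", "cat", "sat"], a):
    print(token, [round(w, 2) for w in row])         # each row sums to 1
```

Nothing in `attention_head` loops over positions in order, which is the point: all rows of A are computed from the same three matrix products.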
§ 00.02 · Paper 02 — 2018

Meaning is bidirectional.

BERT: Pre-training of Deep Bidirectional Transformers · Devlin et al. · arXiv 1810.04805 ↗

Left-to-right models only ever see half the story. BERT's move: hide a word, then force the model to reconstruct it using context from both directions at once — the Cloze task. A model trained this way builds deep representations of what words mean in situ, and those representations fine-tune to almost any language task with barely any new architecture.

Diagram: input tokens “[CLS] the [MASK] sat on the mat [SEP]” plus position embeddings 0…7, passed through 12 encoder layers (attention + FFN each); every token sees every token. Mask comparison (rows = queries, columns = keys): GPT uses a causal mask and sees only the past; BERT uses no mask and sees both directions; the mask is the only difference. Output for the hidden position: softmax over a 30,522-word vocabulary, with cat .82, dog .07, kitten .05, mat .03, pizza .01. Self-supervision at scale: 3.3B words of free text, 15% hidden, no labels, no annotators; the text grades itself. Fine-tuning: add one output layer per task.
Fig 02 — BERT anatomy: same Transformer, different mask. Deleting the causal mask is what makes the representation bidirectional — and the Cloze task is what makes that trainable.
Go deeper: The Illustrated BERT (Jay Alammar) ↗
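Two pieces of Fig 02 are easy to make concrete: the attention mask that separates GPT from BERT, and the Cloze-style hiding of tokens. The sketch below is a simplification under stated assumptions: the function names are hypothetical, and every selected token is replaced with [MASK], whereas the paper replaces only 80% of selected tokens that way (10% become a random token, 10% stay unchanged).

```python
import random

def causal_mask(n):
    """GPT-style mask: query i may attend to key j only if j <= i."""
    return [[1 if key <= query else 0 for key in range(n)] for query in range(n)]

def bidirectional_mask(n):
    """BERT-style: no masking, every query sees every key."""
    return [[1] * n for _ in range(n)]

def hide_tokens(tokens, rate=0.15, seed=0):
    """Hide roughly `rate` of the ordinary tokens; return the masked
    sequence and the {position: original token} targets to predict."""
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    n_hide = max(1, round(rate * len(candidates)))
    hidden = set(rng.sample(candidates, n_hide))
    masked = ["[MASK]" if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return masked, targets

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, targets = hide_tokens(tokens)
print(causal_mask(3))         # lower-triangular: only the past is visible
print(bidirectional_mask(3))  # all ones: both directions are visible
print(masked, targets)
```

The training labels in `targets` come from the text itself, which is why no annotators are needed.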
§ 00.03 · Paper 03 — 2020

Loss falls as a power law in compute, data, and parameters.

Scaling Laws for Neural Language Models · Kaplan et al. · arXiv 2001.08361 ↗

Performance isn't a mystery; it's a curve. Test loss is governed by three macroscopic dials: parameters, dataset size, and compute. Scale them together and loss drops along a straight line on log-log paper, predictably, across more than seven orders of magnitude. The shock: micro-architecture (depth vs. width, head counts) barely moves the line. Scale is the signal; almost everything else is noise.

Diagram: three log-log panels of test loss against C (compute, PF-days), D (dataset size, tokens), and N (parameters), with fitted lines L ∝ C^−0.050, L ∝ D^−0.095, and L ∝ N^−0.076. Fourth panel, “Put together: the compute frontier”: test loss against compute (log), one gray curve per model size (1M, 10M, 100M, 1B, 10B params), each curve being that model trained longer. The frontier is a straight power law, and bigger models are more sample-efficient. Depth, width, and heads are almost irrelevant once N, D, and C are fixed.
Fig 03 — The measured exponents: α_C ≈ 0.050, α_D ≈ 0.095, α_N ≈ 0.076. Loss is a budget line — GPT-3 was sized off these curves before it was trained.
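To get a feel for how shallow these exponents are, it helps to plug in numbers. The sketch below only applies the proportionality L ∝ X^−α with the exponents from Fig 03; the helper names are hypothetical, the constants of proportionality are ignored, and each law is assumed to hold with the other two factors not acting as the bottleneck.

```python
import math

# Exponents quoted in Fig 03.
EXPONENTS = {"compute": 0.050, "data": 0.095, "params": 0.076}

def loss_ratio(scale_factor, alpha):
    """Factor by which loss changes when a resource grows by scale_factor,
    under L ∝ X^(-alpha)."""
    return scale_factor ** (-alpha)

def scale_needed(target_ratio, alpha):
    """Resource multiplier needed to multiply loss by target_ratio (< 1)."""
    return target_ratio ** (-1.0 / alpha)

for name, alpha in EXPONENTS.items():
    print(f"10x more {name}: loss x {loss_ratio(10, alpha):.3f}")
    print(f"halving loss via {name} alone: {scale_needed(0.5, alpha):.3g}x more")

# On log-log axes the law is a straight line whose slope is -alpha.
slope = math.log(loss_ratio(100, 0.076)) / math.log(100)
```

Ten times more parameters buys roughly a 16% lower loss, which is why the curves only look dramatic across many orders of magnitude.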
§ 00.04 · Paper 04 — 2020

Separate knowing from reasoning.

Retrieval-Augmented Generation for Knowledge-Intensive NLP · Lewis et al. · arXiv 2005.11401 ↗

Why force a network to memorize the world inside frozen weights? RAG splits memory in two: parametric (the model's learned ability to reason and write) and non-parametric (an external, searchable index of documents). A retriever pulls the relevant pages at question-time and the generator conditions on them — answers get more factual and specific, hallucinations drop, and you can update the library without retraining the brain.

Diagram: Question (“who wrote Hamlet?”) → query encoder BERTq → q ∈ ℝ⁷⁶⁸ → vector space of 21M Wikipedia passages, searched by MIPS (nearest neighbors by inner product) inside a top-k boundary → hits z₁ (0.91), z₂ (0.87), z₃ (0.79) → each [ q ; zᵢ ] fed to the seq2seq generator → “William Shakespeare”, cited from z₁. Formula: p(y|x) = Σz p(z|x) · p(y|x,z), where p(z|x) is the retriever and p(y|x,z) is the generator. The answer is marginalized over documents: a weighted vote across what was retrieved. Uncertainty about sources is part of the math, not an afterthought. Update the index, not the weights.
Fig 04 — RAG anatomy: dense retrieval is just nearest-neighbor search in embedding space; generation conditions on (and votes across) the top-k hits.
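The formula in Fig 04 reads more easily as code. This is a toy sketch rather than the paper's system: the 2-dimensional vectors, the document ids, and the per-document answer probabilities are all invented for illustration, the search is brute force instead of an approximate MIPS index, and p(z|x) is a softmax over inner products renormalized across the top-k hits.

```python
import math

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, index, k):
    """Brute-force maximum inner product search, then a softmax over the
    top-k scores to get the retriever distribution p(z|x)."""
    scored = sorted(((inner(query_vec, vec), doc_id) for doc_id, vec in index.items()),
                    reverse=True)[:k]
    peak = scored[0][0]
    exps = [math.exp(score - peak) for score, _ in scored]
    total = sum(exps)
    return {doc_id: e / total for (_, doc_id), e in zip(scored, exps)}

def marginal_answer_prob(doc_probs, answer_given_doc):
    """p(y|x) = sum over z of p(z|x) * p(y|x,z)."""
    return sum(p_z * answer_given_doc[z] for z, p_z in doc_probs.items())

# Hypothetical passage embeddings and query embedding.
index = {"z1": [0.9, 0.1], "z2": [0.8, 0.3], "z3": [0.1, 0.9], "z4": [-0.5, 0.2]}
q = [1.0, 0.0]

doc_probs = retrieve(q, index, k=2)
# Hypothetical generator outputs: probability of the answer given each passage.
answer_given_doc = {"z1": 0.95, "z2": 0.60}
p_answer = marginal_answer_prob(doc_probs, answer_given_doc)
print(doc_probs, round(p_answer, 3))
```

Editing an entry in `index` changes what gets retrieved without touching any model weights, which is the "update the index, not the weights" point.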
§ 00.05 · Paper 05 — 2020

At scale, the prompt becomes the program.

Language Models are Few-Shot Learners (GPT-3) · Brown et al. · arXiv 2005.14165 ↗

Train a 175-billion-parameter model on one dumb objective — predict the next word — and something strange emerges: in-context learning. Show it two examples of a brand-new task inside the prompt and it performs the task, with zero gradient updates. No fine-tuning, no new weights. Pattern completion at sufficient scale starts to look like general-purpose adaptation.

Diagram, left: three prompt formats, with the same frozen 175B weights for all three; only the prompt changes. Zero-shot (0 examples): “Translate English to French: cheese →”. One-shot (1 example): “sea otter → loutre de mer”, then “cheese →”. Few-shot (K examples): “sea otter → loutre de mer”, “peppermint → menthe poivrée”, “plush giraffe → girafe en peluche”, then “cheese →”, completed as “fromage”. Diagram, right: accuracy (%, axis 0 to 60) against examples in prompt, K (log axis: 0, 1, 4, 16, 64), with curves for 175B, 13B, and 1.3B; in-context learning is emergent with scale. The big model reads the examples and infers the task; the small ones just see more text. Same architecture, same objective; the capability appears with scale. Zero gradient updates: activations do the learning.
Fig 05 — GPT-3 anatomy: the prompt format IS the task specification. Accuracy climbs with in-context examples — but only once the model is big enough to use them.
Go deeper: How GPT-3 Works — animated (Jay Alammar) ↗
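Since the prompt format is the task specification, "programming" the model is string assembly. The helper below is a hypothetical sketch of how the zero-, one-, and few-shot prompts in Fig 05 could be built; the function name and the `=>` separator are arbitrary choices, and no model is called.

```python
def build_prompt(instruction, examples, query, sep=" => "):
    """Assemble a K-shot prompt: instruction, K solved examples, then the
    unsolved query for the model to complete."""
    lines = [instruction]
    lines += [f"{source}{sep}{target}" for source, target in examples]
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)

instruction = "Translate English to French:"
demos = [("sea otter", "loutre de mer"),
         ("peppermint", "menthe poivrée"),
         ("plush giraffe", "girafe en peluche")]

zero_shot = build_prompt(instruction, [], "cheese")
one_shot = build_prompt(instruction, demos[:1], "cheese")
few_shot = build_prompt(instruction, demos, "cheese")
print(few_shot)
```

All three strings would go to the same frozen weights; the only thing that varies is how many solved examples precede the query.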
§ 00.06 · Paper 06 — 2021

Language is the supervisor.

CLIP: Learning Transferable Visual Models From Natural Language Supervision · Radford et al. · arXiv 2103.00020 ↗

Classic vision models memorized 1,000 rigid human-labeled categories. CLIP replaces the label sheet with the internet's captions: train an image encoder and a text encoder jointly so that matching image–text pairs pull together and mismatched pairs push apart in one shared space. The result recognizes practically any concept you can phrase — zero-shot, no retraining.

Diagram, training (one batch; 400M pairs total): an image encoder (ViT / ResNet) produces I₁, I₂, I₃; a text encoder (Transformer) produces T₁, T₂, T₃ for the captions “pelican over surf”, “a tabby cat”, “red tractor”. Similarity matrix, rows I₁ to I₃ and columns T₁ to T₃: [.92 .18 .08], [.12 .88 .15], [.07 .21 .90]. Maximize the diagonal, minimize everything else. Loss: InfoNCE, cross-entropy on rows + columns, with temperature τ. Diagram, inference (a classifier built out of text, at runtime): new image → image encoder; prompt template × your label set (any labels, any time), for example “a photo of a pelican”, “a photo of a cat”, “a photo of a tractor” → text encoder; cos(image, text) gives pelican .84 ✓, cat .11, tractor .05; argmax → “pelican”, zero-shot. Swap the words and you have built a new classifier: no retraining, no label sheet. Vision inherits the open vocabulary of language. One shared embedding space for pixels and words.
Fig 06 — CLIP anatomy: contrastive training builds the shared space; the zero-shot recipe — embed your labels as sentences, take the nearest — is what made it famous.
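Both halves of Fig 06 reduce to small formulas. The sketch below computes the symmetric cross-entropy on the 3×3 similarity matrix from the figure and then runs the zero-shot recipe on made-up 3-dimensional embeddings. Assumptions: the temperature is fixed at 0.07 (in the paper it is a learned parameter), the label and image vectors are hypothetical stand-ins for real encoder outputs, and the function names are illustrative.

```python
import math

def log_softmax(row):
    peak = max(row)
    lse = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - lse for v in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy: each image must pick its own caption (rows)
    and each caption its own image (columns)."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (image_to_text + text_to_image) / 2

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot(image_vec, label_vecs):
    """Return the label whose text embedding is nearest to the image."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

# Similarity matrix from the figure: matched pairs sit on the diagonal.
aligned = [[0.92, 0.18, 0.08], [0.12, 0.88, 0.15], [0.07, 0.21, 0.90]]
# Same numbers with the first two rows swapped: pairs no longer line up.
misaligned = [aligned[1], aligned[0], aligned[2]]
loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)

# Hypothetical text embeddings for three prompts and one image embedding.
label_vecs = {"pelican": [1.0, 0.1, 0.0], "cat": [0.0, 1.0, 0.1], "tractor": [0.1, 0.0, 1.0]}
prediction = zero_shot([0.9, 0.2, 0.1], label_vecs)
print(loss_aligned, loss_misaligned, prediction)
```

Changing the keys of `label_vecs` changes the classifier; nothing is retrained.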
§ 00.07 · Paper 07 — 2021

Adaptation lives in a low-rank subspace.

LoRA: Low-Rank Adaptation of Large Language Models · Hu et al. · arXiv 2106.09685 ↗

Teaching a giant model a new task doesn't rewrite its knowledge — the weight change during adaptation has low intrinsic rank. So: freeze the original matrix, inject two tiny trainable matrices whose product approximates the update. Trainable parameters drop ~10,000×, GPU memory ~3×, and at inference the update merges back in — zero added latency.

Diagram, one Transformer block (× 96 blocks in GPT-3): self-attention with Wq, Wk, Wv, Wo, where Wq and Wv each get + B·A; the feed-forward sublayer is left frozen with no adapter. Adapting only Wq and Wv is enough. Forward pass: h = W₀x + B·A·x, with W₀ 4096 × 4096 and frozen, A 8 × 4096, and B 4096 × 8 initialized to 0. B starts at zero, so ΔW = 0 at step one and training begins exactly at the pretrained model; gradients flow only into A and B. The arithmetic, per matrix: full ΔW 16.8M parameters; B·A at r=8, 65.5K. GPT-3 175B, all layers: ~4.7M trainable (0.003%); checkpoint 350GB → 35MB. At inference: merge W ← W₀ + B·A; added latency zero. Task A, task B, task C: swap adapters, not models. Fine-tuning never rewrote the knowledge; it nudged it in a few directions. LoRA just writes the nudge in its natural, low-rank coordinates. r ≈ 1–8 is usually enough.
Fig 07 — LoRA anatomy: a 4096×4096 update expressed as (4096×8)·(8×4096). Two skinny matrices carry the whole adaptation.
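The per-matrix arithmetic and the forward pass h = W₀x + B·A·x can be checked in a few lines. This is a toy sketch in plain Python: the 3×3 matrix and rank-1 adapter are made-up numbers, the function names are hypothetical, and the α/r scaling factor the paper applies to the update is omitted for brevity.

```python
def lora_param_counts(d_out, d_in, r):
    """Parameters in a full update ΔW versus the factors B (d_out×r) and A (r×d_in)."""
    return d_out * d_in, r * (d_out + d_in)

full, low_rank = lora_param_counts(4096, 4096, 8)
print(full, low_rank, full // low_rank)   # 16777216 65536 256

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w0, a, b, x):
    """h = W0·x + B·(A·x); W0 stays frozen, only A and B would be trained."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]

def merge(w0, a, b):
    """Fold the adapter into the weights for inference: W = W0 + B·A."""
    r = len(a)
    return [[w0[i][j] + sum(b[i][k] * a[k][j] for k in range(r))
             for j in range(len(w0[0]))] for i in range(len(w0))]

# Toy sizes: d = 3, r = 1 (hypothetical numbers).
w0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
a = [[0.5, -1.0, 0.25]]                  # r × d_in
b_init = [[0.0], [0.0], [0.0]]           # d_out × r, zero at initialization
b_trained = [[0.2], [-0.1], [0.4]]       # pretend values after training
x = [1.0, 2.0, 3.0]

h_start = lora_forward(w0, a, b_init, x)          # equals the pretrained output
h_adapted = lora_forward(w0, a, b_trained, x)
h_merged = matvec(merge(w0, a, b_trained), x)     # same result, one matrix
```

`h_adapted` and `h_merged` agree, which is why the merged model adds no latency: at inference there is only one matrix left.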
§ 00.08 · Paper 08 — 2022

Compress first, then create.

High-Resolution Image Synthesis with Latent Diffusion Models · Rombach et al. · arXiv 2112.10752 ↗

Diffusion models generate by reversing noise — agonizingly slow in raw pixel space, where most bits encode imperceptible detail. The fix: an autoencoder first squeezes the image into a small latent space that keeps the semantics and discards the imperceptible; diffusion then runs entirely in that compressed space, and a decoder restores full resolution at the end. Same quality, a fraction of the compute — this is the architecture behind Stable Diffusion.

Diagram: ① Perceptual compression: the encoder maps a 512×512×3 image (786,432 values) to a 64×64×4 latent (16,384 values, 48× fewer); keep the composition and meaning, throw away what eyes can’t see. Text prompt “a cat in a hat” → frozen text encoder. ② Reverse diffusion, entirely in latent space: t = 1000 → 750 → 500 → 250 → 0; the U-Net εθ predicts the noise to remove at every step, ≈ 50 steps (DDIM); cross-attention takes Q from the latent and K, V from the text, so the prompt steers every step. The decoder 𝒟 then produces the 512² output. Diffusion never touches a pixel: all ~50 denoising steps run in the 48×-smaller latent. That is why it runs on a gaming GPU.
Fig 08 — Latent diffusion anatomy: compress (ℰ), denoise under text guidance (U-Net + cross-attention), decode (𝒟).
Go deeper: play with the live Diffusion Explainer (Polo Club) ↗
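The compression numbers in Fig 08 are plain shape arithmetic. The sketch below reproduces them; the helper names are hypothetical, and the downsampling factor of 8 and the 4 latent channels are simply read off the 512 → 64 and ×4 figures in the diagram.

```python
def tensor_size(height, width, channels):
    """Number of values in an image or latent tensor."""
    return height * width * channels

def latent_shape(height, width, factor=8, channels=4):
    """Shape after the autoencoder downsamples each spatial side by `factor`."""
    return height // factor, width // factor, channels

pixel_values = tensor_size(512, 512, 3)                 # 786,432
latent_values = tensor_size(*latent_shape(512, 512))    # 16,384
compression = pixel_values / latent_values              # 48.0

steps = 50  # approximate DDIM step count from the figure
# Values the denoiser handles over a full sampling run, in each space.
print(steps * pixel_values, steps * latent_values, compression)
```

Every one of the roughly 50 denoising steps pays the smaller cost, so the saving compounds over the whole sampling run.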
§ 00.09 · Paper 09 — 2022

Reasoning needs room to compute.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · Wei et al. · arXiv 2201.11903 ↗

A model that must answer in one breath gets hard problems wrong — there's nowhere to put the intermediate work. Chain-of-thought makes the model write the steps out loud, and each generated token is extra computation spent on the problem. Decomposition isn't a presentation choice; it's a compute budget — and with it, reasoning abilities emerge in large models that simply weren't there before.

Diagram, top (where the compute lives; every output token = one full forward pass): Direct: “23 × 17 = ?” → tokens 2, 1, 1 (3 passes) → “211” ✗. Chain of thought: 23×20 = 460, 23×3 = 69, 460−69 = 391 (9+ passes) → 391 ✓. Each written token buys one more pass through the entire network: serial depth you cannot get any other way. Diagram, bottom: GSM8K math word problems, solve rate (%, axis 0 to 60) against model parameters (log axis: 0.4B, 8B, 62B, 540B), for standard prompting (17.9% at 540B) and chain-of-thought (56.9% at 540B). Flat, flat, flat, then the curve breaks upward: small models produce fluent nonsense when asked to reason; past a threshold, the steps start to bind. Reasoning is emergent, unlocked by a prompt. Same weights, same question; the only change is permission to think out loud. Every modern “reasoning model” is this principle, industrialized.
Fig 09 — CoT anatomy: 3× the tokens ≈ 3× the serial compute, and on PaLM 540B that turns 17.9% into 56.9% on grade-school math.
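The worked example in Fig 09 can be spelled out step by step. The sketch below writes the same decomposition the figure uses (round one factor up to a multiple of ten, then subtract the excess) and counts whitespace-separated pieces as a crude stand-in for generated tokens. Both the helper name and that token proxy are assumptions; real tokenizers split text differently.

```python
def decompose_product(a, b):
    """Write a×b as a×(rounded) − a×(excess), one line per intermediate step."""
    rounded = -(-b // 10) * 10          # round b up to the next multiple of 10
    excess = rounded - b
    big, small = a * rounded, a * excess
    steps = [f"{a}×{rounded} = {big}",
             f"{a}×{excess} = {small}",
             f"{big}−{small} = {big - small}"]
    return steps, big - small

steps, answer = decompose_product(23, 17)
for line in steps:
    print(line)                          # 23×20 = 460, 23×3 = 69, 460−69 = 391

# Crude proxy: one forward pass per whitespace-separated piece of output.
passes_with_steps = sum(len(line.split()) for line in steps)
print(answer, passes_with_steps)
```

Each intermediate line is cheap on its own; the point is that the final subtraction gets to read 460 and 69 from the page instead of computing everything in a single pass.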
§ 00.10 · Paper 10 — 2022

Align the objective to the intent.

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) · Ouyang et al. · arXiv 2203.02155 ↗

"Predict the next word on the internet" is not the same goal as "be helpful, honest, and harmless." RLHF closes that gap: humans rank the model's outputs, a reward model learns to score like those humans, and reinforcement learning then steers the LM to maximize that learned reward. The optimization target stops being the internet's average and starts being what people actually wanted.

Diagram: ① SFT (show it): for a prompt such as “explain the moon landing to a 6-year-old”, a labeler writes the ideal answer; supervised fine-tuning gives the SFT model (GPT-3 + good manners). 13K demonstration prompts. ② Reward model (score it): the SFT model samples 4–9 answers (outputs A, B, C, D); a labeler ranks them, D ≻ C ≻ A ≻ B (ranks, not scores, because they are easier to judge); the reward model r(x, y) is a scorer that predicts human taste, trained with loss = −log σ( r(D) − r(B) ). 33K prompts, pairwise comparisons. ③ PPO (optimize it): policy π, initialized from the SFT model, produces response y; the RM scores it → r; PPO update, repeated, with a KL penalty so the policy does not drift far from SFT. The policy chases the learned reward, leashed to the model people approved of. 31K prompts, RL only, no new labels. Human evals (which output do people prefer?): InstructGPT 1.3B is preferred over GPT-3 175B at 1/100th the size; alignment beat a 100× scale advantage.
Fig 10 — The real pipeline: SFT (13k demos) → reward model (33k rankings) → PPO with a KL leash (31k prompts). Every chat assistant you use descends from this diagram.
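Steps ② and ③ each rest on one formula. The sketch below expands a ranking into all of its pairwise comparisons and averages the loss −log σ(r_preferred − r_rejected), then shows a KL-penalized reward of the kind used to keep the policy near the SFT model. The reward values, log-probabilities, function names, and the coefficient `beta` are illustrative assumptions, and no PPO update is implemented.

```python
import math
from itertools import combinations

def pairwise_loss(r_preferred, r_rejected):
    """-log σ(r_preferred − r_rejected), written as log(1 + e^(−margin))."""
    return math.log1p(math.exp(-(r_preferred - r_rejected)))

def ranking_loss(ranking, rewards):
    """Average pairwise loss over every pair implied by a best-to-worst ranking."""
    pairs = list(combinations(ranking, 2))   # earlier item in each pair is preferred
    return sum(pairwise_loss(rewards[w], rewards[l]) for w, l in pairs) / len(pairs)

def penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward model score minus a penalty for drifting from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)

ranking = ["D", "C", "A", "B"]               # D ≻ C ≻ A ≻ B, as in the figure
agrees = {"D": 2.0, "C": 1.0, "A": 0.0, "B": -1.0}      # scores that match the humans
disagrees = {"D": -1.0, "C": 0.0, "A": 1.0, "B": 2.0}   # scores that invert the ranking

loss_agrees = ranking_loss(ranking, agrees)
loss_disagrees = ranking_loss(ranking, disagrees)
print(loss_agrees, loss_disagrees)

# Same reward model score; the response that drifted from SFT earns less.
on_leash = penalized_reward(1.0, logp_policy=-12.0, logp_sft=-12.0)
drifted = penalized_reward(1.0, logp_policy=-5.0, logp_sft=-12.0)
```

A ranking of four outputs yields six comparisons, which is part of why ranks are a cheap way to collect training signal.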