Pradhya Practice 00 · First Principles of Modern AI Free · Primer

The first principles of modern AI.

Every system you use today — chatbots, copilots, image models — stands on ten load-bearing papers published between 2017 and 2022. Strip away the hype and each one is a single, almost simple principle. This is the theory primer behind every other practice in the catalog.

Each unit is one paper: the principle, why it mattered, and an animated deep-dive diagram of the actual mechanism — real matrices, real exponents, real pipelines. Diagrams draw themselves as you scroll. Where a great live explainer exists, it’s linked under the figure.

For whom

Anyone who wants to know why any of this works — no math required

Length

1 session · ~60 min

You'll walk away with

A mental map of the ten ideas every AI product is built on

Prereq

None — read this before everything else

What you’ll be able to do by the end

Name the one principle each landmark paper contributed — in a sentence each
Read an architecture diagram (attention, RAG, diffusion, RLHF) and know what every box does
Explain why scale, rather than architectural cleverness, drove progress from 2017 to 2022
Follow any paper-trail callout in the other practices back to its source

§ 00.01 · Paper 01 — 2017

Global parallelism beats sequential processing.

Attention Is All You Need · Vaswani et al. · arXiv 1706.03762 ↗

Before the Transformer, models read text the way you do — one word at a time, each step waiting on the last. Self-attention removed the queue: every token looks at every other token simultaneously, no matter how far apart. The path between any two positions becomes constant, the whole computation becomes parallel, and suddenly scale is an engineering problem instead of an architectural one.

Fig 01 — Anatomy of one attention head: embed → project → score → softmax → mix. The 3×3 matrix is the whole trick — each row is one token deciding where to look.
Go deeper: play with the live Transformer Explainer (Polo Club) ↗
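
A minimal sketch of the score → softmax → mix steps for one head, in plain Python. The Q, K, and V values are made-up toy numbers standing in for the projected embeddings, and the function names are illustrative rather than taken from the paper's code.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head, on plain Python lists."""
    d_k = len(K[0])
    # Score: every query dotted with every key, scaled by sqrt(d_k).
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Softmax: each row becomes one token's attention distribution.
    weights = [softmax(row) for row in scores]
    # Mix: each output is a weighted average of the value vectors.
    out = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
            for j in range(len(V[0]))] for w_row in weights]
    return weights, out

# Toy, made-up projections for 3 tokens with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, out = attention(Q, K, V)
```

Here weights is the 3×3 matrix from the figure: row i is token i's distribution over all three tokens, and no row depends on another, which is why the rows can be computed in parallel.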

§ 00.02 · Paper 02 — 2018

Meaning is bidirectional.

BERT: Pre-training of Deep Bidirectional Transformers · Devlin et al. · arXiv 1810.04805 ↗

Left-to-right models only ever see half the story. BERT's move: hide a word, then force the model to reconstruct it using context from both directions at once — the Cloze task. A model trained this way builds deep representations of what words mean in situ, and those representations fine-tune to almost any language task with barely any new architecture.

Fig 02 — BERT anatomy: same Transformer, different mask. Deleting the causal mask is what makes the representation bidirectional — and the Cloze task is what makes that trainable.
Go deeper: The Illustrated BERT (Jay Alammar) ↗
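
To make "same Transformer, different mask" concrete, here is a small sketch comparing what a hidden position is allowed to see under each mask. It is a simplification: the masked position is hand-picked rather than sampled, and all function names are illustrative.

```python
def causal_mask(n):
    # Position i may attend only to positions j <= i (left-to-right model).
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Every position may attend to every other position.
    return [[1] * n for _ in range(n)]

def mask_tokens(tokens, positions, mask_token="[MASK]"):
    # Hide the chosen tokens and remember them as reconstruction targets.
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = mask_token
    return masked, targets

def visible(tokens, mask, i):
    # Tokens that position i is allowed to attend to.
    return [t for t, allowed in zip(tokens, mask[i]) if allowed]

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, [1])
left_only = visible(masked, causal_mask(len(masked)), 1)
both_sides = visible(masked, bidirectional_mask(len(masked)), 1)
```

With the causal mask, the hidden position sees only "the" and itself; with the bidirectional mask it also sees "sat on the mat", which is the context the Cloze objective trains the model to use.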

§ 00.03 · Paper 03 — 2020

Loss falls as a power law in compute, data, and parameters.

Scaling Laws for Neural Language Models · Kaplan et al. · arXiv 2001.08361 ↗

Performance isn't a mystery; it's a curve. Test loss is governed by three macroscopic dials: parameters, dataset size, and compute. Scale them together and loss drops along a straight line on log-log paper, predictably, across more than seven orders of magnitude. The shock: micro-architecture (depth vs. width, head counts) barely moves the line. Scale is the signal; almost everything else is noise.

Fig 03 — The measured exponents: α_C ≈ 0.050, α_D ≈ 0.095, α_N ≈ 0.076. Loss is a budget line — GPT-3 was sized off these curves before it was trained.
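
A quick way to read those exponents, as a sketch: assume each curve has the form L(X) = (X_c / X)^α, with the other two resources not limiting. The constant X_c then cancels when comparing two scales, so only the exponents quoted in the figure are needed. The helper name below is illustrative.

```python
# Exponents as quoted in Fig 03.
ALPHA = {"compute": 0.050, "data": 0.095, "parameters": 0.076}

def loss_ratio(scale_factor, alpha):
    """Factor by which loss shrinks when one resource grows by scale_factor,
    assuming L(X) = (X_c / X) ** alpha and the other resources are not limiting."""
    return scale_factor ** (-alpha)

for name, alpha in ALPHA.items():
    ratio = loss_ratio(10.0, alpha)
    print(f"10x more {name}: loss multiplied by {ratio:.3f}")
```

Under these assumptions, a tenfold increase trims loss by roughly 11% (compute), 20% (data), or 16% (parameters). Taking logs gives log L = α·log X_c − α·log X, which is the straight line on log-log paper with slope −α.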

§ 00.04 · Paper 04 — 2020

Separate knowing from reasoning.

Retrieval-Augmented Generation for Knowledge-Intensive NLP · Lewis et al. · arXiv 2005.11401 ↗

Why force a network to memorize the world inside frozen weights? RAG splits memory in two: parametric (the model's learned ability to reason and write) and non-parametric (an external, searchable index of documents). A retriever pulls the relevant pages at question-time and the generator conditions on them — answers get more factual and specific, hallucinations drop, and you can update the library without retraining the brain.

Fig 04 — RAG anatomy: dense retrieval is just nearest-neighbor search in embedding space; generation conditions on (and votes across) the top-k hits.
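
The retrieve-then-condition loop can be sketched in a few lines. Everything here is a toy: the three-dimensional vectors are made-up stand-ins for learned embeddings, cosine similarity is one common choice of score, and the final prompt string is only a placeholder for whatever the generator actually consumes.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    # Nearest-neighbor search: rank every document by similarity to the query.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Non-parametric memory: (document, made-up embedding) pairs.
index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Photosynthesis converts light into chemical energy.", [0.0, 0.2, 0.9]),
    ("France borders Spain and Italy.", [0.7, 0.3, 0.1]),
]

question = "What is the capital of France?"
query_vec = [1.0, 0.0, 0.0]  # made-up embedding of the question
hits = retrieve(query_vec, index, k=2)

# The generator conditions on the retrieved text plus the question.
prompt = "Context:\n" + "\n".join(hits) + "\n\nQuestion: " + question
```

Updating the library means editing index; nothing about the generator changes.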

§ 00.05 · Paper 05 — 2020

At scale, the prompt becomes the program.

Language Models are Few-Shot Learners (GPT-3) · Brown et al. · arXiv 2005.14165 ↗

Train a 175-billion-parameter model on one dumb objective — predict the next word — and something strange emerges: in-context learning. Show it two examples of a brand-new task inside the prompt and it performs the task, with zero gradient updates. No fine-tuning, no new weights. Pattern completion at sufficient scale starts to look like general-purpose adaptation.

Fig 05 — GPT-3 anatomy: the prompt format IS the task specification. Accuracy climbs with in-context examples — but only once the model is big enough to use them.
Go deeper: How GPT-3 Works — animated (Jay Alammar) ↗
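
"The prompt format is the task specification" can be shown without a model at all. The sketch below only assembles the text; the task, the example pairs, and the "=>" separator are illustrative choices, not a required format.

```python
def few_shot_prompt(task, examples, query):
    # Task description, then k solved examples, then the unsolved query.
    lines = [task]
    lines += [f"{source} => {target}" for source, target in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
prompt = few_shot_prompt("Translate English to French:", examples, "bread")
print(prompt)
```

The model's weights never change. Swapping the examples swaps the task, and the only thing the model is asked to do is continue the text after the final "=>".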

§ 00.06 · Paper 06 — 2021

Language is the supervisor.

CLIP: Learning Transferable Visual Models From Natural Language Supervision · Radford et al. · arXiv 2103.00020 ↗

Classic vision models memorized 1,000 rigid human-labeled categories. CLIP replaces the label sheet with the internet's captions: train an image encoder and a text encoder jointly so that matching image–text pairs pull together and mismatched pairs push apart in one shared space. The result recognizes practically any concept you can phrase — zero-shot, no retraining.

Fig 06 — CLIP anatomy: contrastive training builds the shared space; the zero-shot recipe — embed your labels as sentences, take the nearest — is what made it famous.
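
The zero-shot recipe reduces to a nearest-neighbor lookup in the shared space. In this sketch the vectors are made-up numbers standing in for the outputs of the image and text encoders, and the label sentences are illustrative.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot(image_vec, label_vecs):
    # Cosine similarity between the image and each label sentence.
    img = normalize(image_vec)
    sims = {label: sum(a * b for a, b in zip(img, normalize(vec)))
            for label, vec in label_vecs.items()}
    # The prediction is simply the closest sentence.
    return max(sims, key=sims.get), sims

# Made-up text embeddings for three label sentences.
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.2, 0.1]  # made-up image embedding
best, sims = zero_shot(image_vec, label_vecs)
```

Adding a new category means adding one more sentence to label_vecs; no image model is retrained.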

§ 00.07 · Paper 07 — 2021

Adaptation lives in a low-rank subspace.

LoRA: Low-Rank Adaptation of Large Language Models · Hu et al. · arXiv 2106.09685 ↗

Teaching a giant model a new task doesn't rewrite its knowledge: the weight change during adaptation has low intrinsic rank. So freeze the original matrix and inject two tiny trainable matrices whose product approximates the update. On GPT-3 175B, trainable parameters drop by up to ~10,000× and GPU memory by ~3×, and at inference the update merges back in, adding zero latency.

Fig 07 — LoRA anatomy: a 4096×4096 update expressed as (4096×8)·(8×4096). Two skinny matrices carry the whole adaptation.
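
The arithmetic behind the figure, plus a tiny merge demo, as a sketch. The 4096 and 8 come from the caption; the 3×3 weights and rank-1 factors are made-up toy values, and the names B and A are illustrative labels for the two skinny matrices.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Parameter counts for the shapes in the caption.
dim, rank = 4096, 8
full_params = dim * dim                  # 16,777,216 entries in a dense update
lora_params = dim * rank + rank * dim    # 65,536 entries in the two factors
reduction = full_params // lora_params   # 256

# Toy merge: a frozen 3x3 weight plus a rank-1 update B @ A.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]   # 3 x 1, trainable
A = [[1.0, 2.0, 3.0]]        # 1 x 3, trainable
delta = matmul(B, A)
W_merged = [[w + dw for w, dw in zip(w_row, dw_row)]
            for w_row, dw_row in zip(W, delta)]

# Same output either way: one merged matmul, or frozen path plus adapter path.
x = [[1.0], [1.0], [1.0]]
merged_out = matmul(W_merged, x)
split_out = [[a[0] + b[0]] for a, b in zip(matmul(W, x), matmul(B, matmul(A, x)))]
```

For this single matrix the saving is 256×. The much larger model-wide figure depends on which matrices receive adapters and on the chosen rank. Because W_merged gives the same output as the two-path version, serving the merged weights costs nothing extra at inference.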

§ 00.08 · Paper 08 — 2022

Compress first, then create.

High-Resolution Image Synthesis with Latent Diffusion Models · Rombach et al. · arXiv 2112.10752 ↗

Diffusion models generate by reversing noise — agonizingly slow in raw pixel space, where most bits encode imperceptible detail. The fix: an autoencoder first squeezes the image into a small latent space that keeps the semantics and discards the imperceptible; diffusion then runs entirely in that compressed space, and a decoder restores full resolution at the end. Same quality, a fraction of the compute — this is the architecture behind Stable Diffusion.

Fig 08 — Latent diffusion anatomy: compress (ℰ), denoise under text guidance (U-Net + cross-attention), decode (𝒟).
Go deeper: play with the live Diffusion Explainer (Polo Club) ↗
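
A back-of-the-envelope sketch of what "compress first" buys. The downsampling factor of 8 and the 4 latent channels are an assumed configuration (the one commonly associated with Stable Diffusion at 512×512), not numbers stated above, and the helper names are illustrative.

```python
def latent_shape(height, width, factor=8, channels=4):
    # The encoder shrinks each spatial side by `factor` (assumed value).
    return (height // factor, width // factor, channels)

def num_values(shape):
    h, w, c = shape
    return h * w * c

pixel_shape = (512, 512, 3)          # RGB image
z_shape = latent_shape(512, 512)     # (64, 64, 4)
compression = num_values(pixel_shape) / num_values(z_shape)
print(z_shape, compression)
```

Under these assumptions, every denoising step operates on 48 times fewer values than it would in pixel space, and the decoder is run only once at the end.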

§ 00.09 · Paper 09 — 2022

Reasoning needs room to compute.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · Wei et al. · arXiv 2201.11903 ↗

A model that must answer in one breath gets hard problems wrong — there's nowhere to put the intermediate work. Chain-of-thought makes the model write the steps out loud, and each generated token is extra computation spent on the problem. Decomposition isn't a presentation choice; it's a compute budget — and with it, reasoning abilities emerge in large models that simply weren't there before.

Fig 09 — CoT anatomy: 3× the tokens ≈ 3× the serial compute, and on PaLM 540B that turns 17.9% into 56.9% on grade-school math.
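
The only difference between a standard few-shot prompt and a chain-of-thought prompt is what the worked example contains. The sketch below builds both; the question, rationale, and helper names are made up for illustration.

```python
def exemplar(question, answer, rationale=None):
    # A chain-of-thought exemplar spells out the steps before the answer.
    parts = [rationale, answer] if rationale else [answer]
    return f"Q: {question}\nA: {' '.join(parts)}"

def build_prompt(exemplars, new_question):
    return "\n\n".join(exemplars + [f"Q: {new_question}\nA:"])

question = "A shelf holds 5 books. Two boxes arrive with 3 books each. How many books are there now?"
rationale = "The shelf starts with 5 books. 2 boxes of 3 books is 6 books. 5 + 6 = 11."
answer = "The answer is 11."
new_question = "A jar has 4 marbles. Three bags arrive with 2 marbles each. How many marbles are there now?"

standard_prompt = build_prompt([exemplar(question, answer)], new_question)
cot_prompt = build_prompt([exemplar(question, answer, rationale)], new_question)
```

Given cot_prompt, a model tends to imitate the pattern and write its own intermediate steps before the final answer. Those extra generated tokens are the additional serial compute the caption refers to.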

§ 00.10 · Paper 10 — 2022

Align the objective to the intent.

Training Language Models to Follow Instructions with Human Feedback (InstructGPT) · Ouyang et al. · arXiv 2203.02155 ↗

"Predict the next word on the internet" is not the same goal as "be helpful, honest, and harmless." RLHF closes that gap: humans rank the model's outputs, a reward model learns to score like those humans, and reinforcement learning then steers the LM to maximize that learned reward. The optimization target stops being the internet's average and starts being what people actually wanted.

Fig 10 — The real pipeline: SFT (13k demos) → reward model (33k rankings) → PPO with a KL leash (31k prompts). Every chat assistant you use descends from this diagram.
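
Two small formulas carry most of that pipeline, sketched here in plain Python. The first is a pairwise ranking loss for the reward model; the second is the per-sample reward with a KL-style penalty that keeps the policy near the SFT model. The function names and the beta value are illustrative assumptions, and this omits everything else PPO does.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): small when the preferred
    # output already scores higher than the rejected one.
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    # The "KL leash": subtract a penalty when the policy assigns its own
    # output a much higher log-probability than the SFT model does.
    return reward - beta * (logp_policy - logp_sft)

tie = pairwise_loss(0.0, 0.0)       # reward model cannot tell them apart
right = pairwise_loss(2.0, 0.0)     # agrees with the human ranking
wrong = pairwise_loss(0.0, 2.0)     # disagrees with the human ranking
leashed = kl_penalized_reward(1.0, logp_policy=-1.0, logp_sft=-2.0, beta=0.1)
```

Training the reward model pushes right-style cases to dominate. The penalty term is what stops the policy from drifting into outputs the reward model likes but the SFT model would find wildly improbable.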
