microgpt.py Interactive Guide
An interactive exploration of @karpathy's complete GPT implementation in 200 lines of pure Python. Every concept below is live — adjust sliders, step through computations, train the model, and generate new names.
1. Dataset & Tokenizer
The model learns from ~32,000 human names. Each character maps to a token ID (0–25 for a–z), and a special BOS (beginning-of-sequence) token with ID 26 marks both the start and end of every name, giving a vocabulary of 27 tokens.
Token Vocabulary
Try it — type a name:
Sample names from dataset
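The character-to-token mapping above can be sketched in a few lines of plain Python. This assumes lowercase a–z input and BOS = 26, matching the IDs listed above; the guide's actual tokenizer code may differ in details:

```python
BOS = 26  # beginning/end-of-sequence token, as described above

def encode(name: str) -> list[int]:
    """Wrap a name in BOS tokens: [BOS, c1, ..., cn, BOS]."""
    ids = [ord(c) - ord('a') for c in name.lower()]
    return [BOS] + ids + [BOS]

def decode(ids: list[int]) -> str:
    """Drop BOS tokens and map IDs back to characters."""
    return ''.join(chr(i + ord('a')) for i in ids if i != BOS)

print(encode("emma"))          # [26, 4, 12, 12, 0, 26]
print(decode(encode("emma")))  # emma
```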
2. The Value Node & Autograd
Every scalar in the computation is a Value node that records the operation and inputs that produced it, so the full computation forms a directed acyclic graph. Calling backward() propagates gradients from the output back through the graph via the chain rule.
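A minimal Value node along these lines might look like the sketch below. It is written in the spirit of micrograd and is not necessarily the guide's exact implementation (which may store backward closures instead of local gradients):

```python
import math

class Value:
    """A scalar that records how it was computed, for backprop (a sketch)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
c = a * b + a        # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```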
3. Model Parameters
All knowledge is stored in weight matrices initialized with small random values. Click any matrix to inspect its values as a heatmap.
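As an illustrative sketch of that initialization (the matrix sizes and init scale here are assumptions for the demo, not the guide's exact hyperparameters):

```python
import random

random.seed(42)  # reproducible demo

def init_matrix(n_out, n_in, scale=0.08):
    """An n_out x n_in weight matrix of small random values."""
    return [[random.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

# Assumed sizes for illustration: 27 token embeddings of width 16.
wte = init_matrix(27, 16)
print(len(wte), len(wte[0]))                          # 27 16
print(all(abs(w) <= 0.08 for row in wte for w in row))  # True
```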
4. The GPT Architecture
The gpt() function processes one token at a time, using a KV-cache to remember the keys and values of previous tokens so attention never recomputes them. Step through the forward pass to see each operation and its intermediate values.
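The token-at-a-time pattern with a KV-cache can be sketched as below. This is a deliberate simplification: one head, plain floats instead of Value nodes, and identity "projections" in place of the model's learned q/k/v projections:

```python
import math

def attend(q, keys, values):
    """Attention of query q over all cached keys/values (lists of vectors)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                       # subtract max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
    return out, weights

kv_cache = {"k": [], "v": []}
for t, x in enumerate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    # In the real model, q/k/v come from learned projections of x;
    # identity "projections" are used here for brevity.
    kv_cache["k"].append(x)
    kv_cache["v"].append(x)
    out, w = attend(x, kv_cache["k"], kv_cache["v"])
    print(f"step {t}: attends over {len(w)} positions")
```

Because the cache only ever grows, position t can see positions 0..t and nothing later, which is exactly the causal masking the architecture needs.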
5. Attention Deep Dive
Multi-head attention lets each position attend to all previous positions. Each of the 4 heads can learn different patterns. The heatmap shows attention weights — brighter means stronger attention.
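The "4 heads" split is just a reshape of the embedding. A sketch, assuming an embedding width of 16 (so each head works on a 4-dim slice; the guide's actual sizes may differ):

```python
n_head, n_embd = 4, 16
head_dim = n_embd // n_head

q = list(range(n_embd))  # stand-in for a 16-dim query vector
heads = [q[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]
print(heads)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# After each head attends independently (each with its own weights,
# hence different learned patterns), the outputs are concatenated
# back into a single n_embd-dim vector:
merged = [x for head in heads for x in head]
print(merged == q)  # True
```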
6. Training Loop
Each step: pick a name, predict every next character, compute cross-entropy loss, backpropagate gradients, and update weights with Adam. Watch the loss decrease as the model learns.
Current Document
Loss Curve
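The two numeric pieces of each training step can be sketched as below: the cross-entropy loss over the model's logits, and one Adam update per parameter. The hyperparameters shown are common defaults, not necessarily the guide's exact values:

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably via log-sum-exp."""
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(l - m) for l in logits))
    return logsumexp - logits[target]

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    """Return updated (param, m, v) for step t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad            # momentum: running mean of grads
    v = b2 * v + (1 - b2) * grad * grad     # running mean of squared grads
    m_hat = m / (1 - b1 ** t)               # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

print(cross_entropy([0.0, 0.0, 0.0], 0))   # uniform guess over 3 classes: log(3)
p, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
print(p)                                   # parameter nudged against the gradient
```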
7. Inference / Generation
Generate new names by feeding BOS, then sampling tokens one at a time until BOS is produced again. Temperature controls randomness: low values are conservative, high values more creative.
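Temperature sampling amounts to dividing the logits by the temperature before the softmax, then drawing from the resulting distribution. A self-contained sketch (the logits here are made up for the demo):

```python
import math, random

def sample(logits, temperature=1.0):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = random.random(), 0.0          # inverse-CDF sampling
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

random.seed(0)
logits = [2.0, 1.0, 0.5]
print(sample(logits, temperature=0.1))  # low temperature: almost always the argmax
```

As the temperature approaches 0 the distribution collapses onto the largest logit (greedy decoding); large temperatures flatten it toward uniform, which is why high settings produce more unusual names.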