
microgpt.py Interactive Guide

An interactive exploration of @karpathy's complete GPT implementation in 200 lines of pure Python. Every concept below is live — adjust sliders, step through computations, train the model, and generate new names.

Model state: Pre-trained (300 steps)

1. Dataset & Tokenizer

The model learns from ~32,000 human names. Each character maps to a token ID (0–25 for a–z), and a special BOS (beginning-of-sequence) token with ID 26 wraps every name, marking both where it starts and where it ends.

Token Vocabulary

Try it — type a name:

Sample names from dataset

```python
# Tokenizer (lines 23-27)
uchars = sorted(set(''.join(docs)))   # unique characters, sorted
BOS = len(uchars)                     # BOS gets the next free ID (26)
vocab_size = len(uchars) + 1          # 27: 26 letters + BOS

# Tokenize a document
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
```
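As a concrete sketch of the round trip, here is the same logic run on a tiny stand-in `docs` list (three names instead of the full dataset, so the IDs below differ from the 0–25 scheme above):

```python
# Build the character vocabulary from a toy stand-in for the names dataset.
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))   # unique characters, sorted
BOS = len(uchars)                     # BOS gets the next free ID
vocab_size = len(uchars) + 1

# Tokenize: wrap the name in BOS on both sides.
tokens = [BOS] + [uchars.index(ch) for ch in "ava"] + [BOS]

# Detokenize: drop the BOS delimiters and map IDs back to characters.
name = ''.join(uchars[t] for t in tokens[1:-1])
print(tokens, "->", name)
```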

2. The Value Node & Autograd

Every scalar in the computation is a Value node that records how it was computed, so the full computation forms a directed acyclic graph. Calling backward() propagates gradients from the output back through this graph using the chain rule.

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```
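A minimal runnable sketch of the same idea: the class above plus just enough operator overloading (`__add__`, `__mul__`) to check the chain rule on f = x·y + y by hand. The operators here are illustrative additions, not copied from microgpt.py:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

x, y = Value(3.0), Value(2.0)
f = x * y + y          # f = 3*2 + 2 = 8
f.backward()
print(x.grad, y.grad)  # df/dx = y = 2, df/dy = x + 1 = 4
```

Note that y's gradient accumulates from two paths through the graph (the multiply and the add), which is exactly why backward uses `+=`.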

3. Model Parameters

All knowledge is stored in weight matrices initialized with small random values. Click any matrix to inspect its values as a heatmap.

Total parameters: 4,192
```python
# Initialize parameters (lines 74-89)
n_embd = 16; n_head = 4; n_layer = 1; block_size = 16
matrix = lambda nout, nin, std=0.08: [[Value(gauss(0, std)) ...]]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'wpe': matrix(block_size, n_embd),
    'lm_head': matrix(vocab_size, n_embd),
    'layer0.attn_wq/wk/wv/wo': matrix(n_embd, n_embd),
    'layer0.mlp_fc1': matrix(4*n_embd, n_embd),
    'layer0.mlp_fc2': matrix(n_embd, 4*n_embd),
}
```
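With these hyperparameters the total can be tallied by hand from the matrix shapes above (a quick sanity check, assuming one transformer layer and vocab_size = 27):

```python
n_embd, n_head, n_layer, block_size = 16, 4, 1, 16
vocab_size = 27  # 26 letters + BOS

counts = {
    'wte':     vocab_size * n_embd,   # token embeddings: 27 x 16
    'wpe':     block_size * n_embd,   # position embeddings: 16 x 16
    'lm_head': vocab_size * n_embd,   # output projection: 27 x 16
    'attn':    4 * n_embd * n_embd,   # wq, wk, wv, wo: four 16 x 16 matrices
    'mlp_fc1': 4 * n_embd * n_embd,   # 64 x 16
    'mlp_fc2': n_embd * 4 * n_embd,   # 16 x 64
}
per_layer = counts['attn'] + counts['mlp_fc1'] + counts['mlp_fc2']
total = n_layer * per_layer + counts['wte'] + counts['wpe'] + counts['lm_head']
print(total)  # 4192
```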

4. The GPT Architecture

The gpt() function processes one token at a time, using a KV-cache to remember previous tokens. Step through the forward pass to see each operation and its intermediate values.

Click Next to step through the forward pass.
```python
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)
    for li in range(n_layer):
        # Multi-head attention
        q = linear(x, attn_wq); k = ...; v = ...
        # ... attention computation ...
        x = linear(x_attn, attn_wo)
        x = [a + b for a, b in zip(x, x_residual)]
        # MLP block
        x = linear(rmsnorm(x), mlp_fc1)
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)
        x = [a + b for a, b in zip(x, x_residual)]
    return linear(x, lm_head)
```
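The two building blocks the forward pass leans on, rmsnorm and linear, can be sketched on plain floats (the Value-based versions in microgpt.py have the same shape; this is an illustrative re-derivation, not the original code):

```python
def rmsnorm(x, eps=1e-5):
    # Scale the vector so its root-mean-square is ~1.
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / (ms + eps) ** 0.5 for xi in x]

def linear(x, w):
    # w is a list of rows; each output element is a dot product with x.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

x = rmsnorm([1.0, 2.0, 3.0, 4.0])
w = [[1.0, 0.0, 0.0, 0.0],      # picks out x[0]
     [0.25, 0.25, 0.25, 0.25]]  # averages the vector
y = linear(x, w)
print(x, y)
```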

5. Attention Deep Dive

Multi-head attention lets each position attend to all previous positions. Each of the 4 heads can learn different patterns. The heatmap shows attention weights — brighter means stronger attention.

```python
# Attention (lines 124-132)
for h in range(n_head):
    hs = h * head_dim  # offset of this head's slice
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    attn_logits = [sum(q_h[j] * k_h[t][j] ...) / head_dim**0.5
                   for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)
    head_out = [sum(attn_weights[t] * v_h[t][j] ...)
                for j in range(head_dim)]
```
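Stripped of the head slicing, the per-head computation is plain scaled dot-product attention. A float-only sketch with made-up keys and values for two cached positions (the softmax helper is written out here since the snippet above assumes it):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

head_dim = 4
q_h = [1.0, 0.0, 0.0, 0.0]                      # query for the current position
k_h = [[1.0, 0, 0, 0], [0.0, 1.0, 0, 0]]        # cached keys, two past positions
v_h = [[1.0, 2.0, 0, 0], [3.0, 4.0, 0, 0]]      # cached values

attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim ** 0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)             # sums to 1 over past positions
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(k_h)))
            for j in range(head_dim)]
print(attn_weights, head_out)
```

Because q_h aligns with the first key, the first position gets the larger weight, and head_out is pulled toward the first value vector.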

6. Training Loop

Each step: pick a name, predict every next character, compute cross-entropy loss, backpropagate gradients, and update weights with Adam. Watch the loss decrease as the model learns.

Interactive panel: step counter, loss and learning-rate readouts (initial LR 0.0100), the current training document, and a live loss curve.

```python
# Training loop (lines 153-184)
for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

    # Forward: compute the loss at every position
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1/n) * sum(losses)

    # Backward
    loss.backward()

    # Adam update
    for i, p in enumerate(params):
        m[i] = beta1*m[i] + (1-beta1)*p.grad
        v[i] = beta2*v[i] + (1-beta2)*p.grad**2
        p.data -= lr * m_hat / (v_hat**0.5 + eps)
```
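Adam's moment estimates and bias correction can be seen on a one-parameter toy problem, minimizing (p − 3)². The variable names mirror the update above; the loss, learning rate, and beta values here are illustrative, not microgpt.py's:

```python
p = 0.0                    # parameter to fit; the minimum is at p = 3
m = v = 0.0                # first and second moment estimates
beta1, beta2, lr, eps = 0.9, 0.95, 0.1, 1e-8

for t in range(1, 501):
    grad = 2 * (p - 3.0)   # d/dp of (p - 3)^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction: without it, the
    v_hat = v / (1 - beta2 ** t)  # zero-initialized moments are too small early on
    p -= lr * m_hat / (v_hat ** 0.5 + eps)

print(p)
```

Because the update divides by the gradient's running magnitude, the step size is roughly lr regardless of how large the raw gradient is, which is what makes a single learning rate workable across all parameters.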

7. Inference / Generation

Generate new names by feeding BOS, then sampling tokens one at a time. Temperature controls randomness — low = conservative, high = creative.

```python
# Inference (lines 186-200)
temperature = 0.5
for sample_idx in range(20):
    sample = []
    token_id = BOS
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size),
                                  weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
```
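The effect of temperature shows up directly on a fixed set of logits: dividing by a small temperature sharpens the softmax toward greedy picking, while a large one flattens it toward uniform sampling (the logits here are made up for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax([l / 0.2 for l in logits])  # low temperature: near-greedy
flat = softmax([l / 2.0 for l in logits])   # high temperature: near-uniform
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```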