
microgpt.py Interactive Guide

An interactive exploration of @karpathy's complete GPT implementation in 200 lines of pure Python. Every concept below is live — adjust sliders, step through computations, train the model, and generate new names.

Model state: Pre-trained (300 steps)

1. Dataset & Tokenizer

The model learns from ~32,000 human names. Each character maps to a token ID (0–25 for a–z), and a special BOS (beginning-of-sequence) token with ID 26 wraps every name, marking both where it starts and where it ends.

Token Vocabulary

Try it — type a name:

Sample names from dataset

```python
# Tokenizer (lines 23-27)
uchars = sorted(set(''.join(docs)))   # unique characters, sorted
BOS = len(uchars)                     # BOS gets the next free ID (26)
vocab_size = len(uchars) + 1          # 27: 26 letters + BOS

# Tokenize a document
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
```
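As a concrete sketch of the round trip, here is the same logic run on a tiny stand-in `docs` list (three names instead of the full dataset, so the IDs below differ from the 0–25 scheme above):

```python
# Build the character vocabulary from a toy stand-in for the names dataset.
docs = ["emma", "olivia", "ava"]
uchars = sorted(set(''.join(docs)))   # unique characters, sorted
BOS = len(uchars)                     # BOS gets the next free ID
vocab_size = len(uchars) + 1

# Tokenize: wrap the name in BOS on both sides.
tokens = [BOS] + [uchars.index(ch) for ch in "ava"] + [BOS]

# Detokenize: drop the BOS delimiters and map IDs back to characters.
name = ''.join(uchars[t] for t in tokens[1:-1])
print(tokens, "->", name)
```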

2. The Value Node & Autograd

Every scalar in the computation is a Value node that records how it was computed, so the full computation forms a directed acyclic graph. Calling backward() propagates gradients from the output back through this graph using the chain rule.

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```
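A minimal runnable sketch of the same idea: the class above plus just enough operator overloading (`__add__`, `__mul__`) to check the chain rule on f = x·y + y by hand. The operators here are illustrative additions, not copied from microgpt.py:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

x, y = Value(3.0), Value(2.0)
f = x * y + y          # f = 3*2 + 2 = 8
f.backward()
print(x.grad, y.grad)  # df/dx = y = 2, df/dy = x + 1 = 4
```

Note that y's gradient accumulates from two paths through the graph (the multiply and the add), which is exactly why backward uses `+=`.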

3. Model Parameters

All knowledge is stored in weight matrices initialized with small random values. Click any matrix to inspect its values as a heatmap.

Total parameters: 4,192
```python
# Initialize parameters (lines 74-89)
n_embd = 16; n_head = 4; n_layer = 1; block_size = 16
matrix = lambda nout, nin, std=0.08: [[Value(gauss(0, std)) ...]]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'wpe': matrix(block_size, n_embd),
    'lm_head': matrix(vocab_size, n_embd),
    'layer0.attn_wq/wk/wv/wo': matrix(n_embd, n_embd),
    'layer0.mlp_fc1': matrix(4*n_embd, n_embd),
    'layer0.mlp_fc2': matrix(n_embd, 4*n_embd),
}
```
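With these hyperparameters the total can be tallied by hand from the matrix shapes above (a quick sanity check, assuming one transformer layer and vocab_size = 27):

```python
n_embd, n_head, n_layer, block_size = 16, 4, 1, 16
vocab_size = 27  # 26 letters + BOS

counts = {
    'wte':     vocab_size * n_embd,   # token embeddings: 27 x 16
    'wpe':     block_size * n_embd,   # position embeddings: 16 x 16
    'lm_head': vocab_size * n_embd,   # output projection: 27 x 16
    'attn':    4 * n_embd * n_embd,   # wq, wk, wv, wo: four 16 x 16 matrices
    'mlp_fc1': 4 * n_embd * n_embd,   # 64 x 16
    'mlp_fc2': n_embd * 4 * n_embd,   # 16 x 64
}
per_layer = counts['attn'] + counts['mlp_fc1'] + counts['mlp_fc2']
total = n_layer * per_layer + counts['wte'] + counts['wpe'] + counts['lm_head']
print(total)  # 4192
```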

4. The GPT Architecture

The gpt() function processes one token at a time, using a KV-cache to remember previous tokens. Step through the forward pass to see each operation and its intermediate values.

Click Next to step through the forward pass.
```python
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)
    for li in range(n_layer):
        # Multi-head attention
        q = linear(x, attn_wq); k = ...; v = ...
        # ... attention computation ...
        x = linear(x_attn, attn_wo)
        x = [a + b for a, b in zip(x, x_residual)]
        # MLP block
        x = linear(rmsnorm(x), mlp_fc1)
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)
        x = [a + b for a, b in zip(x, x_residual)]
    return linear(x, lm_head)
```
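The two building blocks the forward pass leans on, rmsnorm and linear, can be sketched on plain floats (the Value-based versions in microgpt.py have the same shape; this is an illustrative re-derivation, not the original code):

```python
def rmsnorm(x, eps=1e-5):
    # Scale the vector so its root-mean-square is ~1.
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / (ms + eps) ** 0.5 for xi in x]

def linear(x, w):
    # w is a list of rows; each output element is a dot product with x.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

x = rmsnorm([1.0, 2.0, 3.0, 4.0])
w = [[1.0, 0.0, 0.0, 0.0],      # picks out x[0]
     [0.25, 0.25, 0.25, 0.25]]  # averages the vector
y = linear(x, w)
print(x, y)
```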

5. Attention Deep Dive

Multi-head attention lets each position attend to all previous positions. Each of the 4 heads can learn different patterns. The heatmap shows attention weights — brighter means stronger attention.

```python
# Attention (lines 124-132)
for h in range(n_head):
    hs = h * head_dim  # offset of this head's slice
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]
    attn_logits = [sum(q_h[j] * k_h[t][j] ...) / head_dim**0.5
                   for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)
    head_out = [sum(attn_weights[t] * v_h[t][j] ...)
                for j in range(head_dim)]
```
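Stripped of the head slicing, the per-head computation is plain scaled dot-product attention. A float-only sketch with made-up keys and values for two cached positions (the softmax helper is written out here since the snippet above assumes it):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

head_dim = 4
q_h = [1.0, 0.0, 0.0, 0.0]                      # query for the current position
k_h = [[1.0, 0, 0, 0], [0.0, 1.0, 0, 0]]        # cached keys, two past positions
v_h = [[1.0, 2.0, 0, 0], [3.0, 4.0, 0, 0]]      # cached values

attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim ** 0.5
               for t in range(len(k_h))]
attn_weights = softmax(attn_logits)             # sums to 1 over past positions
head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(k_h)))
            for j in range(head_dim)]
print(attn_weights, head_out)
```

Because q_h aligns with the first key, the first position gets the larger weight, and head_out is pulled toward the first value vector.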

6. Training Loop

Each step: pick a name, predict every next character, compute cross-entropy loss, backpropagate gradients, and update weights with Adam. Watch the loss decrease as the model learns.

Interactive panel: step counter, loss and learning-rate readouts (initial LR 0.0100), the current training document, and a live loss curve.

```python
# Training loop (lines 153-184)
for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

    # Forward: compute the loss at every position
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1/n) * sum(losses)

    # Backward
    loss.backward()

    # Adam update
    for i, p in enumerate(params):
        m[i] = beta1*m[i] + (1-beta1)*p.grad
        v[i] = beta2*v[i] + (1-beta2)*p.grad**2
        p.data -= lr * m_hat / (v_hat**0.5 + eps)
```
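Adam's moment estimates and bias correction can be seen on a one-parameter toy problem, minimizing (p − 3)². The variable names mirror the update above; the loss, learning rate, and beta values here are illustrative, not microgpt.py's:

```python
p = 0.0                    # parameter to fit; the minimum is at p = 3
m = v = 0.0                # first and second moment estimates
beta1, beta2, lr, eps = 0.9, 0.95, 0.1, 1e-8

for t in range(1, 501):
    grad = 2 * (p - 3.0)   # d/dp of (p - 3)^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction: without it, the
    v_hat = v / (1 - beta2 ** t)  # zero-initialized moments are too small early on
    p -= lr * m_hat / (v_hat ** 0.5 + eps)

print(p)
```

Because the update divides by the gradient's running magnitude, the step size is roughly lr regardless of how large the raw gradient is, which is what makes a single learning rate workable across all parameters.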

7. Inference / Generation

Generate new names by feeding BOS, then sampling tokens one at a time. Temperature controls randomness — low = conservative, high = creative.

```python
# Inference (lines 186-200)
temperature = 0.5
for sample_idx in range(20):
    sample = []
    token_id = BOS
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size),
                                  weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
```
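The effect of temperature shows up directly on a fixed set of logits: dividing by a small temperature sharpens the softmax toward greedy picking, while a large one flattens it toward uniform sampling (the logits here are made up for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax([l / 0.2 for l in logits])  # low temperature: near-greedy
flat = softmax([l / 2.0 for l in logits])   # high temperature: near-uniform
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```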