A Field Guide to NanoGPT Speedrun Optimizations

Near the end of 2022, Andrej Karpathy released NanoGPT¹, a small, hackable GPT. A few years later, Keller Jordan forked off the Modded NanoGPT² repository and launched the NanoGPT speedrun, which sought to improve NanoGPT’s training speed on a single 8xH100 node as much as possible.

Since then, Modded NanoGPT has gone from taking 45 minutes to train a model equivalent to GPT-2 Small to taking less than 90 seconds. Using a similar setup, Andrej Karpathy’s NanoGPT successor, NanoChat³, has reduced the time needed to train a GPT-2 XL level model from 168 hours to around 2 hours. These improvements have massively changed the economics of small-model training, making it possible to train a fully functional model for only around $50.

This post covers some of the most interesting and impactful upgrades used in both speedruns, with the goal of showing how model training got so fast. These improvements are presented in (approximate) order of importance.

Muon

Muon⁴ is an optimizer designed for the 2D parameter matrices of neural networks, and is used in both Modded NanoGPT and NanoChat. Notably, Muon has proven to be very effective even for larger models. For example, Moonlight⁵ estimates that Muon gives approximately 2x computational efficiency when compared to AdamW. For parameters that are not neural net parameter matrices, both NanoGPT and NanoChat use AdamW instead. Muon uses standard stochastic gradient descent with momentum, but orthogonalizes the update matrix using a single Newton-Schulz iteration. Why does this help so much? The original Muon blog post⁴ conjectures that

The updates produced by both SGD-momentum and Adam for the 2D parameters in transformer-based neural networks typically have very high condition number. That is, they are almost low-rank matrices, with the updates for all neurons being dominated by just a few directions. We speculate that orthogonalization effectively increases the scale of other “rare directions” which have small magnitude in the update but are nevertheless important for learning.

Modded NanoGPT and NanoChat use an efficient distributed version of Muon that distributes the Newton-Schulz iteration over multiple GPUs⁶.

Cautious Weight Decay

Cautious weight decay⁷ only applies weight decay when the update and the weight have the same sign ($\text{update} \times \text{weight} > 0$). The intuition for this is that if the signs are different, then the update is already pulling the weight back towards zero. Both Modded NanoGPT and NanoChat use cautious weight decay. One slight difference between the two is that Modded NanoGPT uses cautious weight decay for Adam as well, while NanoChat has no weight decay for Adam.

Momentum Warmup

Both Modded NanoGPT and NanoChat gradually increase the momentum used for Muon from 0.85 to 0.95. The intuition for this is that the loss landscape changes more rapidly early on, so lower momentum is desirable⁸.

Attention

This section covers various improvements to the attention mechanism, mainly along two lines:

Moving to more optimized implementations of the attention computation
Alterations to the size of the attention window

FlashAttention

The FlashAttention series of optimized attention implementations gives a major speedup to attention computations⁹ ¹⁰ ¹¹. Both Modded NanoGPT and NanoChat use FlashAttention 3.

Sliding Window Attention

Both Modded NanoGPT and NanoChat use a short-long pattern for attention windows, where there are several short-window attention layers, followed by one long-window layer. Both Modded NanoGPT and NanoChat use a repeating SSSL pattern, where there are three short windows followed by a long window that is twice the length. However, the final layer is forced to always be a long window. This pattern is slightly altered for Modded NanoGPT, due to there being only 11 layers, so Modded NanoGPT only uses long-window attention on layers 4 and 11.

A similar method of alternating short and long attention windows was used in GPT-3¹².

Attention Window Warmup

Modded NanoGPT uses an attention window schedule, where the size of the attention window is gradually increased¹³. One disadvantage of changing the window size when using FlashAttention 3 is that each change requires some recompilation. Due to this, Modded NanoGPT increases the window size in a few large steps throughout training.

An example of alternating attention window lengths, with window length increasing over time. Figure by Franz Louis Cesista.

ClimbMix

While Modded NanoGPT does not allow changing the training data, NanoChat has experimented with several different training datasets, eventually settling on Nemotron-ClimbMix¹⁴. ClimbMix was created through the following procedure:

Begin with data from:
- Nemotron-CC¹⁵, which filters Common Crawl¹⁶ data for the highest-quality data
- SmolLM-Corpus¹⁷, a mixture of general educational data from FineWeb-Edu¹⁸, programming data from The Stack¹⁹, and synthetic data
Cluster the data, creating 21 semantic clusters that cover a variety of different topics
Using a small model, learn the optimal data mixture for performance on downstream tasks

This results in a high-quality dataset that appropriately weights the frequency of various topics.

Nemotron-ClimbMix topic clusters. Figure from Diao et al.

Modded NanoGPT uses FineWeb-Edu, a deduplicated and filtered selection of Common Crawl that was further filtered for data with high educational quality.

Architectural Modernizations

This section includes architectural modifications that are already relatively well-known, and are implemented in both Modded NanoGPT and NanoChat with minimal adjustments. Additionally, many of these adjustments are fairly standard in modern open-source LLMs. In that sense, the modifications in this section can be considered to be bringing Modded NanoGPT and NanoChat up to speed with modern Transformer implementations.

Pre-Norm

Pre-norm²⁰ moves the normalization blocks out of the residual stream, placing them at the start of each sub-block (attention / MLP). This controls the gradient norm and increases training stability. This was also used in GPT-2²¹, along with many other models.

Post-norm (left) versus pre-norm (right). Figure from Xiong et al.

QKNorm

QKNorm²² normalizes each query vector $q_i$ and key vector $k_j$ after splitting along the head dimension and applying RoPE. While the original QKNorm paper uses the $\ell_2$-norm, both NanoChat and Modded NanoGPT apply RMSNorm instead. The primary benefit of this improvement is stopping the softmaxes in the attention computation from being easily saturated by large key or query vectors, allowing for more diverse attention patterns. This also improves training by stopping the attention gradients from growing extremely small. However, one disadvantage of this approach is that it can lead to the model not being able to sufficiently “focus” on important tokens, which has led to some later efforts to tweak the attention scale²³.

Attention without QK-norm (above) versus attention with QK-norm (below). Figures from Henry et al.

RMSNorm

RMSNorm²⁴ normalizes all input vectors $\mathbf{a}$ according to the formula

\[\overline{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} g_i\]

Where $a_i$ is the $i$-th value of $\mathbf{a}$, $g_i$ is a learnable gain parameter, and RMS is the root mean square of the relevant vector. Both Modded NanoGPT and NanoChat also drop the gain parameter $g_i$, yielding

\[\overline{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})}\]

RMSNorm is a simpler and faster normalization method than LayerNorm²⁵, which was used in the original Transformer.

RoPE

Rotary Position Embeddings²⁶ adds positional information to the query and key vectors before performing the attention computation, rather than adding positional information when constructing the initial hidden state. More specifically, RoPE rotates pairs of dimensions in the query and key vectors based on each token’s position and the rotation speed for that dimension pair. RoPE helps models learn to include positional information without adding extra parameters.

ReLU²

The ReLU² activation function²⁷ has been previously shown to strike a good balance between having high sparsity and good performance²⁸. ReLU² is defined as

\[\text{ReLU}^2(x) = \max(0, x)^2\]

Many modern LLMs instead use SwiGLU²⁹ as their activation function. SwiGLU was tested in NanoChat on several scales, but consistently gave decreased performance.

Untied Embeddings

The original NanoGPT had tied embeddings, where the LM head matrix is the transpose of the input embedding matrix (as recommended by³⁰). Both Modded NanoGPT and NanoChat untied these matrices, which is common in modern LLMs. One of the reasons why untying is likely to be beneficial in this case is because it increases the parameter count without causing a corresponding increase in the number of FLOPs per pass.

As a side note, tied embeddings can cause some unexpected issues (see Neel Nanda on the SolidGoldMagikarp token for one interesting example³¹).

Skip Connections & Value Embeddings

Value Embeddings

Value residual learning³² proposes modifying the computed value matrix $V_n$ on layer $n$ according to the formula

\[V'_n = \lambda_{n,1} \cdot V_n + \lambda_{n,2} \cdot V_1\]

This is essentially mixing together the values on layer $n$ with the values on layer 1, allowing access to the computed values from lower layers. A variant of this was originally implemented in Modded NanoGPT as

\[V'_n = (1 - \lambda_{n}) \cdot V_n + \lambda_{n} \cdot V_1\]

where $\lambda_{n}$ is a learnable parameter for each layer³³.

Later, this formula was changed to mix $V_n$ and $VE_n$, where $VE_n$ are learnable value embeddings for the given layer³⁴. More specifically, each layer is given an embedding matrix, which is then used to get layerwise value embeddings for each token, $VE_n$. The primary motivation for this change was to provide token-specific features to the attention values. This change replaces the original value residual learning formula with

\[V'_n = (1 - \lambda_{n}) \cdot V_n + \lambda_{n} \cdot VE_n\]

Both Modded NanoGPT and NanoChat use value embeddings. However, they structure them slightly differently:

NanoChat adds value embeddings to every other layer
Modded NanoGPT shares value embeddings in a U-Net style pattern³⁵ ³⁶. For example, in the 12 layer architecture used at the time, layers 1 and 12 would share the same value embeddings, layers 2 and 11 would have the same value embeddings, etc.

Value embeddings seem to be particularly helpful for model speedrunning because they add a large number of parameters without causing a correspondingly large increase in the number of FLOPs per token.

Hidden State Residual Connections

Modded NanoGPT also adds skip connections for the hidden state³³, updating the hidden state at the start of layer $n$ according to

\[X'_n = \lambda_{n,1} \cdot X_n + \lambda_{n,2} \cdot X_1\]

Since $X_1$ is the hidden state at the start of the first layer, this is equivalent to mixing the current layer’s hidden state and the relevant token embeddings.

This was later modified to use a U-Net style architecture³⁵, adding connections to previous hidden states³⁷. This changed the hidden state residual formula to

\[X'_n = \lambda_{n,1} \cdot X_n + \lambda_{n,2} \cdot X_1 + \lambda_{n,3} \cdot X_k\]

This change was only made for $n > \frac{\text{layers}}{2}$, with $k = \text{layers} - n + 1$. For example, with the 12 layer architecture used at the time, only layers 7 through 12 would use this updated formula. Layer 6 would feed into 7, 5 into 8, and so on, up to 1 into 12.

Logit Soft-Capping

Gemma 2³⁸ applies a soft cap to logits, using the following formula

\[\text{logits} \gets \text{soft_cap} \cdot \tanh\!\left(\frac{\text{logits}}{\text{soft_cap}}\right)\]

This limits the range of the logits to a range of $[-\text{soft_cap}, +\text{soft_cap}]$. One intuition for why this is helpful is that it limits the model’s confidence; for small models, it is less likely that they will be completely certain of the next token, so capping the logits stops them from expressing 100% certainty. This also helps with vanishing gradient issues.

Both Modded NanoGPT and NanoChat implement this soft cap with $\text{soft_cap} = 15$. However, they only use the soft cap for the LM head, while Gemma 2 applies it to the attention logits as well. It is likely that the optimal value of the softcap would be larger for larger models, since the ability to express increased confidence would become more useful.

Low Precision

Both Modded NanoGPT and NanoChat have attempted to reduce the precision of their models, but in different ways.

Modded NanoGPT uses FP8 for the LM head only. This was also tested in NanoChat, but did not give a significant benefit.

Instead, NanoChat uses FP8 for all linear layers, which gives a speedup of approximately 17% tokens per second during training, but takes more tokens to reach the same validation loss, resulting in the speedup being smaller overall (~5% speedup). This seems to give greater benefits for larger models, as testing full FP8 on smaller models made them slower overall. Full FP8 is most effective when using tensorwise scaling, rather than rowwise scaling.

Miscellaneous Improvements

Aligning to Multiples of 64

Both Modded NanoGPT and NanoChat inherit padded embeddings from the original NanoGPT. While the original NanoGPT had a vocabulary size of 50,257, padding to the next multiple of 64 (50,304) gave a major speedup³⁹. In general, it is important to make sure that matrix dimensions are multiples of a sufficiently large power of 2 (16, 32, 64, 128, etc).

Zero-Initialized Output Layers

In appendix D.2, Tensor Programs V ⁴⁰ suggests initializing output layers as zero, which makes the optimal hyperparameters for different model sizes match more closely. Both Modded NanoGPT and NanoChat follow this recommendation by initializing the final attention projection layer as zero and initializing the output layer of all MLPs as zero. Interestingly, this gives a speedup on Modded NanoGPT, despite Tensor Programs V only recommending it as a way to improve hyperparameter transfer, and simply noting that “we do not find this modification to be detrimental to performance”.

References

Karpathy 2022, nanoGPT ↩
Jordan et al 2024, modded-nanogpt README ↩ ↩²
Karpathy 2025, nanochat: The best ChatGPT that $100 can buy ↩
Jordan et al 2024, Muon: An optimizer for hidden layers in neural networks ↩ ↩²
Liu et al 2025, Muon is Scalable for LLM Training ↩
Keller Jordan post on X ↩
Chen et al 2025, Cautious Weight Decay ↩
Keller Jordan post on X ↩
Dao et al 2022, FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness ↩
Dao 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning ↩
Shah et al 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision ↩
Brown et al 2020, Language Models are Few-Shot Learners ↩
Fern post on X ↩
Diao et al 2025, Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training ↩
Su et al 2024, Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset ↩
Common Crawl ↩
Allal et al 2024, SmolLM-Corpus ↩
Penedo et al 2024, The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale ↩
Kocetkov et al 2022, The Stack: 3 TB of permissively licensed source code ↩
Xiong et al 2020, On Layer Normalization in the Transformer Architecture ↩
Radford et al 2019, Language Models are Unsupervised Multitask Learners ↩
Henry et al 2020, Query-Key Normalization for Transformers ↩
Franz Louis Cesista post on X ↩
Zhang & Sennrich 2019, Root Mean Square Layer Normalization ↩
Ba et al 2016, Layer Normalization ↩
Su et al 2023, RoFormer: Enhanced Transformer with Rotary Position Embedding ↩
So et al 2022, Primer: Searching for Efficient Transformers for Language Modeling ↩
Zhang et al 2024, ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs ↩
Shazeer 2020, GLU Variants Improve Transformer ↩
Press & Wolf 2017, Using the Output Embedding to Improve Language Models ↩
Neel Nanda 2023, comment on “SolidGoldMagikarp (plus, prompt generation)” ↩
Zhou et al 2025, Value Residual Learning ↩
Grad62304977 post on X ↩ ↩²
Braden Koszarsky post on X ↩
Ronneberger et al 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation ↩ ↩²
You Jiacheng post on X ↩
Brendan Hogan post on X ↩
Gemma Team 2024, Gemma 2: Improving Open Language Models at a Practical Size ↩
Karpathy post on X ↩
Yang et al 2022, Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer ↩
Karpathy 2026, Beating GPT-2 for «$100: the nanochat journey ↩
Karpathy 2025, dev/LOG.md ↩
Dial 2025, How the NanoGPT Speedrun WR dropped by 20% in 3 months ↩
Srivastava 2025, Muon in Modded NanoGPT ↩