Heuristic Model-Scaling Projection for Learning Rate Estimation

10605 Machine Learning with Large Datasets

Training large-scale models like GPT-2 from scratch is computationally expensive, making full-scale hyperparameter sweeps impractical. This project explores a zero-shot hyperparameter tuning strategy, Heuristic Model-Scaling Projection (MSP), to estimate optimal learning rates at large model sizes. The goal was to identify the optimal learning rate for a 124M-parameter GPT-2 model by training significantly smaller models and extrapolating from their performance trends.

I used a log-uniform sampling strategy to test 39 learning rates ranging from 1e-4 to 1e-1. Each sampled learning rate was evaluated by training a simplified GPT-2 model (1 transformer layer, 64 channels, 1 attention head) for 1000 steps. After identifying the best learning rate at this smallest scale, I repeated the process for two larger models: one with 2 layers, 128 channels, and 2 heads (Tinier), and one with 3 layers, 192 channels, and 3 heads (Tiny). I recorded the best learning rate for each size, then plotted these points against model parameter count on a log-log scale. A regression line fitted through the three data points modeled the functional relationship between model size and optimal learning rate, and this line was extrapolated to predict the optimal learning rate for the 124M-parameter GPT-2 Small model.
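The log-uniform sampling step can be sketched in a few lines of NumPy. The count (39) and the range (1e-4 to 1e-1) come from the text; the RNG seed is an arbitrary choice for reproducibility, not part of the original method:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, not from the report

# Log-uniform sampling: draw exponents uniformly in [-4, -1], then
# exponentiate, so samples spread evenly across orders of magnitude
# instead of clustering near the top of the range.
n_samples = 39
exponents = rng.uniform(low=-4.0, high=-1.0, size=n_samples)
learning_rates = np.sort(10.0 ** exponents)

# Each of these would then be used for a 1000-step training run of the
# smallest GPT-2 variant, keeping the one with the best validation loss.
print(learning_rates.min(), learning_rates.max())
```

Sampling the exponent rather than the rate itself is what makes the sweep cover 1e-4-scale and 1e-1-scale values with equal density.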

Performance Comparison: Pareto Frontier

The log-log regression showed a smooth, upward-trending relationship between model size and optimal learning rate. The projected optimal learning rate for the full 124M-parameter model was 0.0064, marked by the red X in the plot above. This result aligned well with the range of learning rates that produced stable convergence in early tests. The sublinear growth pattern suggests that while larger models benefit from slightly higher learning rates, the increase is not drastic.
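The fit-and-extrapolate step amounts to linear regression in log-log space. A minimal sketch follows; the three parameter counts and best learning rates below are placeholders standing in for the actual sweep results, which the report does not list:

```python
import numpy as np

# Placeholder (parameter count, best LR) pairs for the three small models
# (smallest, Tinier, Tiny); the real sweep results would go here.
param_counts = np.array([0.2e6, 0.8e6, 2.5e6])  # assumed values
best_lrs = np.array([3e-3, 4e-3, 5e-3])         # assumed values

# Fit a line in log-log space: log10(lr) = a * log10(params) + b.
a, b = np.polyfit(np.log10(param_counts), np.log10(best_lrs), deg=1)

# Extrapolate the fitted line out to GPT-2 Small (124M parameters).
predicted_lr = 10.0 ** (a * np.log10(124e6) + b)
print(predicted_lr)
```

Because the slope `a` is the exponent of a power law `lr ∝ params^a`, a small positive `a` reproduces the sublinear upward trend described above.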

This project demonstrates that model-scaling laws can be leveraged to estimate effective hyperparameters without full-scale training. The MSP approach provided a strong prediction for the optimal learning rate at 124M parameters while using only a fraction of the total compute budget. Sampling from a log-uniform distribution allowed broad coverage of the learning rate space, and the projection method proved both practical and interpretable. The success of this approach suggests its utility in future large-scale training scenarios where budget constraints limit the number of full runs. Extensions to this work could include applying similar scaling analysis to other hyperparameters (e.g., batch size or weight decay), and combining it with data-scaling projections or pruning strategies for even greater efficiency.