OpenAI’s most durable moat is that they have the most real-world usage, leading engineering talent, and can continue to race ahead of others with future models.
The most interesting aspect of GPT-4 is understanding why they made certain architectural decisions.
Materials
GPT-4's details are leaked. It is over. Everything is here: twitter.com/i/web/status/1…
Parameter count: GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.
Mixture of Experts - Confirmed. OpenAI was able to keep costs reasonable by using a mixture-of-experts (MoE) model. They use 16 experts within the model, each with about ~111B parameters for the MLP. Two of these experts are routed to per forward pass.
MoE routing: While the literature talks a lot about advanced routing algorithms for choosing which experts each token is sent to, OpenAI's is allegedly quite simple for the current GPT-4 model. There are roughly ~55B shared parameters for attention.
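The routing pattern described here (each token sent to 2 of 16 expert MLPs via a simple gate) can be sketched roughly as follows. This is a minimal, generic top-2 MoE layer for illustration only; the class name, dimensions, and the linear-plus-softmax gate are assumptions, not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTop2MoE(nn.Module):
    """Minimal top-2 mixture-of-experts MLP layer (illustrative sketch only)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Simple linear router: one score per expert per token.
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.gate(x)                           # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick 2 experts per token
        weights = F.softmax(weights, dim=-1)            # normalize the 2 gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out
```

For example, `SimpleTop2MoE(d_model=1024, d_ff=4096)` would route each token through 2 of its 16 expert MLPs, so only ~1/8 of the expert parameters are exercised per token.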
Inference: Each forward pass (generation of 1 token) uses only ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model.
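These figures are consistent with the parameter counts quoted above. A quick back-of-the-envelope check (all numbers taken from the figures in this post):

```python
# Back-of-the-envelope parameter accounting, in billions of parameters.
shared_attention = 55     # ~55B shared attention parameters
expert_mlp       = 111    # ~111B parameters per expert MLP
n_experts        = 16
experts_per_tok  = 2

active = shared_attention + experts_per_tok * expert_mlp   # ~277B -> matches "~280B per token"
total  = shared_attention + n_experts * expert_mlp          # ~1831B -> matches "~1.8T total"
print(active, total)
```

The quoted FLOP figures also sit at roughly 2x the corresponding parameter counts, matching the usual ~2 FLOPs per parameter per token estimate for a forward pass.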
Dataset: GPT-4 was trained on ~13T tokens. These are not unique tokens; repeated epochs are counted as additional tokens. Epoch counts: 2 epochs for text-based data and 4 for code-based data. There are millions of rows of instruction fine-tuning data from ScaleAI and from internal sources.
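As a toy illustration of how epoch counting inflates the total: the split between unique text and code tokens below is purely hypothetical (it is not public); only the epoch counts and the ~13T total come from the figures above.

```python
# Hypothetical unique-token split (in trillions), chosen only so the
# epoch-weighted total lands near the quoted ~13T.
unique_text_tokens = 4.5
unique_code_tokens = 1.0

total_seen = unique_text_tokens * 2 + unique_code_tokens * 4   # 2 text epochs, 4 code epochs
print(total_seen)  # 13.0 trillion tokens, counting repeats
```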
GPT-4 32K: There was an 8k context length (seqlen) for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning the 8k model after pre-training.
Batch size: The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million tokens. This, of course, is “only” a batch size of 7.5 million tokens per expert, since not every expert sees every token.
For the real batch size: divide this number by the seqlen to get the number of sequences per batch. (Just stop with these misleading numbers already.)
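Concretely, using the numbers quoted above (and assuming the 8k pre-training seqlen for the division):

```python
batch_tokens    = 60_000_000   # tokens per batch at the end of the ramp
seq_len         = 8_192        # pre-training context length
n_experts       = 16
experts_per_tok = 2

sequences_per_batch = batch_tokens / seq_len                      # ~7,300 sequences per batch
tokens_per_expert   = batch_tokens * experts_per_tok / n_experts  # 7.5M tokens per expert
print(sequences_per_batch, tokens_per_expert)
```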
Parallelism strategies: To parallelize across all their A100 GPUs, they utilized 8-way tensor parallelism, as that is the limit for NVLink. Beyond that, they used 15-way pipeline parallelism. (They likely used ZeRO Stage 1; it is possible they used block-level FSDP.)
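A rough per-GPU accounting under those figures (a sketch only; it assumes parameters are split evenly across ranks and ignores embeddings and replication details):

```python
total_params_b = 1_800   # ~1.8T parameters, in billions
tp = 8                   # tensor parallelism (NVLink limit)
pp = 15                  # pipeline parallelism

model_parallel_ranks = tp * pp                        # 120-way model parallelism
params_per_rank_b = total_params_b / model_parallel_ranks
print(model_parallel_ranks, params_per_rank_b)        # 120 ranks, ~15B parameters each
# ZeRO Stage 1 would additionally shard optimizer state across the data-parallel
# replicas of each rank, without further splitting the parameters themselves.
```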