Questions

  1. What do the processes and infrastructure for training and deploying LLMs, and for operating LLM-based services, look like in detail?

Machine Learning Training Components

If we boil machine learning model training down to its simplest form, there are two major components of a model's training time.

  1. Compute (FLOPS): Running dense matrix multiplication within each layer

    In the past, the dominant factor in machine learning training time was compute: waiting on matrix multiplies. As Nvidia's GPUs continued to develop, this quickly faded as the primary concern.

    Nvidia's FLOPS have increased by multiple orders of magnitude, partly by riding Moore's Law, but primarily through architectural changes such as the tensor core and lower-precision floating-point formats. Memory, in contrast, has not followed the same path.

    [Chart: growth of Nvidia GPU FLOPS vs. memory bandwidth over time]

  2. Memory (bandwidth): Waiting for data or layer weights to reach the compute resources. Common examples of bandwidth-constrained operations are the various normalizations, pointwise operations, softmax, and ReLU.
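The compute-vs-bandwidth distinction above can be made concrete with a roofline-style check on arithmetic intensity (FLOPs per byte moved). The hardware figures below are illustrative A100 specs assumed for this sketch, not numbers from the article:

```python
# Sketch: classify an op as compute- or memory-bound via arithmetic intensity.
# Assumed A100 figures: ~312 TFLOPS BF16 tensor-core peak, ~2.0 TB/s HBM bandwidth.
PEAK_FLOPS = 312e12          # FLOP/s
PEAK_BANDWIDTH = 2.0e12      # bytes/s
# FLOPs per byte needed to saturate compute instead of memory (~156)
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH

def bound(flops, bytes_moved):
    """Return which resource limits this op on the assumed hardware."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"

n = 4096
# Dense matmul C = A @ B (square, BF16 = 2 bytes/element):
# 2*n^3 FLOPs, three n*n matrices moved.
matmul = bound(2 * n**3, 3 * n * n * 2)
# Pointwise ReLU over one n*n tensor: 1 FLOP/element, read + write each element.
relu = bound(n * n, 2 * n * n * 2)

print(matmul)  # compute-bound
print(relu)    # memory-bound
```

Large matmuls reuse each byte hundreds of times, so they sit well above the ridge point; pointwise ops touch each byte roughly once, which is why they wait on memory.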

Current State-Of-The-Art Models

MosaicML already claims to be able to train GPT-3-quality models for less than $500,000 and Chinchilla-size and -quality models for ~$2,500,000. They offer this pricing to customers today.

This is significantly cheaper than most expect, which raises the question: how much further can we scale?

Last week, we discussed the components of machine learning, model utilization rates, and the software stack. The critical takeaway is that even with all the optimization techniques the ecosystem developed in 2022, Nvidia A100 GPUs have an upper bound of ~60% model/hardware FLOPS utilization rates for training large language models.

The research arm of SemiAnalysis has surveyed many startups and enterprises and arrived at ~$1.5 per SXM A100 GPU per hour as a baseline cost for large clusters of 256 GPUs with NVLink and 1.6T networking. Some companies have better deals with AWS, Azure, Oracle Cloud, CoreWeave, etc., but this is a baseline. The deals will also be much better if the company purchasing GPUs agrees to a three-year contract. For example, the list price at Azure is only $1.36 per hour, but committing for three years to the A100, a GPU released in 2020, is not something most want to do. On-premises will also be cheaper over multiple years if utilization rates are high, but that is very difficult for most enterprises and startups to commit to and achieve.

[Chart: parameter counts vs. training tokens for state-of-the-art models, with the Chinchilla scaling line]

The above chart shows publicized state-of-the-art models with their respective parameter counts and tokens (units of training data). The line is Google DeepMind’s Chinchilla scaling observation (smoothing out the large error bars). Each point on the line shows the theoretical FLOPS required to train a model with that parameter and token count. The FLOPS figure shown ignores any recompute of activations, checkpointing, etc.
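The theoretical FLOPS figure on that line follows the standard training-compute approximation C ≈ 6·N·D (roughly six FLOPs per parameter per token), which, like the chart, ignores activation recompute. A minimal sketch, using the published GPT-3 and Chinchilla figures:

```python
# Sketch: training-FLOPS approximation C ≈ 6 * N * D,
# where N = parameter count and D = training tokens.
# Ignores activation recompute, checkpointing, etc., as the chart does.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Published figures: GPT-3 (175B params, 300B tokens),
# Chinchilla (70B params, 1.4T tokens).
print(f"GPT-3:      {train_flops(175e9, 300e9):.2e}")   # 3.15e+23
print(f"Chinchilla: {train_flops(70e9, 1.4e12):.2e}")   # 5.88e+23
```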

[Table: theoretical optimal A100 training cost per model]

This table shows the theoretical optimal cost to train each model on Nvidia A100s. It does not account for the people required, ML Ops tools, data gathering/preprocessing, failure restoration, one-shot/few-shot learning examples, inference, etc. Many of these components are incredibly costly. In this context, MosaicML's prices of $450k for GPT-30B and $2.5M for GPT-70B are close to the optimal training costs of $326k and $1.75M. It should be noted that Mosaic's prices include many of those ML Ops tools, which significantly reduces the personnel required to train a model reliably.
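The optimal-cost arithmetic can be reproduced back-of-envelope from the inputs above: training FLOPS divided by delivered FLOPS per GPU, converted to GPU-hours at the ~$1.5 baseline rate. The peak-FLOPS figure and the assumed realized model-FLOPS utilization (MFU) here are this sketch's assumptions, not numbers stated in the article:

```python
# Sketch: back-of-envelope A100 training cost.
# Assumptions: 312 TFLOPS BF16 peak per A100, $1.5 per GPU-hour baseline,
# and a caller-supplied realized model-FLOPS utilization (MFU).
A100_PEAK_FLOPS = 312e12
DOLLARS_PER_GPU_HOUR = 1.5

def train_cost(params: float, tokens: float, mfu: float) -> float:
    flops = 6 * params * tokens                       # C ~= 6 * N * D
    gpu_seconds = flops / (A100_PEAK_FLOPS * mfu)
    return gpu_seconds / 3600 * DOLLARS_PER_GPU_HOUR

# Chinchilla (70B params, 1.4T tokens) at an assumed ~45% realized MFU
# lands near the table's $1.75M figure:
print(f"${train_cost(70e9, 1.4e12, 0.45):,.0f}")
```

Note that realized MFU sits well below the ~60% upper bound discussed earlier, which is why assumed utilization dominates these estimates.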

As we move down the table, the Google Pathways Language Model (PaLM) is the most advanced dense model that has been trained and publicly detailed. While we used Nvidia A100s as a baseline for the cost comparison, it should be noted that PaLM was trained on 6,144 of Google's in-house TPU v4 chips. Google achieved 46.2% model FLOPS utilization and 57.8% hardware FLOPS utilization. The compute cost to train PaLM is not prohibitive.