Questions

  1. What do the processes and infrastructure for training and deploying LLMs, and for operating LLM-based services, look like in detail?

Machine Learning Training Components

If we boil machine learning model training down to its simplest form, there are two major components of a model's training time.

  1. Compute (FLOPS): Running dense matrix multiplication within each layer

    In the past, the dominant factor in machine learning training time was compute: waiting on matrix multiplies. As Nvidia's GPUs continued to develop, this quickly faded as the primary concern.

    Nvidia's FLOPS have increased by multiple orders of magnitude, partly by riding Moore's Law, but primarily through architectural changes such as the tensor core and lower-precision floating-point formats. Memory, in contrast, has not followed the same path.

    [Chart: growth of Nvidia GPU FLOPS vs. memory bandwidth over time]

  2. Memory (bandwidth): Waiting for data or layer weights to reach the compute resources. Common examples of bandwidth-constrained operations are the various normalizations, pointwise operations, softmax, and ReLU.
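The compute-vs-bandwidth distinction above can be made concrete with a roofline-style check on arithmetic intensity (FLOPs per byte moved). The hardware figures below are illustrative A100 specs assumed for this sketch, not numbers from the article:

```python
# Sketch: classify an op as compute- or memory-bound via arithmetic intensity.
# Assumed A100 figures: ~312 TFLOPS BF16 tensor-core peak, ~2.0 TB/s HBM bandwidth.
PEAK_FLOPS = 312e12          # FLOP/s
PEAK_BANDWIDTH = 2.0e12      # bytes/s
# FLOPs per byte needed to saturate compute instead of memory (~156)
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH

def bound(flops, bytes_moved):
    """Return which resource limits this op on the assumed hardware."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= RIDGE_POINT else "memory-bound"

n = 4096
# Dense matmul C = A @ B (square, BF16 = 2 bytes/element):
# 2*n^3 FLOPs, three n*n matrices moved.
matmul = bound(2 * n**3, 3 * n * n * 2)
# Pointwise ReLU over one n*n tensor: 1 FLOP/element, read + write each element.
relu = bound(n * n, 2 * n * n * 2)

print(matmul)  # compute-bound
print(relu)    # memory-bound
```

Large matmuls reuse each byte hundreds of times, so they sit well above the ridge point; pointwise ops touch each byte roughly once, which is why they wait on memory.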

Current State-Of-The-Art Models

MosaicML already claims to be able to train GPT-3-quality models for less than $500,000 and Chinchilla-size and -quality models for ~$2,500,000. They offer this pricing to customers today.

This is significantly cheaper than most expect, which raises the question: how much further can we scale?

Last week, we discussed the components of machine learning, model utilization rates, and the software stack. The critical takeaway is that even with all the optimization techniques the ecosystem developed in 2022, Nvidia A100 GPUs have an upper bound of ~60% model/hardware FLOPS utilization rates for training large language models.

The research arm of SemiAnalysis has surveyed many startups and enterprises and arrived at ~$1.5 per SXM A100 GPU per hour as a baseline cost for large clusters of 256 GPUs with NVLink and 1.6T networking. Some companies have better deals with AWS, Azure, Oracle Cloud, CoreWeave, etc., but this is a baseline. The deals will also be much better if the company purchasing GPUs agrees to a three-year contract. For example, the list price at Azure is only $1.36 per hour, but committing for three years to the A100, a GPU released in 2020, is not something most want to do. On-premises will also be cheaper over multiple years if utilization rates are high, but that is very difficult for most enterprises and startups to commit to and achieve.

[Chart: parameter counts vs. training tokens for state-of-the-art models, with the Chinchilla scaling line]

The above chart shows publicized state-of-the-art models with their respective parameter counts and tokens (units of training data). The line is Google DeepMind’s Chinchilla scaling observation (smoothing out the large error bars). Each point on the line shows the theoretical FLOPS required to train a model with that parameter and token count. The FLOPS figure shown ignores any recompute of activations, checkpointing, etc.
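The theoretical FLOPS figure on that line follows the standard training-compute approximation C ≈ 6·N·D (roughly six FLOPs per parameter per token), which, like the chart, ignores activation recompute. A minimal sketch, using the published GPT-3 and Chinchilla figures:

```python
# Sketch: training-FLOPS approximation C ≈ 6 * N * D,
# where N = parameter count and D = training tokens.
# Ignores activation recompute, checkpointing, etc., as the chart does.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# Published figures: GPT-3 (175B params, 300B tokens),
# Chinchilla (70B params, 1.4T tokens).
print(f"GPT-3:      {train_flops(175e9, 300e9):.2e}")   # 3.15e+23
print(f"Chinchilla: {train_flops(70e9, 1.4e12):.2e}")   # 5.88e+23
```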

[Table: theoretical optimal A100 training cost per model]

This table shows the theoretical optimal cost to train each model on Nvidia A100s. It does not account for the people required, ML Ops tools, data gathering/preprocessing, failure restoration, one-shot/few-shot learning examples, inference, etc. Many of these components are incredibly costly. In this context, MosaicML's prices of $450k for GPT-30B and $2.5M for GPT-70B are close to the optimal training costs of $326k and $1.75M. It should be noted that Mosaic's prices include many of those ML Ops tools, which significantly reduces the personnel required to train a model reliably.
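The optimal-cost arithmetic can be reproduced back-of-envelope from the inputs above: training FLOPS divided by delivered FLOPS per GPU, converted to GPU-hours at the ~$1.5 baseline rate. The peak-FLOPS figure and the assumed realized model-FLOPS utilization (MFU) here are this sketch's assumptions, not numbers stated in the article:

```python
# Sketch: back-of-envelope A100 training cost.
# Assumptions: 312 TFLOPS BF16 peak per A100, $1.5 per GPU-hour baseline,
# and a caller-supplied realized model-FLOPS utilization (MFU).
A100_PEAK_FLOPS = 312e12
DOLLARS_PER_GPU_HOUR = 1.5

def train_cost(params: float, tokens: float, mfu: float) -> float:
    flops = 6 * params * tokens                       # C ~= 6 * N * D
    gpu_seconds = flops / (A100_PEAK_FLOPS * mfu)
    return gpu_seconds / 3600 * DOLLARS_PER_GPU_HOUR

# Chinchilla (70B params, 1.4T tokens) at an assumed ~45% realized MFU
# lands near the table's $1.75M figure:
print(f"${train_cost(70e9, 1.4e12, 0.45):,.0f}")
```

Note that realized MFU sits well below the ~60% upper bound discussed earlier, which is why assumed utilization dominates these estimates.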

As we move down the table, the Google Pathways Language Model (PaLM) is the most advanced dense model that has been trained and publicly detailed. While we used Nvidia A100s as a baseline for the cost comparison, it should be noted that PaLM was trained on 6,144 of Google's in-house TPU v4 chips. Google achieved 46.2% model FLOPS utilization and 57.8% hardware FLOPS utilization. The compute cost to train PaLM is not prohibitive.