Нейролента — a digest of news about neural networks and ChatGPT

Multi-Query Attention

OpenAI are using MQA just like everybody else.
Because of that, only a single KV head is needed instead of one per attention head, and the memory footprint of the KV cache can be significantly reduced. Even then, the 32k-seqlen GPT-4 definitely cannot run on 40GB A100s, and the 8k variant is capped on max batch size.
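
As a rough sketch of the mechanism (not OpenAI's code; the shapes below are arbitrary toy values), here is a minimal MQA attention in PyTorch, where every query head shares a single K/V head, which is exactly what shrinks the KV cache:

```python
import math
import torch

def multi_query_attention(q, k, v):
    # q: [batch, n_heads, seq_q, head_dim] -- one set of queries per head
    # k, v: [batch, seq_kv, head_dim]      -- a single K/V head shared by all query heads,
    #                                         so the KV cache stores 1 head instead of n_heads
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.einsum("bhqd,bkd->bhqk", q, k) * scale
    weights = scores.softmax(dim=-1)
    return torch.einsum("bhqk,bkd->bhqd", weights, v)

# Toy shapes just to show it runs; real dims would be model-specific.
out = multi_query_attention(torch.randn(1, 8, 16, 64),
                            torch.randn(1, 16, 64),
                            torch.randn(1, 16, 64))
print(out.shape)  # torch.Size([1, 8, 16, 64])
```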

Continuous batching

OpenAI implements both variable batch sizes and continuous batching. This lets them cap worst-case latency while still optimizing inference cost.
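
The toy scheduler loop below sketches the idea (purely illustrative, not OpenAI's serving stack; `decode_step` stands in for a real batched forward pass): finished sequences are evicted and new requests are admitted on every decoding step, instead of waiting for the whole batch to drain.

```python
from collections import deque

def decode_step(seq):
    # Stand-in for one forward pass on a single sequence; returns True when it finishes.
    seq["generated"] += 1
    return seq["generated"] >= seq["max_new_tokens"]

def continuous_batching(requests, max_batch_size=4):
    pending, active, finished = deque(requests), [], []
    while pending or active:
        # Admit new requests into any free batch slots (the "continuous" part).
        while pending and len(active) < max_batch_size:
            active.append(pending.popleft())
        # One decoding step for every active sequence (normally a single batched forward).
        still_running = []
        for seq in active:
            if decode_step(seq):
                finished.append(seq)      # evict finished sequence, freeing a slot
            else:
                still_running.append(seq)
        active = still_running
    return finished

if __name__ == "__main__":
    reqs = [{"id": i, "generated": 0, "max_new_tokens": n}
            for i, n in enumerate([3, 10, 5, 2, 8])]
    print([r["id"] for r in continuous_batching(reqs)])
```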

Vision Multi-Modal

It is a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens after the text-only pre-training.
OpenAI wanted to train the vision model from scratch, but the approach wasn't mature enough, so they de-risked it by starting from the text-only model.
One of the primary purposes of this vision capability is autonomous agents that can read web pages and transcribe what is in images and video.
Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: they sample frames and run Whisper on the audio to get transcripts.
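
As a hedged sketch of what a Flamingo-style layer looks like (illustrative dimensions, not the actual GPT-4 architecture), vision tokens from a separate encoder are injected into the text stream via cross-attention:

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Flamingo-style layer: text hidden states attend to vision-encoder tokens."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_h, vision_tokens):
        # text_h:        [batch, text_len, d_model]   from the language model
        # vision_tokens: [batch, n_patches, d_model]  from a separate vision encoder
        attended, _ = self.cross_attn(self.norm(text_h), vision_tokens, vision_tokens)
        return text_h + attended  # residual: vision information injected into the text stream

block = VisionCrossAttentionBlock()
out = block(torch.randn(2, 10, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```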

[Don't want to say "I told you so" but..]

Speculative Decoding

OpenAI might be using speculative decoding on GPT-4's inference (not 100% sure).

The idea is to use a smaller, faster draft model to decode several tokens in advance and then feed them into the large oracle model as a single batch.
If the draft model was right about its predictions, the larger model agrees and several tokens can be decoded in a single batch.
But if the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and decoding continues with the larger model alone.
The conspiracy theory that the new GPT-4's quality has deteriorated might simply be because they are letting the oracle model accept lower-probability sequences from the speculative decoding model.
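
A minimal greedy version of the scheme, under the assumption that `draft(ids)` and `oracle(ids)` return next-token logits of shape [seq, vocab] (the toy stand-ins below are illustrative, not OpenAI's implementation):

```python
import torch

def speculative_decode(oracle, draft, prompt, k=4, max_new_tokens=16):
    # Greedy speculative decoding sketch: the cheap draft model proposes k tokens,
    # the oracle scores the whole block in one pass and accepts the matching prefix.
    ids = list(prompt)
    while len(ids) - len(prompt) < max_new_tokens:
        proposal = list(ids)
        for _ in range(k):                                   # 1) draft proposes k tokens
            proposal.append(int(draft(proposal)[-1].argmax()))
        logits = oracle(proposal)                            # 2) oracle verifies them in one batch
        for i in range(k):
            pos = len(ids) + i
            oracle_tok = int(logits[pos - 1].argmax())
            if oracle_tok != proposal[pos]:
                ids = proposal[:pos] + [oracle_tok]          # reject the rest, keep oracle's token
                break
        else:
            ids = proposal                                   # oracle agreed with every draft token
    return ids

if __name__ == "__main__":
    torch.manual_seed(0)
    table = torch.randn(100, 100)                            # toy "model": a fixed logit table
    toy = lambda ids: table[torch.tensor(ids)]               # returns [seq, vocab] logits
    print(speculative_decode(toy, toy, prompt=[1, 2, 3], k=4, max_new_tokens=8))
```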

Inference Architecture

The inference runs on a cluster of 128 GPUs.

There are multiple of these clusters in multiple datacenters in different locations.

It is done with 8-way tensor parallelism and 16-way pipeline parallelism.

Each node of 8 GPUs holds only ~130B parameters, or…
The model has 120 layers, so it fits across 15 different nodes.
[Possibly there are fewer layers on the first node, since it also needs to compute the embeddings]
According to these numbers, OpenAI should have trained on 2x as many tokens if they were aiming for the Chinchilla optimum.
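
A quick back-of-the-envelope check of the layout described above (only the 1.8T, 120-layer, 8-way/16-way, and 15-node figures come from the text; the rest is plain arithmetic):

```python
# Figures quoted above; everything below is plain arithmetic on them.
total_params     = 1.8e12   # ~1.8T parameters
num_layers       = 120
tensor_parallel  = 8        # GPUs sharing each layer within a node
pipeline_stages  = 16       # 16-way pipeline parallelism
nodes_with_model = 15       # the layers fit across 15 nodes

print("GPUs per replica:", tensor_parallel * pipeline_stages)               # 128
print("Layers per node: ", num_layers / nodes_with_model)                   # 8.0
print("Params per node:  ~%.0fB" % (total_params / nodes_with_model / 1e9)) # ~120B, close to the ~130B above
```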

[let alone surpass it like we do]

This goes to show that they are struggling to get high-quality data.

Why no FSDP?

A possible reason for this could be that some of the hardware infra they secured is of an older generation.

This is pretty common at local compute clusters, as organisations usually upgrade the infra in several "waves" to avoid a complete pause of operations.