Leaked details of GPT-4 unveil its massive scale and impressive architecture

Leaked details about GPT-4 shed light on the model's scale, training dataset, and architecture, offering a glimpse into its capabilities and potential.

According to the leak, GPT-4 was trained on a dataset of approximately 13 trillion tokens. This training corpus includes both text and code. OpenAI's training process aimed to refine the model's performance and improve the quality of its outputs.

However, such ambitious training objectives come with significant costs and challenges. Training GPT-4 reportedly required approximately 25,000 A100 GPUs running for 90 to 100 days. Many failures occurred during the training process, requiring frequent restarts from checkpoints.

At an estimated rate of $1 per A100 GPU-hour, the cost of this training run alone would be about $63 million.
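The arithmetic behind this estimate can be checked with a few lines of Python. This is a rough sketch that assumes every GPU runs around the clock at the flat $1 rate; the function names are illustrative. Under those assumptions, 90 to 100 days works out to $54 million to $60 million, so the quoted $63 million corresponds to roughly 105 days of continuous use. The article does not explain the gap.

```python
# Back-of-the-envelope training cost from the figures quoted above.
GPUS = 25_000                 # A100 count quoted in the leak
RATE_USD_PER_GPU_HOUR = 1.0   # flat rate assumed in the article


def training_cost_usd(days, gpus=GPUS, rate=RATE_USD_PER_GPU_HOUR):
    """Cost of running `gpus` GPUs 24 hours a day for `days` days."""
    gpu_hours = gpus * 24 * days
    return gpu_hours * rate


def days_for_cost(cost_usd, gpus=GPUS, rate=RATE_USD_PER_GPU_HOUR):
    """Number of full days of continuous use that a given budget pays for."""
    return cost_usd / (gpus * 24 * rate)


low = training_cost_usd(90)                # lower bound of the quoted range
high = training_cost_usd(100)              # upper bound of the quoted range
implied_days = days_for_cost(63_000_000)   # days implied by the $63M figure

print(f"90 days:  ${low:,.0f}")
print(f"100 days: ${high:,.0f}")
print(f"$63M implies about {implied_days:.0f} days")
```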

A notable aspect of the GPT-4 architecture is its use of a mixture-of-experts (MoE) model. OpenAI reportedly chose 16 experts to strike a balance between achieving low loss and generalizing across tasks. A larger number of experts can make generalization and convergence harder, which is said to explain OpenAI's conservative choice of expert count.
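To make the idea concrete, here is a minimal sketch of how a mixture-of-experts layer routes an input to a few experts and blends their outputs. It is an illustration of the general technique, not GPT-4's actual implementation: the expert count of 16 comes from the leak, while the top-2 routing, the toy experts, and the random gate scores are assumptions made for the example. In a real model the gate scores come from a learned layer applied to each token.

```python
import math
import random

NUM_EXPERTS = 16  # expert count reported in the leak
TOP_K = 2         # assumption for illustration; not stated in the article


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, k=TOP_K):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, gate_logits, experts, k=TOP_K):
    """Weighted sum of the selected experts' outputs; the other experts are never run."""
    return sum(w * experts[i](x) for i, w in route(gate_logits, k))


# Toy experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

# Random gate scores stand in for the output of a learned gating layer.
rng = random.Random(0)
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(gate_logits)
y = moe_forward(1.0, gate_logits, experts)
print("selected experts and weights:", chosen)
print("layer output:", y)
```

Because only the selected experts run for a given input, the compute per token stays well below what the total parameter count would suggest, which is the usual motivation for this design.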

Another intriguing feature attributed to GPT-4 is speculative decoding. In this strategy, a smaller, faster draft model generates predictions for several tokens in advance. These predictions are then fed into a larger "oracle" model as a single batch, which checks them all at once and so decodes more efficiently than generating one token at a time. However, these claims have not been confirmed at this time.
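The following sketch shows the mechanics of speculative decoding in its simplest, greedy form. It is a toy under stated assumptions, not a description of what OpenAI runs: both "models" are hypothetical functions that map a token sequence to the next token, and the oracle's checks are written as a loop for readability, whereas a real system would score all drafted positions in one batched forward pass. Production systems that sample rather than decode greedily use a probabilistic acceptance rule instead of the exact-match test used here.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: the large model generates one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, keep the prefix the oracle agrees with."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The small draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) The oracle checks every drafted position.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # first mismatch: use the oracle's token
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched, so the oracle's next token comes for free.
            accepted.append(target_next(tokens + draft))

        tokens.extend(accepted)
    return tokens[:goal]


# Toy models (hypothetical): the oracle counts upward modulo 10;
# the draft model agrees except that it guesses wrong after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
fast = speculative_decode(target_next, draft_next, [0], 12, k=3)
print(fast == baseline)
```

The output always matches what the oracle would have produced on its own; the potential saving is that each oracle pass can confirm several tokens at once whenever the draft model guesses correctly.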

Rumors and speculation about the origin of the training data have emerged amid the leak. Some point to content from popular platforms such as Twitter, Reddit, and YouTube, which would mean user-generated content helped shape GPT-4's knowledge base. In addition, extensive collections such as LibGen and Sci-Hub, as well as the entirety of GitHub, have been named as potential sources of training data.

While rumors continue to circulate, it is crucial to approach them with caution. One widely held view is that GPT-4 may have benefited from a special, hand-collected dataset of college textbooks. Such a dataset would provide a structured and comprehensive knowledge base, possibly contributing to the model's broad understanding across a variety of fields.
