GPT-4 Technical Report

OpenAI∗

Abstract
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.
For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.
This contrasts with GPT-3.5, which scores in the bottom 10%.
On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models
and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).
On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering
57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but
also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4
surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these
model capability results, as well as model safety improvements and results, in more detail in later
sections.
This report also discusses a key challenge of the project, developing deep learning infrastructure and
optimization methods that behave predictably across a wide range of scales. This allowed us to make
predictions about the expected performance of GPT-4 (based on small runs trained in similar ways)
that were tested against the final run to increase confidence in our training.
Despite its capabilities, GPT-4 has similar limitations to earlier GPT models [1, 37, 38]: it is not fully
reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn
from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.

∗Please cite this work as “OpenAI (2023)”. Full authorship contribution statements appear at the end of the document. Correspondence regarding this technical report can be sent to gpt4-report@openai.com.
GPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe
careful study of these challenges is an important area of research given the potential societal impact.
This report includes an extensive system card (after the Appendix) describing some of the risks we
foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.
It also describes interventions we made to mitigate potential harms from the deployment of GPT-4,
including adversarial testing with domain experts, and a model-assisted safety pipeline.
2 Scope and Limitations of this Technical Report
This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a
Transformer-style model [39] pre-trained to predict the next token in a document, using both publicly
available data (such as internet data) and data licensed from third-party providers. The model was
then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [40]. Given both
the competitive landscape and the safety implications of large-scale models like GPT-4, this report
contains no further details about the architecture (including model size), hardware, training compute,
dataset construction, training method, or similar.
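As a concrete illustration of the pre-training objective named above, the following sketch is purely illustrative (not OpenAI's training code; the array shapes, names, and toy data are invented for the example). It computes the next-token prediction loss as the average cross-entropy between each position's predicted distribution and the token that actually follows.

```python
# Minimal sketch of the next-token prediction objective, assuming toy inputs:
# `logits` holds one predicted distribution over the vocabulary per position,
# and `tokens` is the document's token-id sequence.
import numpy as np

def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """logits: (seq_len, vocab_size); tokens: (seq_len + 1,) token ids.
    Position t is scored against tokens[t + 1], the token that follows it."""
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]  # each position predicts the next token
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# toy usage: random "predictions" for a 5-token document and a vocabulary of 50
rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(4, 50)), rng.integers(0, 50, size=5))
print(f"average next-token cross-entropy: {loss:.3f}")
```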
We are committed to independent auditing of our technologies, and shared some initial steps and
ideas in this area in the system card accompanying this release.² We plan to make further technical
details available to additional third parties who can advise us on how to weigh the competitive and
safety considerations above against the scientific value of further transparency.
3 Predictable Scaling
A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The
primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive
model-specific tuning. To address this, we developed infrastructure and optimization methods that
have very predictable behavior across multiple scales. These improvements allowed us to reliably
predict some aspects of the performance of GPT-4 from smaller models trained using 1,000× to
10,000× less compute.
3.1 Loss Prediction
The final loss of properly-trained large language models is thought to be well approximated by power
laws in the amount of compute used to train the model [41, 42, 2, 14, 15].
To verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our
internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term
(as in Henighan et al. [15]): L(C) = aC^b + c, from models trained using the same methodology
but using at most 10,000× less compute than GPT-4. This prediction was made shortly after the run
started, without use of any partial results. The fitted scaling law predicted GPT-4’s final loss with
high accuracy (Figure 1).
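To make the fitting procedure concrete, here is an illustrative sketch of fitting a scaling law of this form. The compute and loss values are invented, and the use of scipy's curve_fit is an assumption for the example rather than a description of the machinery actually used; only the functional form L(C) = aC^b + c comes from the text above.

```python
# Illustrative sketch only: fit L(C) = a * C**b + c (with irreducible-loss term c)
# to hypothetical (compute, loss) pairs from smaller runs, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, c):
    return a * np.power(C, b) + c

# hypothetical data: compute normalized so the largest run is C = 1.0
compute = np.array([1e-5, 1e-4, 1e-3, 1e-2])
loss    = np.array([4.2, 3.3, 2.7, 2.3])

# constrain b < 0 so loss decreases with compute, and a, c >= 0
(a, b, c), _ = curve_fit(
    scaling_law, compute, loss,
    p0=(1.0, -0.1, 1.5),
    bounds=([0.0, -1.0, 0.0], [np.inf, 0.0, np.inf]),
)

# extrapolate the fitted law to the full-compute run (C = 1.0)
predicted = scaling_law(1.0, a, b, c)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}; predicted final loss L(1.0)={predicted:.3f}")
```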
3.2 Scaling of Capabilities on HumanEval
Having a sense of the capabilities of a model before training can improve decisions around alignment,
safety, and deployment. In addition to predicting final loss, we developed methodology to predict
more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset [43],
which measures the ability to synthesize Python functions of varying complexity. We successfully
predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained
with at most 1,000× less compute (Figure 2).
For an individual problem in HumanEval, performance may occasionally worsen with scale. Despite
these challenges, we find an approximate power law relationship −E_P[log(pass_rate(C))] = α · C^(−k),
where k and α are positive constants, and P is a subset of problems in the dataset.
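The sketch below illustrates how such a capability scaling law could be fitted and extrapolated. The per-problem pass rates and compute values are made up, and the use of scipy's curve_fit is an assumption for the example, not a statement about the method actually used; only the functional form of the relationship comes from the text above.

```python
# Illustrative sketch only: fit -E_P[log(pass_rate(C))] = alpha * C**(-k)
# to hypothetical per-problem HumanEval pass rates from smaller runs.
import numpy as np
from scipy.optimize import curve_fit

def mean_neg_log_pass_rate(C, alpha, k):
    return alpha * np.power(C, -k)

# hypothetical data: rows = training runs (normalized compute), columns = problems
compute = np.array([1e-3, 1e-2, 1e-1])
pass_rates = np.array([
    [0.02, 0.10, 0.30],   # smallest run
    [0.05, 0.25, 0.55],
    [0.15, 0.45, 0.75],   # largest of the small runs
])

# the observed metric: mean over the problem subset P of -log(pass_rate)
y = -np.log(pass_rates).mean(axis=1)

(alpha, k), _ = curve_fit(mean_neg_log_pass_rate, compute, y, p0=(1.0, 0.3))

# extrapolate the metric to the full-compute run (C = 1.0)
y_full = mean_neg_log_pass_rate(1.0, alpha, k)
print(f"alpha={alpha:.3f}, k={k:.3f}; predicted -E_P[log pass_rate] at C=1: {y_full:.3f}")
```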
²In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social
and economic implications of AI systems, including the need for effective regulation.