OpenAI is introducing GPT-4, the smartest AI Assistant that can solve difficult problems with greater accuracy
GPT-4 is OpenAI’s most advanced AI Assistant System. GPT-4 can solve difficult problems with greater accuracy than ChatGPT, thanks to its broader general knowledge and problem-solving abilities. GPT-4 is more creative and collaborative than ever before. We’re excited to see how people use GPT-4 as we work towards developing technologies that empower everyone.
GPT-4 can generate, edit, and iterate with users on creative and technical writing tasks, such as composing songs, writing screenplays, or learning a user’s writing style.
GPT-4 surpasses ChatGPT in advanced reasoning capabilities, scoring in higher approximate percentiles among test takers on exams designed for humans.
Safety & alignment
Training with human feedback
We incorporated more human feedback, including feedback submitted by ChatGPT users, to improve GPT-4’s behavior. We also worked with over 50 experts for early feedback in domains including AI safety and security.
Continuous improvement from real-world use
We’ve applied lessons from real-world use of our previous models into GPT-4’s safety research and monitoring system. Like ChatGPT, we’ll be updating and improving GPT-4 at a regular cadence as more people use it.
GPT-4-assisted safety research
GPT-4’s advanced reasoning and instruction-following capabilities expedited our safety work. We used GPT-4 to help create training data for model fine-tuning and iterate on classifiers across training, evaluations, and monitoring.
We’ve created GPT-4, the latest milestone in OpenAI’s effort to scale up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.
Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance—something we view as critical for safety.
We are releasing GPT-4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we’re collaborating closely with a single partner to start. We’re also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.
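For developers granted access through the waitlist, querying GPT-4 over the API works much like calling earlier chat models. The snippet below is a minimal sketch using the OpenAI Python library’s chat completions endpoint; the model identifier "gpt-4", the prompt text, and the library version assumed are illustrative rather than definitive, and the exact names available depend on your account and installed client.

```python
# Minimal sketch: sending a text prompt to GPT-4 via the API.
# Assumes the OpenAI Python library (v1+) is installed and OPENAI_API_KEY is set;
# the model name "gpt-4" is illustrative and depends on what your account exposes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain why GPT-4 is described as multimodal."},
    ],
)

# The assistant's reply is the message content of the first returned choice.
print(response.choices[0].message.content)
```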
Capabilities
In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.
To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly-available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022–2023 editions of practice exams. We did no specific training for these exams.
A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative; for full details, see the GPT-4 technical report (available as a PDF).