Hello Sora. Sora is OpenAI's new AI model that can create realistic and imaginative scenes

OpenAI - Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.

Techatty

Feb 19, 2024 - 14:03

Feb 18, 2024 - 15:08

We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.

Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.

Sora is a diffusion model: it generates a video by starting with one that looks like static noise and gradually transforming it, removing the noise over many steps.
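As a rough illustration of "removing the noise over many steps", the toy Python loop below starts from random noise and moves a little closer to a predicted clean signal at each step. This is only a sketch under stated assumptions, not Sora's sampler: `toy_denoiser` is a hypothetical stand-in for a trained network (it simply returns a known target), and the step-size rule is chosen for simplicity rather than taken from OpenAI.

```python
import random


def toy_denoiser(noisy, target):
    # Hypothetical stand-in for a trained network. A real model would
    # estimate the clean signal from `noisy`; this oracle just returns it.
    return list(target)


def sample(target, steps=50, seed=0):
    """Start from Gaussian noise and denoise it over `steps` iterations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # looks like static noise
    errors = [max(abs(xi - ti) for xi, ti in zip(x, target))]
    for i in range(steps):
        predicted_clean = toy_denoiser(x, target)
        # Assumed schedule: move a growing fraction of the remaining
        # distance, so the final step lands on the prediction.
        step_size = 1.0 / (steps - i)
        x = [xi + step_size * (pi - xi) for xi, pi in zip(x, predicted_clean)]
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors


if __name__ == "__main__":
    frame = [0.0, 0.5, 1.0, 0.25]  # a tiny stand-in for video data
    result, errors = sample(frame, steps=20, seed=1)
    print(errors[0], errors[-1])
```

The recorded `errors` list shrinks from step to step, which is the point of the illustration: each pass removes part of the remaining noise.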

Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.

Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.

We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.

Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.

In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report.

Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.

Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,¹,²,³ generative adversarial networks,⁴,⁵,⁶,⁷ autoregressive transformers,⁸,⁹ and diffusion models.¹⁰,¹¹,¹² These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models, which acquire generalist capabilities by training on internet-scale data.¹³,¹⁴ The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text: code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.¹⁵,¹⁶,¹⁷,¹⁸ We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,¹⁹ and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data.²⁰ This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
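The report gives no details of the compression network, so the following Python sketch only illustrates the shape of the idea: an encoder that shrinks a video in both time and space, and a decoder that maps back to pixel-space dimensions. Block averaging and nearest-neighbour upsampling are swapped in here for the learned networks, and the function names and the factors `ft` and `fs` are hypothetical.

```python
def encode(video, ft, fs):
    """Average-pool a [T][H][W] video by `ft` in time and `fs` in space.

    Stand-in for a learned encoder; assumes dimensions divide evenly.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % ft == 0 and H % fs == 0 and W % fs == 0
    n = ft * fs * fs  # number of values averaged into one latent cell
    return [[[sum(video[t][h][w]
                  for t in range(t0, t0 + ft)
                  for h in range(h0, h0 + fs)
                  for w in range(w0, w0 + fs)) / n
              for w0 in range(0, W, fs)]
             for h0 in range(0, H, fs)]
            for t0 in range(0, T, ft)]


def decode(latent, ft, fs):
    """Nearest-neighbour upsample a latent back to pixel-space size (lossy)."""
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    return [[[latent[t // ft][h // fs][w // fs]
              for w in range(W * fs)]
             for h in range(H * fs)]
            for t in range(T * ft)]


if __name__ == "__main__":
    video = [[[float(t + h + w) for w in range(8)] for h in range(8)]
             for t in range(4)]
    latent = encode(video, ft=2, fs=4)
    print(len(latent), len(latent[0]), len(latent[0][0]))
```

In this toy setting a 4-frame 8x8 video becomes a 2x2x2 latent; generation would then happen on the small latent, with the decoder used only at the end.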

Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly initialized patches in an appropriately sized grid.
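A minimal Python sketch of spacetime patch extraction follows. The report does not give patch sizes or the exact layout, so the patch dimensions `pt`, `ph`, `pw`, the single-channel `[T][H][W]` latent, and both function names are assumptions made for illustration.

```python
import random


def patchify(latent, pt, ph, pw):
    """Split a [T][H][W] latent into flattened spacetime patches.

    Each patch covers `pt` frames and a `ph` x `pw` spatial window.
    The returned list plays the role of a token sequence.
    """
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    patches = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                patches.append([latent[t][h][w]
                                for t in range(t0, t0 + pt)
                                for h in range(h0, h0 + ph)
                                for w in range(w0, w0 + pw)])
    return patches


def init_noise_patches(grid_t, grid_h, grid_w, patch_len, seed=0):
    """Randomly initialized patches for a grid of the requested size."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(patch_len)]
            for _ in range(grid_t * grid_h * grid_w)]


if __name__ == "__main__":
    latent = [[[t * 100 + h * 10 + w for w in range(4)] for h in range(4)]
              for t in range(4)]
    print(len(patchify(latent, 2, 2, 2)))        # video: 8 patches
    print(len(patchify([latent[0]], 1, 2, 2)))   # image: 4 patches
    print(len(init_noise_patches(2, 3, 5, 8)))   # 30 noise patches
```

The same `patchify` call handles an image by treating it as a one-frame video, and changing the grid passed to `init_noise_patches` changes the duration and aspect ratio of what would be generated.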

Scaling transformers for video generation

Sora is a diffusion model²¹,²²,²³,²⁴,²⁵; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.²⁶ Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,¹³,¹⁴ computer vision,¹⁵,¹⁶,¹⁷,¹⁸ and image generation.²⁷,²⁸,²⁹
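The training setup described above (noisy patches in, clean patches predicted) can be sketched in a few lines of Python. The report does not specify the noise schedule or the loss, so the variance-preserving mix in `add_noise`, the mean squared error, and the `model(noisy, noise_level, text)` call signature are all assumptions for illustration.

```python
import math
import random


def add_noise(clean, noise_level, rng):
    """Mix a clean patch with Gaussian noise.

    noise_level is in [0, 1]: 0 returns the clean patch, 1 is pure noise.
    (Assumed schedule; not taken from the report.)
    """
    keep = math.sqrt(1.0 - noise_level)
    scale = math.sqrt(noise_level)
    return [keep * c + scale * rng.gauss(0.0, 1.0) for c in clean]


def mse(pred, target):
    """Mean squared error between two equal-length patches."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(model, clean, text, rng):
    """One training example: corrupt the patch, ask the model for the clean one."""
    noise_level = rng.random()
    noisy = add_noise(clean, noise_level, rng)
    predicted = model(noisy, noise_level, text)  # hypothetical model interface
    return mse(predicted, clean)


if __name__ == "__main__":
    rng = random.Random(0)
    clean_patch = [0.2, -0.1, 0.7, 0.0]

    def lazy_model(noisy, noise_level, text):
        return noisy  # does no denoising at all

    print(training_loss(lazy_model, clean_patch, "a cat on a skateboard", rng))
```

A model that returns its noisy input unchanged gets a positive loss, while one that recovers the clean patch exactly gets zero; training pushes the network toward the latter.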

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.


