21 min read

"Road to Sora" Paper Reading List

"Road to Sora" Paper Reading List

This post is an effort to put together a reading list for our Friday paper club called ArXiv Dives. Since there has not been an official paper released yet for Sora, the goal is follow the bread crumbs from OpenAI's technical report on Sora. We plan on going over a few of the fundamental papers in the coming weeks during our Friday paper club, to help paint a better picture of what is going on behind the curtain of Sora.

If you'd like to join us live this Friday as we dive into this series, feel free to join below 👇 the more the merrier!

Arxiv Dives with Oxen.AI - SORA · Zoom · Luma
Hey Nerd, join the Herd!... for a little book/paper review. WHAT TO EXPECT Each week we pick a topic to cover in depth and have open Q/A and discussion. Reading optional 🙃. We’ll alternate…

What is Sora?

Sora has taken the Generative AI space by storm with it's ability to generate high fidelity videos from natural language prompts. If you haven't seen an example yet, here's a generated video of a turtle swimming in a coral reef for your enjoyment.


While the team at OpenAI has not released an official research paper on the technical details of the model itself, they did release a technical report that covers some high level details of the techniques they used and some qualitative results.

Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

Sora Architecture Overview

After reading the papers below, the architecture here should start to make sense. The technical report is a 10,000 foot view and my hope is that each paper will zoom into different aspects and paint the full picture. There is a nice literature review called "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models" that gives a high level diagram of a reverse engineered architecture.

The team at OpenAI states that Sora is a "Diffusion Transformer" which combines many of the concepts listed in the papers above, but applied applied to latent spacetime patches generated from video.

This is a combination of the style of patches used in the Vision Transformer (ViT) paper, with latent spaces similar to the Latent Diffusion Paper, but combined in the style of the Diffusion Transformer. They not only have patches in width and height of the image but extend it to the time dimension of video.

It's hard to say how exactly they collected the training data for all of this, but it seems like a combination of the techniques in the Dalle-3 paper as well as using GPT-4 to elaborate on textual descriptions of images, that they then turn into videos. Training data is likely the main secret sauce here, hence has the least level of detail in the technical report.

Use Cases

There are many interesting use cases and applications for video generation technologies like Sora. Whether it be movies, education, gaming, healthcare or robotics, there is no doubt generating realistic videos from natural language prompts is going to shake up multiple industries.

The note at the bottom of this diagram rings true for us at Oxen.ai. If you are not familiar with Oxen.ai we are building open source tools to help you collaborate on and evaluate data the comes in and out of machine learning models. We believe that many people need visibility into this data, and that it should be a collaborative effort. AI is touching many different fields and industries and the more eyes on the data that trains and evaluates these models, the better.

Check us out here: https://oxen.ai

Paper Reading List

There are many papers linked in the references section of the OpenAI technical report but it is a bit hard to know which ones to read first or are important background knowledge. We've sifted through them and selected what we think are the most impactful and interesting ones to read, and organized them by type.

Background Papers

The quality of generated images and video have been steadily increasing since 2015. The biggest gains that caught the general public's eyes began in 2022 with Midjourney, Stable Diffusion and Dalle. This section contains some foundational papers and model architectures that are referenced over and over again in the literature. While not all papers are directly involved in the Sora architecture, they are all important context for how the state of the art has improved over time.

Source: https://arxiv.org/abs/2402.17177

We have covered many of the papers below in previous ArXiv Dives if you want to catch up, all the notes are on the Oxen.ai blog.

Arxiv Dives | Oxen.ai
Manage your machine learning datasets with Oxen AI.


"U-Net: Convolutional Networks for Biomedical Image Segmentation" is a great example of a paper that was used for a task in one domain (Biomedical imaging) that got applied across many different use cases. Most notably is the backbone many diffusion models such as Stable Diffusion to facilitate learning to predict and mitigate noise at each step. While not directly used in the Sora architecture, important background knowledge for previous state of the art.

U-Net: Convolutional Networks for Biomedical Image Segmentation
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

Language Transformers

"Attention Is All You Need" is another paper that proved itself on a Machine Translation task, but ended up being a seminal paper for all of natural language processing research. Transformers are now the backbone of many LLM applications such as ChatGPT. Transformers end up being extensible to many modalities and are used as a component of the Sora architecture.

Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Vision Transformer (ViT)

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was one of the first papers to apply Transformers to image recognition, proving that they can outperform ResNets and other Convolutional Neural Networks if you train them on large enough datasets. This takes the architecture from the "Attention Is All You Need" paper and makes it work for computer vision tasks. Instead of the inputs being text tokens, ViT uses 16x16 image patches as input.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Latent Diffusion Models

"High-Resolution Image Synthesis with Latent Diffusion Models" is the technique behind many image generation models such as Stable Diffusion. They show how you can reformulate the image generation as a sequence of denoising auto-encoders from a latent representation. They use the U-Net architecture referenced above as the backbone of the generative process. These models can generate photo-realistic images given any text input.

High-Resolution Image Synthesis with Latent Diffusion Models
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .


"Learning Transferable Visual Models From Natural Language Supervision" often referred to as Contrastive Language-Image Pre-training (CLIP) is a technique for embedding text data and image data into the same latent space as each other. This technique helps connect the language understanding half of generative models to the visual understanding half by making sure that the cosine similarity between the text and image representations are high between text and image pairs.

Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.


According to the technical report, they reduce the dimensionality of the raw video with a Vector Quantised Variational Auto Encoder (VQ-VAE). VAEs have been shown to be a powerful unsupervised pre-training method to learn latent representations.

Neural Discrete Representation Learning
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse” -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The Sora technical report talks about how they take in videos of any aspect ratio, and how this allows them to train on a much larger set of data. The more data they can feed the model without having to crop it, the better results they get. This paper uses the same technique but for images, and Sora extends it for video.

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

Diffusion Transformer (DiT)

The team at OpenAI states that Sora is a "Diffusion Transformer" which combines many of the concepts listed in the papers above. The DiT replaces the U-Net backbone in latent diffusion models, with a transformer that operates on latent patches. This paper generates state of the art high quality images by using a more efficient model than a U-Net, therefore increasing the amount of data and compute that could be used to train these models.

Scalable Diffusion Models with Transformers
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

Video Generation Papers

They reference a few video generation papers that inspired Sora and take the generative models above to the next level by applying them to video.

ViViT: A Video Vision Transformer

This paper goes into details about how you can chop the video into "spatio-temporal tokens" needed for video tasks. The paper focuses on video classification, but the same tokenization can be applied to generating video.

ViViT: A Video Vision Transformer
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

Imagen Video: High Definition Video Generation with Diffusion Models

Imagen is a text-conditional video generation system based on a cascade of video diffusion models. They use convolutions in the temporal direction and super resolution to generate high quality videos from text.

Imagen Video: High Definition Video Generation with Diffusion Models
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

This paper takes the latent diffusion models from the image generation papers above and introduces a temporal dimension to the latent space. They apply some interesting techniques in the temporal dimension by aligning the latent spaces, but does not quite have the temporal consistency of Sora yet.

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

Photorealistic video generation with diffusion models

They introduce W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. This feels like the closest technique to Sora in the reference list as far as I can tell, and was released in December of 2023 by the teams at Google, Stanford and Georgia Tech.

Photorealistic Video Generation with Diffusion Models
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.

Vision-Language Understanding

In order to Generate Videos from text prompts, they need to collect a large dataset. It is not feasible to have humans label that many videos, so it seems they use some synthetic data techniques similar to those described in the DALL·E 3 paper.


Training text-to-video generation systems requires a large amount of videos with corresponding text captions. They apply the re-captioning technique introduced in DALL·E 3 to videos. Similar to DALL·E 3, they also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model.

DALL·E 3 understands significantly more nuance and detail than our previous systems, allowing you to easily translate your ideas into exceptionally accurate images.


In order for the model to be able to follow user instructions, they likely did some instruction fine-tuning similar to the Llava paper. This paper also shows some synthetic data techniques to create a large instruction dataset that could be interesting in combination with the Dalle methods above.

Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

Make-A-Video & Tune-A-Video

Papers like Make-A-Video and Tune-A-Video have shown how prompt engineering leverages model’s natural language understanding ability to decode complex instructions and render them into cohesive, lively, and high-quality video narratives. For example: taking a simple user prompt and extending it with adjectives and verbs to more fully flush out the scene.

Make-A-Video: Text-to-Video Generation without Text-Video Data
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting$\unicode{x2014}$One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.


We hope this gives you a jumping off point for all the important components that could make up a system like Sora! If you think we missed anything, feel free to email us at hello@oxen.ai.

It is by no means a light set of reading. This is why on Fridays we take one paper at a time, slow down, and break down the topics in plain speak so anyone can understand. We believe anyone can contribute to building AI systems, and the more you understand the fundamentals, the more patterns you will spot, and better products you will build.

Community Resources | Oxen.ai
Manage your machine learning datasets with Oxen AI.

Join us on a learning journey either by signing up for ArXiv Dives or simply joining the Oxen.ai Discord community.

Join the oxen Discord Server!
Check out the oxen community on Discord - hang out with 700 other members and enjoy free voice and text chat.

Best & Moo the Oxen.ai herd.