Are We a Simulation?

Feb 19, 2024

OpenAI's Release of Its Text-to-Video AI — Sora — Has Us Pondering a Long-Lived Silicon Valley Theory...


Last March, a new text-to-video software was released into the wild on a popular artificial intelligence hub known as Hugging Face.

The software, ModelScope, was unique in that it was the first major open-source software released for text-to-video generation. At 1.7 billion parameters, it also stood out for the sheer size of the model.

It was also unique in that it used an approach to video generation known as a diffusion model… 

After training, the software could ingest a text prompt, generate a string of “noisy” low-resolution images, refine those images, and ultimately produce a few seconds of video that roughly represented the text prompt.
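To make that pipeline a little more concrete, here's a minimal Python sketch of the diffusion loop described above. It's purely my own toy illustration, not ModelScope's actual code: denoise_step is a hypothetical stand-in for the trained network that strips away a bit of noise at each step, guided by the text prompt.

```python
import numpy as np

# Toy illustration of reverse diffusion for video (not ModelScope's real code).
def denoise_step(frames, prompt_embedding, step):
    """Hypothetical stand-in: a real trained network would predict and subtract
    a little of the noise at each step, conditioned on the text prompt."""
    return frames * 0.9  # placeholder for the learned refinement

rng = np.random.default_rng(0)
prompt_embedding = rng.normal(size=128)      # pretend encoding of the text prompt
frames = rng.normal(size=(16, 64, 64, 3))    # start from pure noise: 16 low-res RGB frames

# Repeatedly refine the noisy frames until a coherent clip emerges.
for step in reversed(range(50)):
    frames = denoise_step(frames, prompt_embedding, step)

print(frames.shape)  # (16, 64, 64, 3) -- a short, low-resolution video
```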

Within hours of the software being released, one of the product designers at Hugging Face posted an AI-generated Star Wars clip that took the community by storm.

Not Quite Right

Shown below, the short video clip was reminiscent of Darth Vader and Obi-Wan somewhere on the planet of Tatooine. The video was jerky and jittery, yet at moments seemingly realistic.

Source: @victormustar on X

As odd and unappealing as the video may look to us today, it was a remarkable step forward at the time. For AI software to be able to create 10 seconds of video of that quality after training on a large database of video inputs was extraordinary.

Naturally, the AI community went wild experimenting with the software after seeing what was already possible.

What came next was odd to say the least.

Below is an example of actress Scarlett Johansson eating. For some it can be hard to watch because it is so disconcerting.

March 2023 Example of Distorted Text-to-Video AI

There were many examples similar to the one shown above. They were described as creepy, unnerving, and even cursed.

The contortions of the mouth, hands, fingers, and face were nowhere near normal. Some might even remind us of a horror movie, so much so that watching can leave us feeling queasy.

And yet, it was a breakthrough.

For those in the industry, it represented an exciting path forward using diffusion models for text-to-video generative AI.

I knew back then that it would only be a matter of time and money before the technology radically improved.

The 3 Challenges

I fully expected the breakthrough to come from one of the major players that had the financial resources to train a more advanced model.

After all, there were three major challenges to overcome in order to improve the technology:

  • There were no high-quality datasets available for training an AI to do text-to-video generation. If we look closely at the Star Wars and Johansson clips, we’ll see a watermark: Shutterstock. That tells us the ModelScope AI was trained on free video clips pulled off the internet.
  • Aside from the absence of “clean,” watermark-free video, there was a lack of video with high-quality annotations describing each individual clip. Those annotations are critical to improving the overall performance of a text-to-video AI.
  • And assuming the above two were available, it would be extremely expensive to train a massive general diffusion model to create longer-form, realistic video simply from a text prompt.

It was clear though that these problems were solvable. Which is why the breakthrough from OpenAI over the last few days came as no surprise.

But that didn’t make it any less exciting.

SORA

OpenAI announced the release of its text-to-video AI — Sora — capable of generating a minute of high-resolution video from just about any text prompt imaginable. You’ll have to see it below to believe it. My jaw dropped the first time I saw the capabilities of Sora.

Source: OpenAI

I hope you’ll agree it’s a stunning video clip. It’s worth seeing in full resolution here on the OpenAI/Sora homepage. (Please scroll down to the Art Gallery video, then swipe to video No. 2 in that section.)

On first viewing, it feels incredibly real, almost as if the footage were shot from a drone flying over a cozy, store-lined sidewalk bordered by a street and a river.

For reference, the text prompt that was used to create the video is below:

Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous Sakura petals are flying through the wind along with snowflakes.

Having lived in Tokyo for 20 years, I was instantly reminded of that fantastic city by the clip, and yet… something immediately felt off to me.

I didn’t recognize the location. After all, it wasn’t real. And I immediately noticed that the kanji and katakana on the signs were nothing but gibberish.

And the more we view the clip, the more small artifacts we start to see. We might notice that the cherry blossoms appear to be floating without any tree limbs, or that the proportions of the people walking relative to the sidewalk and the stores feel off.

We can clearly see that Sora is a work in progress, and yet it is also a massive leap compared to a year ago.

OpenAI built upon the underlying technology that it uses for ChatGPT, large language models (LLMs). Rather than breaking up text/language into tokens for the purposes of training, OpenAI broke up video into smaller blocks referred to as patches.

I kind of like the patches terminology. It immediately reminded me of a patchwork quilt.

Source: Bayhill Studio

I know you might be wondering how quilts are related to artificial intelligence. Just imagine a pile of patchwork quilts, each quilt representing a frame of a video. And each frame is broken up into blocks (patches), each of which represents a bunch of pixels.

OpenAI’s Sora breaks a training video down into patches that are interrelated both in space and time. I’m oversimplifying, but doing so allows the model to understand how these patches are related to each other.
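To show what "patches" might look like in practice, here's a small Python sketch that carves a toy video into non-overlapping spacetime blocks. The patch sizes and the NumPy approach are my own assumptions for illustration; OpenAI hasn't published Sora's code.

```python
import numpy as np

# Toy video: 8 frames of 64x64 RGB pixels -> shape (time, height, width, channels)
video = np.random.rand(8, 64, 64, 3)

# Hypothetical patch size: 2 frames deep, 16x16 pixels across.
pt, ph, pw = 2, 16, 16
T, H, W, C = video.shape

# Carve the video into non-overlapping spacetime blocks, then flatten each
# block into a single vector -- the rough analogue of a text token.
patches = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)   # group the block indices together
         .reshape(-1, pt * ph * pw * C)    # one row per spacetime patch
)

print(patches.shape)  # (64, 1536): 4 x 4 x 4 blocks, each flattened to 2*16*16*3 values
```

Each of those rows plays roughly the role for video that a token plays for text: it's the unit the model learns relationships over, across both space and time.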

OpenAI spent a lot of time and money annotating a massive amount of video, which became the foundation of its training set. Then it used another form of AI to “watch” videos and annotate them in detail. This allowed OpenAI to radically expand the size of its video training set.
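As a rough sketch of that re-captioning step (again, my own illustration, with a hypothetical caption_model standing in for the AI that "watches" the clips, and an assumed raw_clips folder), the pipeline boils down to generating a detailed description for every raw clip and saving the pairs as training data:

```python
import json
from pathlib import Path

def caption_model(video_path: Path) -> str:
    """Hypothetical stand-in for a video-captioning AI that watches a clip
    and writes a detailed description of what happens in it."""
    return f"A detailed description of what happens in {video_path.name}"

# Pair every raw clip with an auto-generated caption.
dataset = [
    {"video": str(clip), "caption": caption_model(clip)}
    for clip in sorted(Path("raw_clips").glob("*.mp4"))
]

# These (video, caption) pairs become the expanded training set for the
# text-to-video model.
Path("annotations.json").write_text(json.dumps(dataset, indent=2))
```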

Then it threw raw computing power at the problem to improve performance. To see the difference that additional computational resources have on training, let’s have a look at the below example.

Example of How Training Improves with Increased Computing Power

Source: OpenAI

Simply increasing the compute resources by 4X results in a video that looks fuzzy but lifelike (the video in the middle). And increasing the compute resources by 32X produces a lifelike, high-resolution video that looks like it was shot on a smartphone (the video on the right).

It’s absolutely nuts.

And it shows us that when the quality of annotated video inputs — the training set — increases along with a corresponding increase in computational resources, the quality and realism of the AI-generated video skyrockets, enabling the AI to produce longer and longer videos.

Unlocked

OpenAI’s Sora has clearly had a breakthrough in patching together blocks of images in ways that make sense in both space and time. This is referred to as being both spatially and temporally consistent. This has been a major problem to solve in generative AI.

And just look at the progress already.

Darth Vader and Obi-Wan on Tatooine was produced in March 2023.

Scarlett Johansson eating spaghetti was also produced in March 2023 — 11 months ago.

And just look at what the technology has evolved into today.

No matter what assumptions we make about the forthcoming pace of improvement in generative AI applied to video, one thing is clear…

The videos are going to get longer and even higher resolution. Any jerkiness or artifacts will start to disappear. And the videos will look like they are professionally produced.

The ramifications are extraordinary. With a well-written detailed prompt, we can bring pretty much anything back to life at any point in time. As long as there are videos and/or images related to that time, a generative AI like Sora will be able to create a video of any prompt and predict with accuracy what the video should look like.

This will be an amazing tool for education, entertainment, and gaming applications. It will dramatically reduce the cost and time required to generate video content, and in time, it will empower individuals to use the technology to create video content specifically for their own pleasure.

Just imagine using the technology to create your own drama series for as many seasons as you want. Or to explore the feudal period of Japan wearing an Apple Vision Pro headset loaded with generative video from something like Sora (which, fittingly, is Japanese for “sky”). Talk about blue sky potential!

But perhaps there is even more to this story... 

Perhaps we’ve just unlocked something that we’ve been living with all along.

A long-lived theory in Silicon Valley is that we’re all “living” in a simulation. A simulation that has been created by a civilization far more advanced than us homo sapiens.

Even OpenAI’s stated purpose for Sora is “towards building general purpose simulators of the physical world.”

After all, when we think about it, almost all technological development involves simulations. We model things with software, test them on computers, optimize them, and eventually deploy the best models in real life.

Any advanced civilization would do the same.

As the theory goes, the odds of us being in a simulation far outweigh the likelihood that we just happen to be the single sentient life form in our entire universe.

We may very well be several billion AIs interacting in an artificial world created by a generative AI that is a thousand times more powerful than Sora is today.


New reader? Welcome to the Outer Limits! We encourage you to visit our FAQ, which you can access right here.

If you have any questions, comments, or feedback, we always welcome them. We read every email and address the most common threads in the Friday AMA. Please write to us here.
