Showing scant regard for the publishing schedule of a small (but much admired ☺️) newsletter on AI, OpenAI announced its text-to-video tool Sora last Thursday afternoon. So I’m around six-and-a-half days too late for any hot takes. Still, it’s worth taking a look at what Sora means for AI-generated video and whether the hype is deserved.
The very first Explainable was about the pretty obvious limitations of text-to-video. Things looked weird, faces melted, and limbs disappeared. Anything that worked stuck to a simple theme and didn’t try anything more ambitious than a slight pan or zoom. And even the strongest stuff was merely a pastiche of a hacky movie trailer.
Sora is a clear step up from that. The sample results shared by OpenAI showed multiple shots in single videos that adhered to a consistent style. The OpenAI announcement acknowledged limitations around the ‘physics of a complex scene’, but the early videos generated via prompts given to Sam Altman looked decent.
But once the Sora videos began to trend on social media, sober analysis was harder to find. “I think we are very near a time when we rarely watch a movie in common. Instead we will create a prompt for the type of movie we want to see, it'll be generated, and potentially never watched again by anyone,” read one post.
“Imagine your favorite books coming to life as movies! Once they figure out consistent characters, we can watch books that were never made into films, maybe with our own aesthetic edits,” read another.
There are two rejoinders to this type of hype.
One: there is no evidence to suggest we are close to that level with text-to-video. Even if Sora 2.0 proves to be a massive leap forward, any coherent visual storytelling would still require a huge amount of work to iron out errors and achieve consistency between clips. Anything longer than a few minutes would maybe, at best, be appealing to a group of students as they first experiment with drugs.
Two: even if we are close to a future where a single prompt sorts out bespoke viewing for the night, that would not be a good thing. There are all sorts of reasons why this type of output would be soulless, unengaging, unoriginal and, well, crap. But the pithiest response I saw, in a week of hot AI takes, was from the writer Dan Sheehan who described the basic generative AI pitch as, "what if art could make you feel more alone?"
There is, of course, a gap between what OpenAI are pitching and what AI hype bros are selling. As stated, the OpenAI press release was relatively restrained. But the billions of dollars investors are pouring into AI at present are based, in part, on buying into the social hype.
Ryan Broderick wrote a piece for Fast Company this week (quoting yours truly) on the prospect of AI search engines and whether anyone really wants or needs them. Reliably cynical AI expert Gary Marcus is quoted in the piece saying generative AI is “a solution in search of a problem”. I would reframe that slightly. Generative AI is a solution to lots of small problems looking for massive money-spinning problems. If text-to-video helps creators, even with the mundane stuff, then it will make money. It doesn’t need to end Hollywood or fundamentally change how we consume media. But investors are looking for that peak Google/Meta unprecedented earnings payout, not a Photoshop-is-a-profitable-arm-of-the-business payout.
That’s likely going to lead to AI being forced on consumers, and creators, whether it helps or not. And that will likely include more big promises about text-to-video replacing the lovingly crafted thing you are currently streaming.
Small Sora Bits #1
As always with these releases, it’s more fun to check out the flaws than to marvel at the wins. Matt Shumer compiled a good thread linking to the things Sora can’t quite grasp. In short, there are struggles with physics, cause-and-effect (birthday candles will not blow out), and some odd hallucinations. To quote Gary Marcus for a second time this week, the physics flaws can be explained thus: “Sora is generalizing patterns of pixels, not patterns of objects in the world”.
Small Sora Bits #2
A product of social media engagement farming is this weird inverse of the potential problem of deepfakes: real videos miscaptioned as Sora content. Unsure if something is AI or not? A reverse image search shows the above video appeared online long before Sora.
Small Sora Bits #3
Speaking of reverse image search, the type of techniques used to verify content coming from warzones can also be used to uncover possible source material for some Sora videos. The excellent Margaret Mitchell of Hugging Face uncovered a photographer and a Reddit user who may have influenced one of the most popular Sora sample videos. These types of deep dives, linking real creators to AI output, could be valuable in the fight for artist compensation.