The Middle Table AI Challenge: How it went...

So you’ve seen the result of the AI challenge (if you haven’t, go here for the final film, and here for the behind-the-scenes film).

Now let me ask you a question: how do you use a watch? If you’re not sure, I can show you someone doing it. Here they are. Their watch is wrapped around their wrist, sitting comfortably next to their wrist bone. They’ve lifted their forearm into a position roughly parallel to the ground and then rotated their wrist towards them to bring the watch face into line with their own. Their head is tilted slightly downwards and their eyeline has dipped a little further still, to be able to focus on the watch. They have a neutral expression.

If that seems like a tortuous description of something incredibly mundane, it does in fact tell us two things: first, that even a seemingly simple action can be quite involved when broken down into its constituent parts; second, that there is an awful lot for AI to learn when it comes to understanding and recreating the very acts that make us human.

A still frame from the AI Challenge film showing the imaginary Xaltris A-200 smartwatch, as envisaged by Runway

The “Xaltris A-200 smartwatch”, as imagined by Runway

All of which is, in itself, a long-winded way of describing the biggest challenge we faced when trying to make a film using only AI tools. The brief was to make an ad for a futuristic smartwatch and was designed deliberately to force the team to replicate the sorts of scenes they’d see in an ordinary, traditionally made video. So, alongside any AI magic they could generate, we wanted shots of people using the watch. How would AI get on?

The answer is, it has to be said, not very well at all, as you can see in the final version. It’s important to say that this wasn’t a scientific trial. The team had one week, were effectively learning on the job, and couldn’t test every available program to its maximum extent. But a week feels like enough time to gauge the good, the bad, the intuitive and the head-bangingly impossible when it comes to AI as of mid-2024. There was a fair bit of each of those things.

AI showed its best side during pre-production, mostly with the sort of stuff you’ll know about. The likes of ChatGPT can turn a simple prompt - like “we’re a production company, we’ve been given this brief, we need to make an ad, please tell us some cool ways to do it” - into a perfectly workable set of ideas in no time. The scripts and storyboards it generated from those ideas were also very good, certainly as a V1. Crucially, there was nothing wrong with them, and it took seconds with only a beginner’s knowledge of prompting.

A still image from the AI Challenge behind-the-scenes video, showing ChatGPT producing ideas based on a brief.

ChatGPT produced a list of workable ideas for the brief within a matter of seconds

Where it fell short was with generating imagery, in particular moving footage. The programs that produced the best initial images - ie, those that interpreted prompts most successfully - were not so great at animating them, and vice versa. Instead, the majority of moving shots in the film were the result of a combination of programs, starting out as images generated by Midjourney before being animated in Runway.

The reason for AI’s struggle is its fundamental flaw: it doesn’t really understand what it’s generating. Sure, it can dig deep into its vast array of references to generate a realistic enough picture of, say, a beautiful sunflower. But it can’t depict the intricate dance of the petals as the sunflower opens, because it doesn’t understand what those petals are doing. It also knows a person might get from A to B by walking. But its best guess at the mechanics of walking is about 80% accurate, and it turns out that the other 20% is quite crucial if you don’t want to see a mangle of bones. It’s the same for the way cars drive, birds fly and people look at smartwatches.

There are a growing number of AI programs that profess to replicate other human attributes. As you’d imagine, the appreciable benefits they offer sit alongside more worrying implications. Elevenlabs is a text-to-speech generator that, alongside its library of generic-sounding voices, allows you to replicate any existing voice - accent, intonations and all. The outputs could easily pass for real even if, on closer listening, they’re perhaps ‘only’ 90% accurate and lack that tangible warmth and sparkle. It is enough to worry voiceover artists everywhere. Synthesia, meanwhile, goes a step further in offering lifelike avatars that could one day do away entirely with the need for physical presenters. For the time being, their animation only serves to highlight how the interplay of facial muscles is even more complex than that of the muscles which control the legs.

One of Synthesia’s lifelike (at least, in still form) avatars © Synthesia

It’s not that AI is bad. Far from it. You don’t have to think too hard to imagine where its multitude of features can be beneficial, and given AI’s exponential progress those features will only become more useful in time. We’ve already had a glimpse of the next generation of Gen AI thanks to videos made using Sora - OpenAI’s as-yet-not-publicly-available foray into footage generation. The flaws appear to have been largely ironed out, and it seems capable of an impressive level of realism and even humanity. But even those videos have that unmistakably ethereal, over-smooth, AI feel to them, and production teams haven’t hidden the fact that they required multiple non-AI interventions to get them over the line. The sheer fact that it takes teams to make films with Sora also says a lot.

At the start of the challenge we’d kind of expected AI to be easy to grasp and able to generate brilliant footage from the most basic inputs. Indeed, it’s what AI’s proponents want and need you to believe. It’s surely only a matter of time before that’s the case. But until then, we’ll still need people to put in the legwork in more senses than one. Because when it comes to being human, those intangible qualities of experience, endeavour and creativity are as vital as they are difficult to replicate.