
Text to Video Generative AI Is Finally Here and It’s Weird as Hell


Runway is planning the release of its Gen-2 text-to-video AI, but the janky clip generator ModelScope may be the first AI video generator to catch the internet's attention.

What could the AI be looking at, I wonder.
Gif: Runway/Gizmodo

I like my AI like I like my foreign cheese varieties: incredibly weird and full of holes, the kind that leaves most definitions of “good” up to individual taste. So color me surprised as I explored the next frontier of public AI models and found one of the strangest experiences I’ve had since the bizarre AI-generated Seinfeld knockoff Nothing, Forever was first released.


Runway, one of the two startups that helped give us the AI art generator Stable Diffusion, announced on Monday that the first public test of its Gen-2 AI video model would go live soon. The company made the stunning claim that it was the “first publicly available text-to-video model out there.” Unfortunately, a more obscure group with a much jankier initial text-to-video model may have beaten Runway to the punch.

Google and Meta are already working on their own text-to-video generators, but neither company has been very forthcoming with news since those projects were first teased. The relatively small, 45-person team at Runway has been known for its online video editing tools and, since February, for its video-to-video Gen-1 AI model, which could transform existing videos based on text prompts or reference images. Gen-1 could turn a simple render of a stick figure swimming into a scuba diver, or a man walking on the street into a claymation nightmare with a generated overlay. Gen-2 is supposed to be the next big step up, allowing users to create 3-second videos from scratch based on simple text prompts. While the company has not let anybody get their hands on it yet, it has shared a few clips based on prompts like “a close up of an eye” and “an aerial shot of a mountain landscape.”

Few people outside the company have been able to experience Runway’s new model, but if you’re still hankering for AI video generation, there’s another option. A text-to-video system called ModelScope was released over the past weekend and has already caused some buzz for its occasionally awkward and often insane 2-second video clips. The DAMO Vision Intelligence Lab, a research division of e-commerce giant Alibaba, created the system as a kind of public test case. According to the page describing the model, the system uses a fairly basic diffusion model to create its videos.

ModelScope is open source and already available on Hugging Face, though it may be hard to get running without paying a small fee to host it on a separate GPU server. Tech YouTuber Matt Wolfe has a good tutorial on how to set that up. Of course, you could also run the code yourself, if you have the technical skill and the VRAM to support it.
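If you do want to try running it locally, here’s a minimal sketch of what that might look like using the Hugging Face diffusers library. The model id, arguments, and output handling are assumptions based on the public Hugging Face listing and may vary with your library versions.

```python
# Minimal sketch: generating a short ModelScope clip with Hugging Face diffusers.
# The model id and output handling are assumptions and may vary by version.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",  # assumed Hugging Face model id
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # helps the model fit on consumer GPUs

# Roughly two seconds of footage from a single text prompt.
video_frames = pipe("A close up of an eye", num_inference_steps=25).frames
print(export_to_video(video_frames, output_video_path="eye.mp4"))
```

Even with half-precision weights and CPU offloading, expect to need a GPU with a healthy amount of free VRAM.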

ModelScope is pretty blatant about where its data comes from. Many of its generated videos contain the vague outline of the Shutterstock logo, meaning the training data likely included a sizable portion of videos and images taken from the stock photo site. It’s a similar issue with other AI image generators like Stable Diffusion. Getty Images has sued Stability AI, the company that brought that AI art generator into the public light, noting how many Stable Diffusion images contain a corrupted version of the Getty watermark.

Of course, that still hasn’t stopped some users from making small movies with the rather awkward AI, like this pudgy-faced Darth Vader visiting a supermarket, or this clip of Spider-Man and a capybara teaming up to save the world.

As far as Runway goes, the group is looking to make a name for itself in the ever-more crowded world of AI research. In the paper describing the Gen-1 system, Runway researchers said the model was trained on a “large-scale dataset” of both images and videos, with text-image data alongside uncaptioned videos. The researchers found there was simply a lack of video-text datasets with the same quality as the image datasets scraped from the internet, which forced the company to derive its training data from the videos themselves. It will be interesting to see how Runway’s likely more polished version of text-to-video stacks up, especially once heavy hitters like Google show off more of their longer-form narrative videos.

If Runway’s new Gen-2 waitlist is anything like the one for Gen-1, users can expect to wait a few weeks before they fully get their hands on the system. In the meantime, playing around with ModelScope may be a good first option for those looking for more weird AI interpretations. And of course, it’s only a matter of time before we’re having the same conversations about AI-generated videos that we now have about AI-created images.

The following slides are some of my attempts to compare Runway to ModelScope and to test the limits of what text-to-video can do. I transformed the videos into GIF format using the same parameters on each, so the framerate of the GIFs is close to that of the original AI-created videos.
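As an aside, the video-to-GIF conversion step is easy to reproduce. Here’s a minimal sketch using the imageio library; the file names are hypothetical, and the exact GIF-writer keyword arguments vary between imageio versions.

```python
# Minimal sketch: converting a short AI-generated MP4 into a GIF at roughly
# the same frame rate. File names are hypothetical; reading MP4s requires
# the imageio-ffmpeg backend.
import imageio

reader = imageio.get_reader("modelscope_clip.mp4")
fps = reader.get_meta_data().get("fps", 8)  # fall back to 8 fps if missing
frames = [frame for frame in reader]

# In the legacy v2 API, duration is seconds per frame; newer imageio versions
# expect milliseconds, so adjust accordingly.
imageio.mimsave("modelscope_clip.gif", frames, duration=1 / fps, loop=0)
```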

Comparing Runway Gen-2 to ModelScope: an eye

Gif: Runway/ModelScope/Gizmodo

We only have a few examples of videos that Runway says were generated by its Gen-2 system, so why not compare them to the less restrictive ModelScope? This first example is from the prompt “A close up of an eye.”


Runway’s version is obviously much higher in resolution and fidelity, and check out that Shutterstock watermark hanging over ModelScope’s. Still, both are surprisingly lifelike, if only because they both seem surprised to be here.

Comparing Runway Gen-2 to ModelScope: a mountain

Gif: Runway/ModelScope/Gizmodo

Let’s now compare the two models with the prompt “An aerial shot of a mountain landscape.” Again, Runway’s promotional vid is pretty convincing, if still pretty grainy.

Comparing Runway Gen-2 to ModelScope: a desert

Gif: Runway/ModelScope/Gizmodo

Current-generation systems do seem capable of handling landscapes, though ModelScope’s take is very zoomed in compared to its counterpart. The prompt “drone footage of a desert landscape” is obviously more sweeping in Runway’s promotional version, but you can see how the terrain warps to some extent as the video plays.

Comparing Runway Gen-2 to ModelScope: an apartment

Gif: Runway/ModelScope/Gizmodo

The prompt “Sunset through a window in a New York apartment” shows off the AI’s ability to create depth in images. Runway’s version is, of course, much more serene, but the open source model captures how most people actually experience New York’s windows: staring at the building across from you.

Comparing Runway Gen-2 to ModelScope: jungle brush

Gif: Runway/ModelScope/Gizmodo

The deciding point for how well text-to-video AI performs will be how it represents people. This prompt for “A shot following a hiker through jungle brush” is effective, if a little hard to make out. Runway has yet to show off people’s faces in its promotional material, but keep going and you’ll see just how awkwardly ModelScope tries to represent human likenesses.

ModelScope: Shots with human faces

Gif: ModelScope/Gizmodo

The absolute strangest time I had with text-to-video AI was trying to get it to generate faces. Here are the prompts I used to create these strange hallucinations:

  • A scene from Star Trek: The Next Generation of Picard pointing and yelling
  • Michael Jordan dunking
  • Ellie from The Last of Us punches Joel
  • Ticketmaster crashing during pre-sale for the Eras Tour

ModelScope: Inhuman characters

Gif: ModelScope/Gizmodo

You would think creating cartoon versions of characters would be easier on the AI, but not always. I went for some pretty mundane prompts, including:

  • Bugs Bunny eating a carrot
  • A quirky little robot
  • A cartoon villain twirling his mustache
  • An alligator smoking a pipe

Now, the AI may not have had any training on Looney Tunes’ classic characters, but it still created a little bit of nightmare fuel for me to chew on. The villain doesn’t even try to twirl his ’stache, and the alligator, while looking pretty accurate, has no pipe in sight.

ModelScope: Let’s get weird

Gif: ModelScope/Gizmodo

Let’s say I wanted to test this particular AI’s limits. Unfortunately, those limits come hard and fast. My prompts included:

  • Two people playing rock paper scissors
  • Financial bro in business suit huffing on a vape outside a New York city skyrise
  • Timelapse of man sitting at a park
  • Batman throwing a molotov cocktail

AI image generators usually have a tough time rendering realistic hands, and it’s clear this open source model struggles with them as well. Instead of putting the vaping financial bro on the ground, the AI placed him in front of a city skyscraper, hovering in the air. It also stuck the man sitting at the park just off his bench, hovering there like a mime. Then it created perhaps the best Batman crime-fighting gadget since the batarang: the Bat-molotov hip flask.
