University of Washington turns still photos into videos

22 Jun 2021

Deep learning extrapolates movement from the pixels of a stationary image.

Motion picture: recreated movement

A project at the University of Washington (UW) has demonstrated that a still image can be used to generate a moving image of the same scene through a deep-learning operation.

As presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in June 2021, the technique extrapolates a looping animation from the image, and so is currently suited to scenes in which a still snapshot has captured an instant of continuous flowing motion.

This proof-of-principle breakthrough, made by a project involving UW and Facebook's Computational Photography group, could indicate a route to uses in VR/AR or other applications where realistic motion and a convincing visual environment need to be conveyed to a viewer.

"What's special about our method is that it doesn't require any user input or extra information," said Aleksander Hołyński of the University's Paul G. Allen School of Computer Science & Engineering.

"All you need is a picture. And it produces as output a high-resolution, seamlessly looping video that quite often looks like a real video."

The project's conference paper indicates that the process is designed to be applied to objects whose motion is well approximated by Eulerian motion, a modelling approach which focuses on particular locations in space through which particles move, rather than on movement of the individual particles themselves.

In the UW project these objects included smoke, water and clouds, scenarios in which particle motion takes place through a static velocity field. A neural network, trained using pairs of images and motion fields, estimates that field from the still image. Following the field from frame to frame then defines each source pixel's trajectory, so that a future frame can be extrapolated.
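As a rough illustration of what a static velocity field implies, the sketch below follows each pixel through the same field for a number of frames and returns how far it has travelled. It is not the project's code: the function name `integrate_static_field`, the array layout and the nearest-pixel lookup are assumptions made for this example, and the field is simply supplied as an array rather than predicted by a network.

```python
import numpy as np

def integrate_static_field(flow, num_steps):
    """Follow every pixel through a static velocity field (hypothetical helper).

    flow: (H, W, 2) array; flow[y, x] = (dx, dy) per frame at that location.
    Returns an (H, W, 2) array with each pixel's total displacement.
    """
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    pos_x, pos_y = xs.copy(), ys.copy()
    for _ in range(num_steps):
        # Eulerian view: the velocity depends on where the particle is now,
        # not on which particle it is. Nearest-pixel lookup, clamped to the
        # image border (an assumption chosen for simplicity).
        ix = np.clip(np.rint(pos_x).astype(int), 0, w - 1)
        iy = np.clip(np.rint(pos_y).astype(int), 0, h - 1)
        step = flow[iy, ix]
        pos_x += step[..., 0]
        pos_y += step[..., 1]
    return np.stack([pos_x - xs, pos_y - ys], axis=-1)

# A uniform rightward drift of one pixel per frame, followed for three frames.
uniform = np.zeros((4, 6, 2))
uniform[..., 0] = 1.0
drift = integrate_static_field(uniform, 3)
```

Because the field never changes, one estimate of it is enough to produce displacements for any frame in the sequence.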

Dynamic animation from symmetric splatting

One key aspect is the training of the deep-learning network, a process in which the network was asked to guess the motion of objects in a moving video when only given the first frame. After comparing its prediction with the actual video, the network learned to identify visual clues to help it predict what happened next. The team's system then uses that information to determine if and how each pixel should move.

"It effectively requires you to predict the future," Hołyński said. "And in the real world, there are nearly infinite possibilities of what might happen next."

The team also built on a technique termed "splatting," developed in the 1990s as an alternative to ray casting for rendering volume data. Splatting works through the data rather than through the output pixels: instead of tracing a ray for each pixel of the final image, it considers each volume element in the data set and works out how that element affects the image.
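Applied to image warping, the same idea means pushing each source pixel to its destination and accumulating whatever lands there. The sketch below is a deliberately simple version, assuming nearest-pixel placement and equal weights; `forward_splat` and its arguments are names invented for this example and it should not be read as the project's implementation.

```python
import numpy as np

def forward_splat(image, displacement):
    """Push each source pixel to its displaced position (hypothetical helper).

    image: (H, W, C) array. displacement: (H, W, 2) array of (dx, dy).
    Returns the warped image and a boolean mask of pixels nothing landed on.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + displacement[..., 0]).astype(int)
    ty = np.rint(ys + displacement[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    acc = np.zeros(image.shape, dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    # np.add.at accumulates correctly when several sources hit one target.
    np.add.at(acc, (ty[inside], tx[inside]), image[inside])
    np.add.at(weight, (ty[inside], tx[inside]), 1.0)

    hole = weight == 0
    warped = acc / np.maximum(weight, 1.0)[..., None]
    return warped, hole

# Shift a one-row image one pixel to the right.
row = np.array([[[1.0], [2.0], [3.0], [4.0]]])
shift = np.zeros((1, 4, 2))
shift[..., 0] = 1.0
warped_row, row_hole = forward_splat(row, shift)
```

In this toy case the leftmost pixel receives nothing and is flagged as a hole, which is the kind of gap that one-directional warping leaves behind.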

UW employed "symmetric splatting," a method that predicts motion in both the forward and backward directions, effectively into the future and the past of an image, and then combines the two into one animation.

"Integrating information from both of these animations ensures that there are never any glaringly large holes in the warped images," according to the UW project.

Additional tweaks, including transitioning different parts of the frame at different times and deciding how quickly or slowly to blend each pixel depending on its surroundings, add to the verisimilitude of the movement created.
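One way to picture the combination of forward and backward animations is as a per-pixel weighted blend. The sketch below assumes a simple linear schedule, in which the forward-warped frame dominates early in the loop and the backward-warped frame dominates late, and falls back on whichever frame has data where the other has a hole. The function `blend_symmetric`, its parameters and the linear weights are assumptions for illustration, not details taken from the article.

```python
import numpy as np

def blend_symmetric(forward, forward_hole, backward, backward_hole, t, num_frames):
    """Blend forward- and backward-warped frames (hypothetical helper).

    forward, backward: (H, W, C) warped frames for loop position t.
    forward_hole, backward_hole: (H, W) boolean masks of missing pixels.
    Returns the blended frame and a mask of pixels missing in both inputs.
    """
    alpha = t / num_frames
    # A hole contributes zero weight, so the other frame fills it in.
    w_f = (1.0 - alpha) * (~forward_hole)
    w_b = alpha * (~backward_hole)
    total = w_f + w_b
    safe = np.where(total > 0, total, 1.0)
    blended = (w_f[..., None] * forward + w_b[..., None] * backward) / safe[..., None]
    return blended, total == 0

# Two-pixel example: the forward frame has a hole at the second pixel.
fwd = np.array([[[10.0], [0.0]]])
fwd_hole = np.array([[False, True]])
bwd = np.array([[[20.0], [40.0]]])
bwd_hole = np.array([[False, False]])
frame, still_hole = blend_symmetric(fwd, fwd_hole, bwd, bwd_hole, t=1, num_frames=4)
```

With a schedule like this the first and last frames of the loop both reduce to the source image, which is what lets the animation repeat without a visible jump.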

After trials using images of beaches, waterfalls and flowing rivers, whose motion is particularly suited to the modelling approach, the project intends to investigate applying the same principles to more complex scenarios.

"We would love to extend our work to operate on a wider range of objects, like animating a person’s hair blowing in the wind," commented Hołyński. "I’m hoping that eventually the pictures that we share with our friends and family won’t be static images. Instead, they’ll all be dynamic animations like the ones our method produces."