The video giant

Dr. Tali Dekel is at the forefront of generative AI research and a partner in the development of Lumiere - Google's video generator. "We want to know if machines can allow us to see the world better," she says

On the left - a photo of a couple in the kitchen, on the right - a photo created by the computer model after it was shown the original photo along with the instruction: "Two robots dancing in the kitchen"

A few years ago this would have sounded completely imaginary: every day, millions of people around the world effortlessly use generative AI systems that produce texts, images and videos at breakneck speed. Some of the outputs look as if they were made by human hands, while others depict sights that never existed.

The rapid progress of large language models (LLMs), which after many years of development began to produce fairly complex and reliable texts, surprised even the experts in the field. As a result, the spotlight also turned to the models that create images and videos, and their development accelerated. Today, these models can create in seconds a realistic video of a city street or of a squirrel walking on the moon; all that is required is to feed them a short text prompt or an image to serve as a visual reference. But alongside the enormous capabilities, and the concerns about the risks inherent in computers with such powers, what deep learning networks can actually do is still limited, especially when it comes to video, and this is a challenge that preoccupies many scientists.

In the computer vision research laboratory of Dr. Tali Dekel of the Department of Computer Science and Applied Mathematics at the Weizmann Institute of Science, the researchers strive to break through the limitations of these generative machines, and to bring them to the human level and perhaps even beyond it. "I define our field of research as Re-Rendering Reality, meaning a re-creation of the visual world using computational tools," says Dr. Dekel. "We analyze images and videos and focus on certain elements in them, then create a new version of the image or video with different characteristics. My goal is to enrich the way we see the world, to allow us more creativity and even a new kind of interaction with visual information. During the research, we raise interesting questions, such as 'Can machines allow us to see the world better?'" she adds.

Alongside her work at the Weizmann Institute, Dr. Dekel is also a researcher at Google. While at the Weizmann Institute she focuses on breaking through the limitations of existing artificial intelligence models, at Google she is a partner in the development of new ones, such as the groundbreaking "Lumiere" video model, whose outputs were recently revealed to the general public. Lumiere can produce a rich and impressive variety of videos, or edit existing videos according to instructions fed to it as a short sentence or a reference image. For example, a series of videos shows how a woman running in a park becomes a figure made of wooden blocks, colored rectangles or flowers. When Lumiere received an image of an old steaming locomotive traveling on a railroad, with the section of the image containing the smoke marked, the model created a partially animated image in which only the smoke moved, and did so consistently with the other parts of the image, which remained unchanged. In other amusing examples, da Vinci's Mona Lisa yawns, and the girl with the pearl earring from Vermeer's painting smiles.

"Lumiere is a model of text-to-video, which creates videos with realistic, varied and coherent movement - a prominent challenge in creating videos", write the researchers, including Dr. Dekel, inArticle which presents the model. The uniqueness of Lumiere is the ability to create a complete sequence of frames without breaks between them, compared to other models that first produce central and distant frames on the continuum of time and space, and only then complete the movement that occurs between them. Because of this, in the other models there is difficulty in maintaining reliable and convincing movement, while Lumiere is able to create complete sequences of movement of extremely high quality.

But how do deep learning models manage to perform these feats? It turns out that this is not entirely clear even to scientists. Dr. Dekel explains: "The field of generative artificial intelligence has undergone a paradigm shift. In the recent past, models were much smaller, simpler and designed to solve specific tasks, often by using labeled data. For example, in order to teach a computer to recognize objects in images, it was necessary to show it a collection of images in which the objects are labeled and explain to it that here is a car, there is a cat, and so on. Today, the models have grown and improved and are able to learn from a huge amount of information, without human labeling. The models learn a universal representation of the visual world that can be used for a variety of tasks, and not just for the specific task they were originally trained on." But while the models' self-learning ability keeps improving, we still do not know exactly how they work. "Significant parts of artificial neural networks are 'black boxes' for us," adds Dr. Dekel. The enigma becomes more acute when it comes to models that create videos, because each second of video consists of about 25 different images. The networks required for this, and the computational challenges they face, are therefore even greater than for models that create texts or images - and so is the part of their operation that researchers do not understand.
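A minimal sketch can make the idea of a reusable "universal representation" concrete. The snippet below is a stand-in rather than the lab's code: it takes a vision network pre-trained for one task (ImageNet classification here, standing in for large-scale self-supervised training), discards the task-specific head, and reuses the remaining features for a task the network was never explicitly trained for - judging how similar two images are.

```python
import torch
import torchvision

# Load a pre-trained backbone and drop its classification head, keeping only the representation.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed(image: torch.Tensor) -> torch.Tensor:
    """Map a (3, 224, 224) image to a 512-dimensional feature vector."""
    with torch.no_grad():
        return backbone(image.unsqueeze(0)).squeeze(0)

# Stand-ins for real photos; in practice these would be normalized camera images.
img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)

# Reuse the learned representation for a new task: measuring visual similarity.
similarity = torch.nn.functional.cosine_similarity(embed(img_a), embed(img_b), dim=0)
print(f"feature similarity: {similarity.item():.3f}")
```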

For Dr. Dekel, the "black boxes" of the models are a fruitful opportunity for research: "During the process of self-study, the models gained tremendous knowledge about the world. As part of the research on recreating reality with digital tools, we try to produce new products from the existing models almost without changing them, but only by better deciphering their methods of operation while trying to reveal new tasks that they are capable of performing," says Dr. Dekel About the research in which Dr. Shai Bagon from the Weizmann Institute of Science, Yoni Kasten from Envidia and the students Omer Bar Tal, Narek Tomanian, Michal Geyer, Raphael Friedman and Dana Yatim are partners.

The researchers in Dr. Dekel's lab are looking for smart processing methods that break the content down into simpler components, such as one image showing the video's background and additional images, each dedicated to an object that changes during the video. This separation makes editing much easier: instead of processing a huge number of pixels, only one image is edited and all the other frames change accordingly. For example, if the color of a dress changes in one frame, the model understands how to propagate the change to the entire video so that continuity is maintained. Another challenge that preoccupies the researchers stems from the fact that many of the models' outputs do not look believable, and the objects that appear in them move differently than we would expect based on our experience of the world.
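The decomposition idea can be illustrated with a toy example. The sketch below uses hypothetical numpy arrays - a flat background and a single small object layer - rather than the lab's actual method, but it shows the key property: the video is rendered from a handful of images plus per-frame placements, so editing one image once changes every frame consistently.

```python
import numpy as np

H, W, T = 48, 64, 10                                 # toy frame size and number of frames
background = np.full((H, W, 3), 0.8)                 # one image represents the whole background
foreground = np.zeros((8, 8, 3))
foreground[...] = [0.9, 0.1, 0.1]                    # one small image represents the moving object (red)
positions = [(5, 3 + 5 * t) for t in range(T)]       # where the object sits in each frame

def render(bg: np.ndarray, fg: np.ndarray) -> np.ndarray:
    """Re-compose the full video from the background layer, the object layer and its positions."""
    frames = []
    for y, x in positions:
        frame = bg.copy()
        frame[y:y + 8, x:x + 8] = fg                 # paste the object layer over the background
        frames.append(frame)
    return np.stack(frames)

original = render(background, foreground)

edited_foreground = foreground.copy()
edited_foreground[...] = [0.1, 0.1, 0.9]             # edit the single object image once (red -> blue)
edited = render(background, edited_foreground)       # the change propagates to all frames automatically

print(original.shape, edited.shape)                  # (10, 48, 64, 3) for both videos
```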

As part of the effort to get the models to produce videos in which the motion is consistent and logical, Dr. Dekel's lab showed how to extend a model that generates images from text so that it can also create and edit videos. For example, they fed the open-source model Stable Diffusion a video of a wolf turning its head from right to left, and asked it to create a similar video of a wolf-like rag doll. At first the video created by the model looked fragmented and unconvincing, but by identifying the representations of the various components in the images, and by better understanding the instructions that must be fed to the model, the researchers were able to create a video in which the wolf puppet moves convincingly.
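For readers who want to see what the starting point looks like in practice, the hedged sketch below applies Stable Diffusion (via the open-source diffusers library) to a video one frame at a time. The prompt, file names and strength value are illustrative assumptions, and this naive per-frame loop is exactly what yields the fragmented result described above; the lab's contribution, enforcing consistency across frames through the model's internal representations, is not reproduced here.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load an off-the-shelf text-guided image-to-image pipeline (requires a GPU and a model download).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def edit_frame(frame: Image.Image, prompt: str, seed: int = 0) -> Image.Image:
    """Edit one video frame according to a text prompt, independently of all other frames."""
    generator = torch.Generator("cuda").manual_seed(seed)     # a fixed seed reduces flicker a little,
    result = pipe(prompt=prompt, image=frame, strength=0.6,   # but frames are still edited in isolation
                  generator=generator)
    return result.images[0]

# Hypothetical usage: edit every frame of the wolf video with the same prompt.
# frames = [Image.open(f"wolf/{i:04d}.png") for i in range(32)]
# edited = [edit_frame(f, "a plush wolf puppet turning its head") for f in frames]
```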


Dr. Dekel recently received a European Research Council (ERC) grant for young researchers in the amount of 1.5 million euros. As part of the grant, she plans to tackle additional limitations that hold the models back when creating and editing videos. Due to the great complexity of video processing, there is a significant gap between the knowledge that such a model has gained from the many videos it was trained on and the unique characteristics of motion in the particular video it is asked to create. Dr. Dekel will try to develop a model that can better generalize from its accumulated experience with thousands of different videos to the needs of a single video.

And what about the concerns over the tremendous power inherent in these computer models? Dr. Dekel says: "There is a delicate balance between awareness of the technology's impact and the risks it carries, and the desire to advance it, and it is our responsibility to maintain that balance. It may sometimes seem to the general public as if the models are omnipotent, but that is not the case today. My main goal as a researcher is to expand the creative possibilities available to everyone, even those who are not professionals, and to advance science and the computational ability to see the world."
