For the third meeting of the Vision-Language Club, we were grateful to host Hila Chefer for a talk on Google's new text-to-video diffusion model - Lumiere!
In this talk, I will present Lumiere, our latest text-to-video model from Google Research. Lumiere is designed for synthesizing videos that portray realistic, diverse, and coherent motion - a pivotal challenge in video synthesis. We achieve this by introducing a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model. This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution - an approach that inherently makes global temporal consistency difficult to achieve.
During this talk, we will delve into the Space-Time U-Net architecture proposed by Lumiere, comparing it to existing text-to-video models. Additionally, we will explore the broad range of applications facilitated by our model, including image-to-video generation, video stylization, video editing, cinemagraphs, and more.
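To make the architectural contrast concrete, here is a minimal, hypothetical sketch of the core idea behind a Space-Time U-Net: the network downsamples and upsamples the clip in both space *and* time, so the full temporal duration is processed in one forward pass rather than via keyframes plus a separate temporal super-resolution stage. This is not Google's Lumiere code; all module names, channel sizes, and shapes below are illustrative assumptions.

```python
# Hypothetical sketch of joint space-time down/up-sampling (NOT Lumiere's code).
import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Jointly downsamples time (T) and space (H, W) with 3D convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        # stride 2 along T, H, and W halves both temporal and spatial extent
        self.down = nn.Conv3d(out_ch, out_ch, kernel_size=3,
                              stride=(2, 2, 2), padding=1)
        self.act = nn.SiLU()

    def forward(self, x):  # x: (batch, channels, T, H, W)
        return self.down(self.act(self.conv(x)))

class SpaceTimeUpBlock(nn.Module):
    """Mirrors the down block: upsamples T, H, and W back toward full size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4,
                                     stride=(2, 2, 2), padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.up(x))

# Toy end-to-end pass: an 80-frame clip is handled as a single tensor, so the
# network "sees" the whole temporal duration at once - no distant keyframes,
# no separate temporal super-resolution model.
video = torch.randn(1, 8, 80, 64, 64)    # (batch, channels, frames, H, W)
down, up = SpaceTimeDownBlock(8, 16), SpaceTimeUpBlock(16, 8)
h = down(video)                           # -> (1, 16, 40, 32, 32)
out = up(h)                               # -> (1, 8, 80, 64, 64)
print(h.shape, out.shape)
```

The key design point the sketch illustrates is the temporal stride: by compressing the frame axis inside the network, a much longer clip fits in memory for a single pass, which is what makes globally consistent motion easier to achieve than stitching keyframes after the fact.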
Hila is a PhD candidate at Tel Aviv University, advised by Prof. Lior Wolf, and a research intern at Google in Tel Aviv. Her research centers on computer vision and multi-modal learning, with a particular focus on developing methods to understand deep neural networks and leveraging those insights to enhance the expressiveness, robustness, and fairness of such models.