SIGGRAPH Asia 2024 in Tokyo, Session "Diffusing Your Videos", Thursday, Dec. 5, 14:45-15:55, Hall B5 (2), B Block, Level 5
Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera's field of view. Stitching the frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene's motion.
We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features.
We first project the input panning video onto a panoramic canvas (see figure below). We then complete this partial space-time volume using a generative video model with outpainting capabilities.
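As a rough illustration of what the partial space-time volume looks like, the sketch below pastes each frame onto a wider canvas and records which pixels were observed. It assumes a translation-only pan with known per-frame horizontal offsets, which is a simplification for illustration; the function names and the `x_offsets` input are hypothetical and not taken from the paper, and a real registration step would estimate a proper camera projection.

```python
def place_on_canvas(frame, x_offset, canvas_width):
    """Paste one frame (a list of pixel rows) onto a blank canvas.

    Returns (canvas, mask), where mask marks the pixels covered by the frame.
    Assumption: the camera motion is a pure horizontal shift of x_offset pixels.
    """
    height, width = len(frame), len(frame[0])
    if x_offset < 0 or x_offset + width > canvas_width:
        raise ValueError("frame does not fit on the canvas at this offset")
    canvas = [[0.0] * canvas_width for _ in range(height)]
    mask = [[False] * canvas_width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            canvas[y][x_offset + x] = frame[y][x]
            mask[y][x_offset + x] = True
    return canvas, mask


def build_spacetime_volume(frames, x_offsets, canvas_width):
    """Stack per-frame canvases and masks into a partial space-time volume."""
    volume, masks = [], []
    for frame, x_offset in zip(frames, x_offsets):
        canvas, mask = place_on_canvas(frame, x_offset, canvas_width)
        volume.append(canvas)
        masks.append(mask)
    return volume, masks


# Two 1x2 frames from a camera that pans two pixels to the right.
frames = [[[1.0, 2.0]], [[3.0, 4.0]]]
volume, masks = build_spacetime_volume(frames, x_offsets=[0, 2], canvas_width=4)
```

The unobserved entries (where the mask is `False`) are the region an outpainting model would be asked to fill.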
Since the input videos span a wider spatial and temporal range than typical generative video models' context window sizes, we use Temporal Coarse-to-Fine and Spatial Aggregation strategies to complete the video panoramas.
The registered input video (b) is temporally downsampled with temporal prefiltering. A base panoramic video is synthesized at the coarsest temporal scale (top), then gradually refined by temporal upsampling, merging, and resynthesis (c). Finally, a spatial super-resolution pass is applied and the original input pixels are merged with the result to produce the output video (d).
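The temporal coarse-to-fine loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: frames are flat lists of pixel values, the prefilter is a simple box average, upsampling is nearest-neighbour, and `synthesize` is a hypothetical callback standing in for the generative video model. The final spatial super-resolution pass is omitted.

```python
def temporal_downsample(frames, factor):
    """Box-prefilter and subsample: average each run of `factor` frames."""
    coarse = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        n_pixels = len(group[0])
        coarse.append([sum(f[i] for f in group) / len(group) for i in range(n_pixels)])
    return coarse


def downsample_mask(mask, factor):
    """A coarse pixel counts as observed only if every frame in its group saw it."""
    coarse = []
    for start in range(0, len(mask), factor):
        group = mask[start:start + factor]
        coarse.append([all(m[i] for m in group) for i in range(len(group[0]))])
    return coarse


def temporal_upsample(frames, factor, length):
    """Nearest-neighbour upsampling to `length` frames (a simplification)."""
    return [list(frames[min(t // factor, len(frames) - 1)]) for t in range(length)]


def merge_known(synth, observed, mask):
    """Keep observed pixels where the mask is True, synthesized pixels elsewhere."""
    return [[o if m else s for s, o, m in zip(sf, of, mf)]
            for sf, of, mf in zip(synth, observed, mask)]


def coarse_to_fine(observed, mask, factors, synthesize):
    """Synthesize at the coarsest factor first, then refine at each finer one.

    `factors` is a decreasing list such as [4, 2, 1]; `synthesize(frames, mask)`
    is a hypothetical stand-in for the outpainting model.
    """
    result, prev_factor = None, None
    for factor in factors:
        obs = temporal_downsample(observed, factor)
        msk = downsample_mask(mask, factor)
        if result is None:
            guide = obs  # coarsest level: nothing to upsample yet
        else:
            guide = temporal_upsample(result, prev_factor // factor, len(obs))
        result = synthesize(merge_known(guide, obs, msk), msk)
        prev_factor = factor
    return result


# With a fully observed input and an identity "model", the input is returned.
observed = [[0.0], [2.0], [4.0], [6.0]]
mask = [[True], [True], [True], [True]]
refined = coarse_to_fine(observed, mask, [2, 1], lambda frames, m: frames)
```

Synthesizing at the coarsest temporal scale first lets a model with a short context window cover the whole clip, with finer levels only needing to add temporal detail.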
Spatially, we cover the video width using sliding windows and fuse them into one coherent video. To generate a sample in the overlap (red), we linearly interpolate the two predicted probability distributions (purple, orange) and sample from the aggregated distribution (brown). With a token-based method, the distribution is discrete over the token vocabulary. With diffusion, it is a Gaussian over pixel values, represented by 𝜇 and Σ.
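A minimal sketch of the overlap blending is given below. The linear ramp across the overlap and the helper names are assumptions made for illustration. For the token case, the two predicted distributions are mixed directly; for the Gaussian case, the sketch interpolates the per-pixel mean and variance, which is one plausible reading of the description rather than a confirmed detail of the method.

```python
import random


def blend_weight(x, overlap_start, overlap_end):
    """Weight of the right-hand window at column x, ramping 0 -> 1 across the overlap."""
    if overlap_end <= overlap_start:
        raise ValueError("overlap must have positive width")
    t = (x - overlap_start) / (overlap_end - overlap_start)
    return min(1.0, max(0.0, t))


def blend_discrete(p_left, p_right, w):
    """Linearly interpolate two discrete distributions over the same vocabulary."""
    mixed = [(1.0 - w) * a + w * b for a, b in zip(p_left, p_right)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to guard against rounding


def sample_token(probs, rng):
    """Draw one token index from the aggregated distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def blend_gaussian(mu_left, var_left, mu_right, var_right, w):
    """Assumption: interpolate the parameters of two per-pixel Gaussians."""
    mu = (1.0 - w) * mu_left + w * mu_right
    var = (1.0 - w) * var_left + w * var_right
    return mu, var


# A column one quarter of the way into the overlap leans on the left window.
w = blend_weight(2.5, 0.0, 10.0)
probs = blend_discrete([1.0, 0.0], [0.0, 1.0], w)
token = sample_token(probs, random.Random(0))
```

Sampling once from the aggregated distribution, rather than sampling each window separately and averaging the results, is what keeps the overlap region from showing two inconsistent predictions.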
Below we show more results on synthetic panning videos and compare against two flow-based video completion methods, ProPainter and E2FGVI.
@inproceedings{ma2024vidpanos,
title={VidPanos: Generative Panoramic Videos from Casual Panning Videos},
author={Jingwei Ma and Erika Lu and Roni Paiss and Shiran Zada and Aleksander Holynski and Tali Dekel and Brian Curless and Michael Rubinstein and Forrester Cole},
booktitle={SIGGRAPH Asia 2024 Conference Papers},
month={December},
year={2024}
}