Abstract
In recent years, there has been growing interest in creating interactive and immersive 3D experiences from the photos and videos we capture on our smartphones. In free-viewpoint video and photographs, the viewer can interactively choose a viewpoint in 3D space and view the scene from that perspective. However, existing methods for creating 3D free-viewpoint content either require complex camera equipment or prohibitively expensive computer hardware, or are limited to specific types of video (e.g., TV broadcasts of football matches).
The main aim of this thesis was to develop an approach for creating 3D free-viewpoint video from general monocular video that is suitable for virtual reality, can be run on consumer hardware, and runs in a reasonable amount of time relative to other methods. To achieve this, we developed HIVE (Home Immersive Video Experience), a mesh-based approach that leverages structure-from-motion and deep learning models for monocular depth estimation, instance segmentation and image inpainting. Our approach builds upon and extends a method called Soccer on Your Tabletop, which converts TV broadcasts of football matches into dynamic 3D mesh videos. To apply this approach to casual video, we introduced a method for recovering metric-scale pose data from structure-from-motion (COLMAP) and estimated depth data, a method for adapting TSDF fusion to work with estimated depth data for background mesh reconstruction, as well as other improvements. We implemented a web-based viewer that enables users to view our free-viewpoint video on a regular computer display, a mobile device, or in stereoscopic 3D through a virtual reality headset.
Our experiments showed that HIVE can create free-viewpoint video from monocular video on consumer hardware relatively quickly. We found that our approach is substantially faster than state-of-the-art monocular Neural Radiance Field (NeRF) free-viewpoint video techniques, uses substantially less GPU memory, and is significantly more accessible to the general public in terms of hardware requirements. We also showed that the visual quality of the free-viewpoint videos produced by HIVE compares favourably with these NeRF-based techniques. We validated our approach to recovering metric-scale pose data with estimated depth data and showed that it is accurate. Through a user study, we found that HIVE improves users’ sense of presence in virtual reality.
Many approaches for creating free-viewpoint photographs and videos, including HIVE, rely on deep convolutional neural networks for depth estimation, which often have high GPU memory usage and thus require expensive computer hardware. As part of our aim of ensuring that our approach can be run on consumer hardware, we also investigated monocular depth estimation with the aim of developing a more efficient model architecture. We ran experiments in which we replaced the encoder portion of state-of-the-art depth estimation models that follow the encoder-decoder architecture with more efficient models. Our experiments showed that the efficiency gains of efficient image classification models can be transferred to depth estimation models with a minimal loss of accuracy.
However, the metrics used to evaluate the accuracy of depth maps are not intuitive, making it difficult to understand the impact that lower accuracy would have on 3D content created with these depth maps. We created free-viewpoint photographs from our efficient depth estimation models and ran further experiments to better understand the relationship between the quality of depth maps, as measured by depth estimation metrics, and the visual quality of free-viewpoint photographs created with these depth maps. We found that image similarity metrics that measure changes in geometry are more reliable than metrics that simply average pixel value differences, and more closely align with human judgements.
Overall, HIVE provides a way to create free-viewpoint videos from monocular video that is accessible to people with consumer-grade computer and camera hardware. We make the source code for HIVE publicly available at https://github.com/AnthonyDickson/HIVE.