Abstract
This thesis explores localization and tracking within Augmented Reality (AR) for sports spectating. Localization, which involves determining a device’s position and orientation, is crucial for accurately overlaying virtual content onto the real world. Existing AR localization and tracking methods often fail to provide accurate and reliable results in large-scale, dynamic environments such as sports stadiums, where moving crowds and shifting lighting conditions make camera localization difficult. These factors prevent precise alignment of virtual content with the real world, limiting the development of immersive and engaging AR experiences.
While many existing methods excel in small, static environments, they struggle in large-scale, dynamic settings. State-of-the-art techniques such as Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) require substantial camera translation to accurately reconstruct the scene and estimate the camera pose. In AR use cases, however, users often rotate in place rather than move broadly, causing traditional SLAM to fail at localization and tracking. Spherical Localization and Tracking (SPLAT) was proposed to address the shortcomings of conventional visual SLAM in large environments with rotational movement. In real-world scenarios, however, people’s movements are unpredictable, and SPLAT struggles when its assumption of purely rotational movement is not met.
These constraints limit the effectiveness of such methods in applications where translational motion is minimal relative to the environment’s size. Deep learning methods have gained popularity across all computer vision domains, but they typically rely on high-end GPU acceleration, which is generally unavailable in consumer-grade smartphones, leaving their potential for AR unclear.
To begin with, we investigated localization, a crucial component for establishing global context in many AR applications. We collected training datasets in large outdoor areas and test datasets under various dynamic conditions, and we developed a system for conducting on-site and off-site experiments, named Dynamic Large-Scale Localization & Tracking (DLSLAT).
We evaluated the performance of state-of-the-art deep learning localization techniques in large-scale environments, training our system on three large-scale datasets (a stadium, a clock tower, and a courtyard).
We conducted a comprehensive technical benchmark of five localization methods on dynamic datasets to assess each method’s performance under varying conditions. This thesis demonstrates the limitations of the deep learning methods Hierarchical Localization (H-Loc), 6D Camera Localization via 3D Surface Regression (DSAC++), and Accelerated Coordinate Encoding (ACE) in dynamic environments, while establishing the Expert Sample Consensus Applied to Camera Re-Localization (ESAC) method, trained with ten experts, as a reliable method for localization and tracking in large-scale dynamic settings.
Finally, to assess user perceptions and validate our experimental procedure, we performed a user study evaluating our DLSLAT system with the integrated ESAC localization method. Comparing the automatic ESAC approach to a manual baseline, we assessed three factors: presence, plausibility, and workload. Our evaluation demonstrates the effectiveness of the ESAC method for large-scale camera localization, achieving success rates of over 80% in most dynamic conditions, and provides insights for enhancing AR experiences.