Abstract
Echocardiograms provide noninvasive, real-time data for assessing the structure and function of the heart and can assist in diagnosing several conditions. However, they are highly operator-dependent, often yield incomplete or suboptimal views, and can be challenging to interpret. In contrast, cardiac magnetic resonance (CMR) delivers comprehensive and detailed evaluations but remains time-consuming and costly. To address these limitations, this study investigates cross-modal generative modeling for synthesizing CMR sequences directly from 2D transthoracic echocardiography (TTE) with temporal information. We propose a novel model that combines an autoencoder (AE) backbone for feature extraction with a vision transformer (ViT) that captures global temporal and spatial dependencies, enabling the prediction of CMR sequences with preserved dynamics. The architecture is compared with alternative generative models in terms of quantitative accuracy. Experimental results show that the proposed ViT-AE model with 12 layers achieves the best performance, with a mean absolute error (MAE) of 0.08, a structural similarity index (SSIM) of 0.67, and a peak signal-to-noise ratio (PSNR) of 18.45 dB.
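
As a reading aid for the reported metrics, the following is a minimal sketch of how MAE and PSNR can be computed between a predicted and a reference frame. It is not the evaluation code used in this study: the function names `mae` and `psnr`, the flattened-frame input, and the `data_range` default of 1.0 (intensities normalized to [0, 1]) are all assumptions made for illustration. SSIM is omitted because it requires windowed local statistics.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two flattened frames of equal length."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float],
         data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB.

    data_range is the assumed span of valid intensities
    (1.0 here, i.e. frames normalized to [0, 1]; an assumption).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(data_range ** 2 / mse)
```

For a generated sequence, one plausible convention is to average these per-frame values over all frames; whether the study averages per frame or per volume is not stated here.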