Internet of Things (IoT) devices hold valuable but privacy-sensitive data in various modalities, calling for multimodal decentralized machine learning schemes. Although several multimodal federated learning (MFL) methods have been proposed, most of them overlook the system heterogeneity across IoT devices, which makes them poorly suited to real-world applications. Motivated by this gap, we conduct extensive experiments in real-world scenarios and find that stragglers caused by system heterogeneity are fatal to MFL, incurring catastrophic time overhead. We therefore propose a novel Multimodal Federated Learning with Accelerated Knowledge Distillation (MFL-AKD) framework, the first attempt to integrate knowledge distillation to combat stragglers in complex multimodal federated scenarios. Concretely, given a pretrained large-scale vision-language model deployed on the central server, we apply a quick knowledge transfer strategy on every client before its full batch update. Extensive evaluations on two video moment retrieval datasets and two text-image retrieval datasets demonstrate that the proposed framework achieves substantial time savings while maintaining high accuracy.
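To make the two-phase client procedure concrete, the following is a minimal PyTorch-style sketch of the distill-then-update round described above. All names (`client_update`, `kd_steps`, `temperature`) are illustrative assumptions, the standard temperature-scaled KL distillation loss stands in for whatever transfer objective the framework actually uses, and a classification-style cross-entropy loss stands in for the retrieval objectives; this is not the authors' implementation.

```python
# Hypothetical sketch of a distill-then-update client round, assuming a
# server-side pretrained teacher and a local student model. Not the
# authors' actual MFL-AKD implementation.
import torch
import torch.nn.functional as F

def client_update(student, teacher, loader, optimizer,
                  kd_steps=2, temperature=4.0, epochs=1):
    """One federated round on a (possibly slow) client: a few quick
    knowledge-distillation steps against the server's pretrained
    teacher, then the usual full local training."""
    teacher.eval()
    student.train()
    # Phase 1: quick knowledge transfer on a handful of mini-batches.
    for step, (x, y) in enumerate(loader):
        if step >= kd_steps:
            break
        with torch.no_grad():
            t_logits = teacher(x)          # teacher targets, no gradients
        s_logits = student(x)
        # Standard temperature-scaled KL distillation loss (assumed).
        kd_loss = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        kd_loss.backward()
        optimizer.step()
    # Phase 2: full local update on the task objective
    # (cross-entropy here is a placeholder for the retrieval losses).
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(student(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student.state_dict()
```

The intended effect of phase 1 is that even clients that cannot finish many local steps within the round deadline still absorb some teacher knowledge cheaply, which is one plausible way a distillation step could mitigate straggler-induced delay.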