FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba
Multimodal image fusion aims to integrate information from different imaging techniques to produce a comprehensive, detail-rich single image for downstream vision tasks. Existing methods based on local convolutional neural networks (CNNs) struggle to capture global features efficiently, while Transformer-based models excel at global modeling but are computationally expensive. Mamba addresses these limitations by leveraging selective structured state space models (S4) to effectively handle long-range dependencies while maintaining linear complexity. In this paper, we propose FusionMamba, a novel dynamic feature enhancement framework that aims to overcome the challenges faced by CNNs and Vision Transformers (ViTs) in computer vision tasks. The framework improves the visual state-space model Mamba by integrating dynamic convolution and channel attention mechanisms, which not only retains its powerful global feature modeling capability but also greatly reduces redundancy and enhances the expressiveness of local features. In addition, we develop a new module, the dynamic feature fusion module (DFFM). It combines the dynamic feature enhancement module (DFEM), used for texture enhancement and disparity perception, with the cross-modal fusion Mamba module (CMFM), which focuses on enhancing inter-modal correlation while suppressing redundant information. Experiments show that FusionMamba achieves state-of-the-art performance on a variety of multimodal image fusion tasks as well as downstream tasks, demonstrating its broad applicability and superiority.
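To make the described design more concrete, here is a minimal PyTorch sketch of how a DFFM-style block could be organized: each modality is first enhanced by a DFEM (dynamic convolution plus channel attention), and the two enhanced feature maps are then fused by a CMFM. The module names follow the abstract's terminology, but all internals below are assumptions for illustration only, not the authors' implementation: a depthwise convolution stands in for the dynamic convolution, a squeeze-and-excitation block for the channel attention, and a 1x1 convolution for the Mamba-based cross-modal mixing.

```python
# Hypothetical sketch of the DFFM pipeline described in the abstract.
# Internals are illustrative placeholders, not the paper's actual code.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed design)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Reweight channels by their globally pooled responses.
        return x * self.mlp(x)

class DFEM(nn.Module):
    """Dynamic feature enhancement: here a depthwise conv + channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.dw_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.attn = ChannelAttention(channels)

    def forward(self, x):
        # Residual enhancement of local texture features.
        return x + self.attn(self.dw_conv(x))

class CMFM(nn.Module):
    """Cross-modal fusion stand-in; the real CMFM would use a Mamba/SSM block."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, 1)  # placeholder mixing

    def forward(self, a, b):
        return self.mix(torch.cat([a, b], dim=1))

class DFFM(nn.Module):
    """Dynamic feature fusion: enhance each modality, then fuse cross-modally."""
    def __init__(self, channels):
        super().__init__()
        self.enhance_a = DFEM(channels)
        self.enhance_b = DFEM(channels)
        self.fuse = CMFM(channels)

    def forward(self, feat_a, feat_b):
        return self.fuse(self.enhance_a(feat_a), self.enhance_b(feat_b))

# Example: fuse infrared and visible feature maps of shape (B, C, H, W).
ir, vis = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
fused = DFFM(32)(ir, vis)
print(fused.shape)  # torch.Size([1, 32, 64, 64])
```

The separation mirrors the abstract's description: per-modality enhancement handles texture and disparity, while the fusion step models inter-modal correlation; swapping the 1x1 placeholder for a selective state-space block would recover the linear-complexity global modeling the paper attributes to Mamba.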
Further reading
- Access Paper in arXiv.org