Large multimodal model for open vocabulary semantic segmentation of remote sensing images
Blog Article
Conventional remote sensing image semantic segmentation requires training specialized models for specific categories of ground objects, and these models often fail to recognize object categories that were unseen during training. The generalization ability of the model is therefore key to achieving open vocabulary semantic segmentation of remote sensing images. Recently, large multimodal models pre-trained on massive amounts of image and text data have demonstrated strong generalization capabilities. Inspired by their success, we propose an open vocabulary segmentation method for remote sensing images. The proposed method combines the large multimodal model LLaVA with the large vision model SAM to achieve open vocabulary segmentation.
Specifically, LLaVA is used to understand the remote sensing image and the open vocabulary query, while SAM extracts visual features of the remote sensing image. The features from SAM and LLaVA are then fed into a mask decoder to complete the semantic segmentation task. To verify the effectiveness of the proposed method, we conducted extensive experiments on multiple ground object categories such as airplane, ship, river, and lake. The qualitative and quantitative evaluation results fully verify the effectiveness of our proposed method.
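To make the pipeline concrete, below is a minimal sketch of how SAM's dense visual features and LLaVA's vocabulary embedding might be fused in a mask decoder. The article does not specify the decoder's internals, so the fusion mechanism (element-wise conditioning of visual features on the projected text embedding), the module names, and all dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Hypothetical mask decoder: fuses SAM visual features with a
    LLaVA embedding of the open-vocabulary query and predicts a
    per-pixel mask for that category."""
    def __init__(self, vis_dim=256, txt_dim=4096, hidden=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden, kernel_size=1)  # project SAM feature map
        self.txt_proj = nn.Linear(txt_dim, hidden)                 # project LLaVA embedding
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)            # binary mask logits

    def forward(self, sam_feats, llava_emb):
        # sam_feats: (B, vis_dim, H, W) dense features from SAM's image encoder
        # llava_emb: (B, txt_dim) embedding of the open-vocabulary query
        v = self.vis_proj(sam_feats)                    # (B, hidden, H, W)
        t = self.txt_proj(llava_emb)[:, :, None, None]  # (B, hidden, 1, 1)
        fused = v * t                                   # condition vision on the query
        return self.head(fused)                         # (B, 1, H, W) mask logits

# Placeholder tensors stand in for the actual SAM and LLaVA outputs.
decoder = MaskDecoder()
sam_feats = torch.randn(1, 256, 64, 64)
llava_emb = torch.randn(1, 4096)
mask_logits = decoder(sam_feats, llava_emb)
print(mask_logits.shape)  # torch.Size([1, 1, 64, 64])
```

In this reading, the decoder stays lightweight because both foundation models are frozen feature extractors; only the projections and the mask head would need training on remote sensing data.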