VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis. However, the lack of high-level semantics in pre-training still heavily hinders the performance of downstream tasks. We observe that 3D medical images contain relatively consistent contextual position information, i.e., consistent geometric relations between different organs, which offers a potential way to learn consistent semantic representations in pre-training. In this paper, we propose a simple-yet-effective Volume Contrast (VoCo) framework that leverages these contextual position priors for pre-training. Specifically, we first generate a group of base crops from different regions while enforcing feature discrepancy among them, and employ them as class assignments for those regions. We then randomly crop sub-volumes and predict which class each sub-volume belongs to (i.e., which region it is located at) by contrasting its similarity to the different base crops, which can be seen as predicting the contextual positions of the sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual position priors into model representations without the guidance of annotations, enabling us to effectively improve the performance of downstream tasks that require high-level semantics. Extensive experimental results on six downstream tasks demonstrate the superior effectiveness of VoCo. Code will be available at https://github.com/Luffy03/VoCo.
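To make the pretext task concrete, below is a minimal PyTorch-style sketch of a VoCo-like objective, not the authors' released implementation. All names, tensor shapes, and loss choices here (cosine-similarity position prediction against soft overlap targets, plus a discrepancy term among base crops) are illustrative assumptions; the exact formulation is in the paper and repository.

```python
import torch
import torch.nn.functional as F

def voco_style_losses(base_feats, sub_feats, pos_targets):
    """Hypothetical VoCo-style pretext objective.

    base_feats:  (K, D) features of K base crops, one per region.
    sub_feats:   (M, D) features of M randomly cropped sub-volumes.
    pos_targets: (M, K) soft position labels, e.g., the overlap ratio of
                 each sub-volume with each base-crop region (rows sum to 1).
    """
    base = F.normalize(base_feats, dim=-1)
    sub = F.normalize(sub_feats, dim=-1)

    # Position prediction: similarity of each sub-volume to each base crop,
    # regressed toward the geometric overlap targets.
    sim = (sub @ base.t()).clamp(min=0)          # (M, K), non-negative cosine similarity
    loss_pred = F.mse_loss(sim, pos_targets)

    # Discrepancy regularization: push features of different base crops apart
    # so each base crop acts as a distinct "class" for its region.
    sim_bb = (base @ base.t()).clamp(min=0)      # (K, K)
    off_diag = sim_bb - torch.diag_embed(torch.diagonal(sim_bb))
    loss_reg = off_diag.mean()

    return loss_pred + loss_reg
```

In this sketch, a shared 3D encoder would produce both `base_feats` and `sub_feats`, and `pos_targets` would be computed from the crop coordinates alone, so no annotations are required.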
