[CVPR 2025] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

Computer Science - Artificial Intelligence - A backbone ConvNet for image classification / object detection / semantic segmentation

夏莉莉iy

夏莉莉iy · Published 2025-04-22 19:47:43

Paper: [2502.20087] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

Code: GitHub - LMMMEng/OverLoCK: [CVPR 2025 Oral] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

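The English here is typed entirely by hand! It summarizes and paraphrases the original paper. Spelling and grammar mistakes may be unavoidable; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read it with caution.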

2.4.1. Deep-stage Decomposition

2.4.2. Dynamic Convolution with Context-Mixing

2.4.3. Network Architecture

2.5. Experiments

2.5.1. Image Classification

2.5.2. Object Detection and Instance Segmentation

2.5.3. Semantic Segmentation

2.5.4. Ablation Studies

2.6. Conclusion

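1. Takeaways

(1) Hoping to catch some luck from this Oral

(2) A very standard way of writing a ConvNet paper; the model can be plugged in and run directly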

2. Section-by-section close reading of the paper

2.1. Abstract

①Challenge: ConvNets built on a feature pyramid (progressive downsampling) do not realize a top-down attention mechanism

2.2. Introduction

①Key property of the top-down attention mechanism: guidance from feedback signals

②Effective Receptive Fields (ERF) at stages 3 and 4

Other models fail to localize the object at stage 3 due to their dependence on the classification loss

③Performance chart of OverLoCK and other compared models:

biomimetic adj. mimicking biological systems; bionic

2.3. Related Work

①Covers classic ConvNets, dynamic convolutions, and biomimetic models

2.4. Methodology

2.4.1. Deep-stage Decomposition

①The overview of OverLoCK:

where the red lines are applied only in the pre-training stage

②Structures of each block:

where the feature map is $\mathbf{Z}_{i}\in\mathbb{R}^{C_{z}\times H\times W}$ , the context prior is $\mathbf{P}_{i}\in\mathbb{R}^{C_{p}\times H\times W}$ , and $\mathbf{Z}_{i+1}\in\mathbb{R}^{C_z\times H\times W}$ , $\mathbf{P}_{i}^{\prime}\in\mathbb{R}^{C_p\times H\times W}$ . To prevent the context prior from being diluted, the initial context prior $\mathbf{P}_{0}$ is added back: $\mathbf{P}_{i+1}=\alpha\mathbf{P}_{i}^{\prime}+\beta\mathbf{P}_{0}$ , where $\alpha$ and $\beta$ are learnable scalars
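The context prior update above is just a weighted sum, which can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `update_context_prior`, the tensor sizes, and the fixed values of `alpha` and `beta` (learnable in the real model) are all assumptions made for the example.

```python
import numpy as np

def update_context_prior(p_prime, p_init, alpha, beta):
    """Blend the block's updated prior with the initial context prior.

    Computes alpha * P'_i + beta * P_init, where P_init is the initial
    context prior that is re-injected to counter dilution.
    """
    return alpha * p_prime + beta * p_init

rng = np.random.default_rng(0)
C_p, H, W = 4, 8, 8                          # illustrative sizes, not from the paper
p_init = rng.standard_normal((C_p, H, W))    # initial context prior
p_prime = rng.standard_normal((C_p, H, W))   # prior produced by block i
alpha, beta = 1.0, 0.5                       # stand-ins for the learnable scalars

p_next = update_context_prior(p_prime, p_init, alpha, beta)
print(p_next.shape)
```

With `alpha = 0` and `beta = 1` the update returns the initial prior unchanged, which shows how `beta` controls how strongly the original overview signal is kept at every block.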

2.4.2. Dynamic Convolution with Context-Mixing

①The pipeline of ContMix:

where $\mathbf{Q}=\mathrm{Re}(\mathbf{W}_q\mathbf{X})\in\mathbb{R}^{C\times HW}$ , $\mathbf{K}=\mathrm{Re}(\mathbf{W}_{k}\mathrm{Pool}(\mathbf{X}))\in\mathbb{R}^{C\times S^2}$ , $\mathrm{Pool}$ pools $\mathbf{X}$ to an $S\times S$ grid, and $\mathrm{Re}$ denotes the reshape operator

②Evenly divide the channels of $\mathbf{Q}$ and $\mathbf{K}$ into $G$ groups, obtaining $\{\mathbf{Q^{g}}\}_{g=1}^{G}$ and $\{\mathbf{K^{g}}\}_{g=1}^{G}$ , where $\mathbf{Q^{g}}\in\mathbb{R}^{\frac{C}{G}\times HW}$ and $\mathbf{K^{g}}\in\mathbb{R}^{\frac{C}{G}\times S^{2}}$ . Compute the affinity matrices by:

$\{\mathbf{A^{g}}\}_{g=1}^{G}=\{(\mathbf{Q^{g}})^{\top}\mathbf{K^{g}}\}_{g=1}^{G}$

where $\mathbf{A^{g}}\in\mathbb{R}^{HW\times S^{2}}$

③Define a linear layer $\mathbf{W}_d\in\mathbb{R}^{S^2\times K^2}$ (here the scalar $K$ is the dynamic kernel size, not the matrix $\mathbf{K}$ ), and compute:

$\mathbf{D}^\mathbf{g}=\mathrm{softmax}(\mathbf{A}^\mathbf{g}\mathbf{W}_d)\in\mathbb{R}^{HW\times K^2}$
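To make the shapes in steps ① to ③ concrete, here is a hedged NumPy sketch of the kernel-generation path only (it does not apply the kernels to the features). Several details are assumptions made for illustration: the projections are modeled as plain $C\times C$ matrices, the pooling is taken to be average pooling with $H$ and $W$ divisible by $S$, and the names `adaptive_avg_pool` and `contmix_kernels` are hypothetical, not from the official repository.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_avg_pool(x, s):
    """Average-pool a (C, H, W) array to (C, s, s); assumes s divides H and W."""
    c, h, w = x.shape
    return x.reshape(c, s, h // s, s, w // s).mean(axis=(2, 4))

def contmix_kernels(x, w_q, w_k, w_d, groups, s):
    """Return grouped affinities A (G, HW, S^2) and kernels D (G, HW, K^2)."""
    c, h, w = x.shape
    q = w_q @ x.reshape(c, h * w)                        # Q: (C, HW)
    k = w_k @ adaptive_avg_pool(x, s).reshape(c, s * s)  # K: (C, S^2)
    q = q.reshape(groups, c // groups, h * w)            # split channels into G groups
    k = k.reshape(groups, c // groups, s * s)
    a = np.einsum("gcn,gcm->gnm", q, k)                  # A^g = (Q^g)^T K^g
    d = softmax(a @ w_d, axis=-1)                        # D^g = softmax(A^g W_d)
    return a, d

rng = np.random.default_rng(0)
C, H, W, S, K, G = 8, 6, 6, 3, 5, 2                      # illustrative sizes only
x = rng.standard_normal((C, H, W))
a, d = contmix_kernels(
    x,
    rng.standard_normal((C, C)),           # W_q (assumed C x C)
    rng.standard_normal((C, C)),           # W_k (assumed C x C)
    rng.standard_normal((S * S, K * K)),   # W_d
    groups=G,
    s=S,
)
print(a.shape, d.shape)
```

Each of the $HW$ positions ends up with its own $K^2$ weights per group, and the softmax makes those weights sum to one, so every position gets a normalized kernel derived from its affinity to the $S^2$ pooled regions.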

2.4.3. Network Architecture

①Variants of OverLoCK: Extreme-Tiny (XT), Tiny (T), Small (S), and Base (B), which differ in channel numbers, block counts, kernel sizes, and group numbers

2.5. Experiments

2.5.1. Image Classification

①Dataset: ImageNet-1k

②Optimizer: AdamW

③Stochastic depth rate: 0.1, 0.15, 0.4, and 0.5 for the OverLoCK-XT, -T, -S, and -B models, respectively
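For readers unfamiliar with the term, the following is a generic NumPy sketch of what a stochastic depth rate does to a residual branch during training. It is not taken from the OverLoCK code: the `drop_path` helper is hypothetical, and how the rate is distributed across blocks (constant or scheduled per block) is not stated in these notes, so a single rate is used here.

```python
import numpy as np

# Rates as listed above, keyed by model variant.
STOCHASTIC_DEPTH_RATE = {
    "OverLoCK-XT": 0.1,
    "OverLoCK-T": 0.15,
    "OverLoCK-S": 0.4,
    "OverLoCK-B": 0.5,
}

def drop_path(branch, rate, rng, training=True):
    """Zero the residual branch for a random subset of samples.

    Surviving samples are rescaled by 1 / (1 - rate) so the expected
    value of the branch is unchanged. At inference it is the identity.
    """
    if not training or rate == 0.0:
        return branch
    keep = 1.0 - rate
    # One Bernoulli draw per sample, broadcast over the remaining axes.
    mask_shape = (branch.shape[0],) + (1,) * (branch.ndim - 1)
    mask = rng.random(mask_shape) < keep
    return branch * mask / keep

rng = np.random.default_rng(0)
branch = np.ones((1000, 3))                 # toy residual-branch output
out = drop_path(branch, STOCHASTIC_DEPTH_RATE["OverLoCK-B"], rng)
print(np.unique(out))
```

Larger variants use higher rates, which is the usual pattern: bigger models overfit more easily on ImageNet-1k and benefit from stronger regularization.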

④Image classification performance:

where #F and #P denote the FLOPs and the number of parameters of a model, respectively. #T refers to the model type, where "C", "T", "M", and "H" refer to ConvNet, Transformer, Mamba, and hybrid models, respectively