论文网址:Exploitation of a Latent Mechanism in Graph Contrastive Learning: Representation Scattering | OpenReview

论文代码:https://github.com/hedongxiao-tju/SGRL

目录

1. 心得

2. 论文逐段精读

2.1. Abstract

2.2. Introduction

2.3. Preliminary

2.4. Representation Scattering in GCL

2.4.1. DGI-like methods

2.4.2. InfoNCE-based methods

2.4.3. BGRL-like methods

2.5. Methodology

2.5.1. Representation Scattering Mechanism (RSM)

2.5.2. Topology-based Constraint Mechanism(TCM)

2.6. Experiment

2.6.1. Performance Analysis

2.6.2. Model Analysis

2.7. Conclusion 

3. Reference


1. 心得

(1)额~不太是我的方向

2. 论文逐段精读

2.1. Abstract

        ①They proposed Scattering Graph Representation Learning (SGRL)

2.2. Introduction

        ①Representation scattering is the same crutial factor in node discrimination, group discrimination, bootstrapping schemes

        ②Existing works do not fully utilize the mechanism of representation scattering, some of them just employ batch norm

2.3. Preliminary

        ①Graph setting: \mathcal{G}=(\mathcal{V},\mathcal{E}), where \mathcal{V}=\{v_{1},v_{2},\ldots,v_{N}\} is node set, \varepsilon\subseteq\mathcal{V}\times\mathcal{V} is edge set

        ②Feature matrix: \mathbf{X}\in\mathbb{R}^{N\times D}N denotes the number of nodes, D is the feature dimension

        ③Adjacency matrix: binarized \mathbf{A}\in\mathbb{R}^{N\times N}

        ④Degree matrix: \mathbf{D}=\mathrm{diag}(d_{1},d_{2},\ldots,d_{N}), each element d_{i}=\sum_{i\in\mathcal{V}}\mathbf{A}_{ij}

        ⑤Degree normalized adjacency matrix: \mathbf{A}_{\mathrm{sym}}=\mathbf{D}^{-1/2}(\mathbf{A}+\mathbf{I})\mathbf{D}^{-1/2}

2.4. Representation Scattering in GCL

        ①Definition of representation scattering: in a d-dimensional embedding space \mathbb{R}^d comprising n vectors organized into a matrix \mathbf{V}\in\mathbb{R}^{n\times d}, consider a subspace \mathbb{S}^{k}(1\leq k\leq d) of \mathbb{R}^{d} and a scatter center \mathbf{c}. there are 2 constraint:

Center-Away Constraint: Node representations are encouraged to be distant from the scattered center \mathbf{c}
Uniformity Constraint: Node representations are uniformly distributed over the subspace \mathbb{S}^{k}

2.4.1. DGI-like methods

        ①DGI-like methods employ mutual information discriminator \mathcal{D}\left ( \cdot \right ) to maximizes the mutual information between nodes and their source graphs to train the model

        ②Assumption: a) for normalized propagation matrix \mathbf{\tilde{A}}=\mathbf{D}^{-1}\mathbf{\hat{A}} where \mathbf{\hat{A}}=\mathbf{​{A}}+\mathbf{​{I}}, b) DGI generates the corrupted graph by randomly shuffling the entities in the feature matrix \mathbf{X} and keeping adjacency matrix \mathbf{A} unchanged, c) the orginal data are class-balanced (the numbers are SAME)

        ③Theorem: At the node level, minimizing the DGI loss is equivalent to maximizing the Jensen Shannon (JS) divergence between the local semantic distribution in the original graph \mathcal{G} and its average distribution (corrupted grpah \tilde{\mathcal{G}}), i.e.

Min\left ( \mathcal{L}_{DGI}\right )\Leftrightarrow Max\left ( JS\left ( \mathcal{G}|| \tilde{\mathcal{G}}\right ) \right )

        ④Single layer GNN:

\mathbf{H}=\mathrm{GNN}(\mathbf{A},\mathbf{X})=\tilde{\mathbf{A}}\mathbf{X}\mathbf{W}^{\prime},\mathbf{h}_{i}=\sum_{j\in\mathcal{N}_{i}}\alpha_{ij}\mathbf{x}_{j}\cdot\mathbf{W}^{\prime}=\sum_{j\in\mathcal{N}_{i}}\frac{1}{d_{i}}\mathbf{x}_{j}\cdot\mathbf{W}^{\prime}\\ \forall j\in\mathcal{N}_{i},\mathbf{x}_{j}\sim p_{i}(\mu_{i},\sigma_{i}^{2})

where \mathbf{W}^{\prime}=\xi(\mathbf{W})\mathbf{W}\in\mathbb{R}^{D\times K} with {​{\xi}} is activation function, \mathcal{N}_{i} is first order neighbors of node v_i

        ⑤Mean of aggregated representation \mathbf{h}_i:

\mathbb{E}[\mathbf{h}_i]=\mathbb{E}\left[\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\mathbf{x}_j\cdot\mathbf{W}^{\prime}\right]=\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\mathbb{E}[\mathbf{x}_j]\cdot\mathbf{W}^{\prime}=\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\mu_i\cdot\mathbf{W}^{\prime}

        ⑥Variance of aggregated representation \mathbf{h}_i:

\mathrm{Var}(\mathbf{h}_{i})=\mathrm{Var}\left(\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\mathbf{x}_{j}\cdot\mathbf{W}^{\prime}\right)=\sum_{j\in\mathcal{N}(i)}\alpha_{ij}\mathrm{Var}(\mathbf{x}_{j})\cdot\mathbf{W}^{\prime2}=\sigma_{i}^{2}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{2}\cdot\mathbf{W}^{\prime2}

        ⑦The mean and variance of representation \mathbf{h}_i's distribution: \mu_i\cdot \mathbf{W}^{\prime} / \sigma^{2}\sum_{j\in\mathcal{N}(i)}\alpha_{ij}^{2}\cdot\mathbf{W}^{\prime2}\mathbf{h}_{i}\sim p_{i}^{\prime}(\mu_{i}\cdot\mathbf{W}^{\prime},\sigma_{i}^{2}\cdot\mathbf{W}^{\prime2})

        ⑧Randomly extract feature vectors \tilde{\mathbf{x}}_i from original graph \mathbf{X} with distribution p_{\mathrm{data}}(\mu,\sigma^{2})\mathbf{\hat{h}}_{i}\sim p_{\mathrm{data}}^{\prime}(\mu\cdot\mathbf{W}^{\prime},\sigma^{2}\cdot\mathbf{W}^{\prime2})

        ⑨Corollary: the mean of original graph as center \mathbf{c}, original representation space as subspace \mathbb{S}^{k}

        ⑩t-SNE embedding of DGI on Co.CS dataset:

where blue and red points are negative and positive samples

Corollary  n. 推论;必然的结果

2.4.2. InfoNCE-based methods

        ①Theorem: define \mathbf{\bar{h}}=1/n\sum_{i=1}^{n}\mathbf{h}_{i}sim\left ( \cdot \right ) be the cosine similarity function. For node v_i, the lower bound of InfoNCE loss is:

\mathcal{L}_{InfoNCE}(\mathbf{h}_{i})\geq sim(\mathbf{h}_{i},\mathbf{\bar{h}})+\ln(2n)

however, it's inefficient due to calculating of every representation pair

2.4.3. BGRL-like methods

        ①Batch normalization for a feature vector \mathbf{x}_i:

\mathrm{BN}(\mathbf{x}_{i})=\gamma(\mathbf{x}_{i}-\mu)/\sqrt{\sigma^{2}+\epsilon}+\beta

where \mu and \sigma ^2 are mean and variance, \gamma and \beta are learnable scale and shift parameters, \epsilon is a small constant

        ②The impact of Batch Normalization in BGRL:

2.5. Methodology

        ①The overall framework of SGRL:

2.5.1. Representation Scattering Mechanism (RSM)

        ①Apply \text{Trans}\left ( \cdot \right ) to transform \mathbb{R}^d to \mathbb{S}^k with L2 norm each row:

\tilde{\mathbf{h}}_{i}=\mathrm{Trans}_{\mathbb{R}^{d}\to\mathbb{S}^{k}}(\mathbf{h}_{i})=\frac{\mathbf{h}_{i}}{\mathrm{Max}(\|\mathbf{h}_{i}\|_{2},\varepsilon)},\quad\mathbb{S}^{k}=\{\tilde{\mathbf{h}}_{i}:\|\tilde{\mathbf{h}}_{i}\|_{2}=1\}

where \mathbf{\mathbf{}h}_i generated by target encoder, \|\tilde{\mathbf{h}}_{i}\|_{2}=(\sum_{j=1}^{k}\tilde{\mathbf{h}}_{ij}^{2})^{\frac{1}{2}}\epsilon is a small vale to avoid division by zero

        ②Scattering center:

\mathcal{L}_{\text{scattering}}=-\frac{1}{n}\sum_{i=1}^{n}\|\tilde{\mathbf{h}}_{i}-\mathbf{c}\|_{2}^{2},\quad\mathbf{c}=\frac{1}{n}\sum_{i=1}^{n}\tilde{\mathbf{h}}_{i}

2.5.2. Topology-based Constraint Mechanism(TCM)

        ①For connected \forall v_{i},v_{j}\in V,\mathbf{h}_{i},\mathbf{h}_{j}\in\mathbb{S}^{k}, there is a threshold d:\left \| \mathbf{h}_i-\mathbf{h}_j \right \|^2_2< d,\, \text{if}\left ( i,j \right )< \mathcal{V}, and \|\mathbf{h}_{i}-\mathbf{h}_{j}\|_{2}^{2}>d,\mathrm{if}(i,j)\notin\mathcal{V}

        ②Topology connection:

\mathbf{H_{target}^{topology}}=\mathbf{AH_{target}}

        ③Topology-based Constraint Mechanism (TCM)

\mathbf{H_{online}^{topology}}=\mathbf{\hat{A}^{k}H_{online}}+\mathbf{H_{online}}

where \text{ k} denotes the order of neighbors

        ④Prediction:

\mathbf{Z_{online}}=q_{\theta}(\mathbf{H_{online}^{topology}})

        ⑤Aim: align \mathbf{Z_{online}} with \mathbf{H_{target}}

        ⑥Alignment loss:

\mathcal{L}_{\mathrm{alignment}}=-\frac{1}{N}\sum_{i=1}^{N}\frac{\mathbf{Z}_{(\mathrm{online},i)}^{\top}\mathbf{H}_{(\mathrm{target},i)}}{\|\mathbf{Z}_{(\mathrm{online},i)}\|\|\mathbf{H}_{(\mathrm{target},i)}\|}

        ⑦Employ Exponential Moving Average at the end of each training epoch:

\phi\leftarrow\tau\phi+(1-\tau)\theta

where \tau \in \left [ 0,1 \right ] is target decay rate

2.6. Experiment

        ①5 datasets: AmazonPhoto (Photo) and Amazon-Computers (Computers), WikiCS, Coauthor-CS (Co.CS), Coauthor-Physics (Co.Physics)

        ②⭐Data split: 10% for training and 90% for testing

        ③Running times: 20 

2.6.1. Performance Analysis

        ①Node classification performance table:

        ②t-SNE visualization on CS dataset:

        ③Performance on Clustering in terms of NMI and homogeneity. Optimal results are shown in bold and suboptimal results are underlined:

2.6.2. Model Analysis

         ①The ablation of \text{k}:

        ②Module ablation:

2.7. Conclusion 

        ~

3. Reference

He, D. et al. (2024) 'Exploitation of a Latent Mechanism in Graph Contrastive Learning: Representation Scattering', NeurIPS.

Logo

脑启社区是一个专注类脑智能领域的开发者社区。欢迎加入社区,共建类脑智能生态。社区为开发者提供了丰富的开源类脑工具软件、类脑算法模型及数据集、类脑知识库、类脑技术培训课程以及类脑应用案例等资源。

更多推荐