[ICASSP 2025]BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision

Computer Science - Artificial Intelligence - Brain signal decoding: caption prediction and question answering

夏莉莉iy


夏莉莉iy · Published 2025-04-02 14:15:58

Paper link: BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision-Language Pretrained Models | IEEE Conference Publication | IEEE Xplore

Paper code: GitHub - HuangWanqiu/BrainChat-Code: BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

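The English here is typed entirely by hand! It summarizes and paraphrases the original paper. Spelling and grammar mistakes may be unavoidable; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read it with caution.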


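1. Thoughts

(1) Are you my destined paper? I can't conclude that yet; I strongly doubt whether the "key code" in the repo actually contains what I'm looking for.

2. Section-by-section close reading of the paper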

2.1. Abstract

①Task: decode semantic information from fMRI

2.2. Introduction

①Task of BrainChat:

②Integrated techniques: Contrastive Captioner (CoCa) and Masked Brain Modeling (MBM)

③Training mode: encoder/decoder pretraining and regression decoding

aphasia n. a language disorder; loss of the ability to speak or understand language

2.3. Method

①Framework of BrainChat:

②Pre-training stage: pretrain the encoder $f_\theta$ and the decoder with MBM. The fMRI data is split into patches, the masked patches are set to 0, and the model learns to reconstruct the masked data under a mean squared error (MSE) loss.
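
A minimal sketch of the masking step described above, in plain Python. The helper names (`patchify`, `random_mask`, `apply_mask`, `masked_mse`), the toy 16-value signal and the patch size of 4 are all hypothetical; the 0.75 mask ratio is the one listed in 2.4.1. Computing the MSE on masked patches only is an assumption borrowed from MAE-style training, not something these notes confirm.

```python
import random

def patchify(signal, patch_size):
    """Split a 1-D voxel sequence into consecutive patches of equal length."""
    if len(signal) % patch_size != 0:
        raise ValueError("signal length must be divisible by patch_size")
    return [signal[i:i + patch_size] for i in range(0, len(signal), patch_size)]

def random_mask(num_patches, mask_ratio, rng):
    """Return the sorted indices of the patches chosen for masking."""
    num_masked = int(round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))

def apply_mask(patches, masked_idx):
    """Set every value of a masked patch to 0; copy visible patches unchanged."""
    masked = set(masked_idx)
    return [[0.0] * len(p) if i in masked else list(p) for i, p in enumerate(patches)]

def masked_mse(pred_patches, target_patches, masked_idx):
    """MSE over masked patches only (assumption: MAE-style reconstruction loss)."""
    errors = [
        (pred - tgt) ** 2
        for i in masked_idx
        for pred, tgt in zip(pred_patches[i], target_patches[i])
    ]
    return sum(errors) / len(errors)

rng = random.Random(0)
signal = [float(v) for v in range(16)]           # toy stand-in for an fMRI voxel vector
patches = patchify(signal, patch_size=4)         # 4 patches of 4 values each
masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)
corrupted = apply_mask(patches, masked_idx)
# Stand-in reconstruction: reuse the zero-filled input, as if the decoder predicted 0 everywhere.
loss = masked_mse(corrupted, patches, masked_idx)
```

In a real run the encoder and decoder would replace the stand-in reconstruction, and the loss would shrink as they learn to fill in the masked patches.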

③The encoded fMRI data is projected (by $p_\theta$) to align with the image/text embeddings extracted by the frozen image encoder $g_\theta$ and text encoder $q_\theta$ from CoCa

④fMRI-image contrastive loss $L_{fi}$ , fMRI-text contrastive loss $L_{ft}$ :

$L_{fi}=-\Big(\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))^{T}g_{\boldsymbol{\theta}}(v_{i})/\sigma)}{\sum_{j=1}^{N}\exp(p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))^{T}g_{\boldsymbol{\theta}}(v_{j})/\sigma)}}_{\text{fMRI-to-image}}+\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(g_{\boldsymbol{\theta}}(v_{i})^{T}p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))/\sigma)}{\sum_{j=1}^{N}\exp(g_{\boldsymbol{\theta}}(v_{i})^{T}p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{j}))/\sigma)}}_{\text{image-to-fMRI}}\Big)$

$L_{ft}=-\Big(\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))^{T}q_{\boldsymbol{\theta}}(t_{i})/\sigma)}{\sum_{j=1}^{N}\exp(p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))^{T}q_{\boldsymbol{\theta}}(t_{j})/\sigma)}}_{\text{fMRI-to-text}}+\underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(q_{\boldsymbol{\theta}}(t_{i})^{T}p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{i}))/\sigma)}{\sum_{j=1}^{N}\exp(q_{\boldsymbol{\theta}}(t_{i})^{T}p_{\boldsymbol{\theta}}(f_{\boldsymbol{\theta}}(b_{j}))/\sigma)}}_{\text{text-to-fMRI}}\Big)$

where $(b_{i},v_{i},t_{i})$ denotes a paired fMRI-image-text sample, $p_{\theta}(f_{\theta}(b_{i}))$, $g_{\theta}(v_{i})$ and $q_{\theta}(t_{i})$ are the embeddings of the fMRI, image and text in the $i$ -th pair, $N$ is the batch size and $\sigma$ is the temperature parameter
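
The two contrastive losses share one shape, so here is a small plain-Python sketch of it, not the repo's code. `symmetric_contrastive_loss` and `log_softmax_diag` are hypothetical names; `brain_emb` stands for the projected fMRI embeddings $p_\theta(f_\theta(b_i))$ and `other_emb` for either the image embeddings $g_\theta(v_i)$ or the text embeddings $q_\theta(t_i)$. The toy 2-D embeddings and $\sigma=0.1$ are made up for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_softmax_diag(logits):
    """For each row i, return log(exp(logits[i][i]) / sum_j exp(logits[i][j]))."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[i] - log_denom)
    return out

def symmetric_contrastive_loss(brain_emb, other_emb, sigma):
    """Sum of the fMRI-to-other and other-to-fMRI terms, each averaged over the batch."""
    n = len(brain_emb)
    # sim[i][j] = brain_i . other_j / sigma
    sim = [[dot(b, o) / sigma for o in other_emb] for b in brain_emb]
    # Transpose so that row i compares other_i against every brain_j.
    sim_t = [[sim[j][i] for j in range(n)] for i in range(n)]
    brain_to_other = sum(log_softmax_diag(sim)) / n
    other_to_brain = sum(log_softmax_diag(sim_t)) / n
    return -(brain_to_other + other_to_brain)

brain = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # pair i matches pair i
swapped = [[0.0, 1.0], [1.0, 0.0]]      # every pair is mismatched
loss_aligned = symmetric_contrastive_loss(brain, aligned, sigma=0.1)
loss_swapped = symmetric_contrastive_loss(brain, swapped, sigma=0.1)
```

Matched pairs give a loss near 0, while mismatched pairs are penalized heavily, which is the behavior the alignment stage relies on.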

⑤Caption loss $L_{cap}$ :

$L_{cap}=-\sum_{k=1}^{T}\log P_{\theta}\left(t_{k}\mid t_{<k},b\right)$

where $P_{\theta}(t_{k}\mid t_{<k},b)$ represents the probability of generating the text token $t_k$ conditioned on the text from previous time steps $t_{<k}$ and the fMRI data $b$
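
As a worked example of the caption loss, a tiny sketch in plain Python. `caption_loss` is a hypothetical helper, and `step_probs[k]` stands for the distribution over the vocabulary that the model would output at step $k$ given the previous tokens and the fMRI data; the 3-word vocabulary and the probabilities are made up.

```python
import math

def caption_loss(step_probs, caption_ids):
    """Negative log-likelihood of the caption: -sum_k log P(t_k | t_<k, b)."""
    if len(step_probs) != len(caption_ids):
        raise ValueError("need exactly one distribution per caption token")
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, caption_ids))

# Two-token caption over a toy vocabulary of size 3.
caption_ids = [0, 2]
step_probs = [
    [0.50, 0.25, 0.25],   # step 1: the true token 0 gets probability 0.5
    [0.10, 0.10, 0.80],   # step 2: the true token 2 gets probability 0.8
]
loss = caption_loss(step_probs, caption_ids)   # equals -log(0.5 * 0.8)
```

The loss is just the summed per-token cross-entropy, so it drops to 0 only when every true token gets probability 1.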

⑥Total loss:

$L_{BrainChat}=\lambda_{fi}L_{fi}+\lambda_{ft}L_{ft}+\lambda_{cap}L_{cap}$

where $\lambda_{fi}$, $\lambda_{ft}$ and $\lambda_{cap}$ are the weights of the three losses
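
A quick arithmetic sketch of the weighted sum. The default weights follow the values listed in 2.4.1 (20 for the caption loss, 1 for the contrastive loss); giving both contrastive terms the weight 1 is my assumption, and the loss values passed in are made-up numbers. `brainchat_total_loss` is a hypothetical name.

```python
def brainchat_total_loss(l_fi, l_ft, l_cap, lam_fi=1.0, lam_ft=1.0, lam_cap=20.0):
    """Weighted sum of the two contrastive losses and the caption loss."""
    return lam_fi * l_fi + lam_ft * l_ft + lam_cap * l_cap

# Made-up loss values: 1 * 0.5 + 1 * 0.25 + 20 * 2.0
total = brainchat_total_loss(l_fi=0.5, l_ft=0.25, l_cap=2.0)
```

With these weights the caption term dominates the total, so the contrastive terms act more like an alignment regularizer.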

⑦fMRI captioning task: feed in the first $k$ words of the caption and predict the word at time step $k+1$
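
The word-by-word prediction can be sketched as a decoding loop. Everything here is hypothetical: `greedy_caption`, the `next_token_probs(prefix, brain_feature)` callback, the 5-word toy vocabulary and the toy model that ignores the fMRI feature. Greedy (argmax) decoding is my assumption; the notes do not say which decoding strategy the paper uses.

```python
def greedy_caption(next_token_probs, brain_feature, bos_id, eos_id, max_len):
    """Generate tokens one at a time, always picking the most probable next token."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens, brain_feature)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]  # drop the begin-of-sentence token

VOCAB = ["<bos>", "a", "dog", "runs", "<eos>"]

def toy_model(prefix, brain_feature):
    """Toy stand-in for the decoder: puts most mass on the token after the last one.

    A real model would condition on brain_feature; this toy ignores it.
    """
    target = min(prefix[-1] + 1, len(VOCAB) - 1)
    return [0.9 if i == target else 0.025 for i in range(len(VOCAB))]

caption_ids = greedy_caption(toy_model, brain_feature=None, bos_id=0, eos_id=4, max_len=10)
caption = " ".join(VOCAB[i] for i in caption_ids if i != 4)
```

The same loop would cover the question-answering setting if the initial prefix held the question tokens instead of only the begin-of-sentence token.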

⑧fQA task: the question is used as the input to the text encoder, e.g. "Question: What color is the water? Answer:"

⑨Caption generation (ignores image encoder and corresponding loss):

where green marks the generated captions and red marks grammatical errors

2.4. Experiment

2.4.1. Implementation

①Dataset 1: subject 1 of the Natural Scenes Dataset (NSD), with 15,724 voxels, and captions from COCO for fMRI captioning. The NSD and VQA datasets are combined for the fQA task

②Dataset 2: HCP, used to predict masked fMRI in the pre-training stage

③Encoder and decoder: ViT

④Mask ratio: 0.75

⑤Hyperparameters during pre-training: learning rate 5e-10 with weight decay 0.05; AdamW with $\beta_1=0.9$ and $\beta_2=0.95$; NativeScaler for gradient scaling

⑥Hyperparameters for brain decoder training: learning rate 1e-4 with weight decay 0.1

⑦Loss weights: 20 for the caption loss and 1 for the contrastive loss

2.4.2. fMRI Captioning Evaluation

①Captioning performance:

②Quantitative evaluation of fMRI captioning:

2.4.3. fQA Evaluation

①fQA performance:

②Results of fQA:

where green marks the answers

2.4.4. Adapting BrainChat without Image Data

①Performance is also OK without image data

2.5. Conclusion

3. Reference

@INPROCEEDINGS{10889434,
author={Huang, Wanqiu and Ma, Ke and Xie, Tingyu and Wang, Hongwei},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={BrainChat: Interactive Semantic Information Decoding from fMRI Using Large-Scale Vision-Language Pretrained Models},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Training;Semantics;Functional magnetic resonance imaging;Signal processing;Brain modeling;Question answering (information retrieval);Decoding;Data mining;Speech processing;Software development management;fMRI question answering;fMRI captioning;fMRI decoding;large-scale vision-language model;human-computer interaction},
doi={10.1109/ICASSP49660.2025.10889434}}
