Stop Memorizing Formulas! Build a Feedforward Network Step by Step in PyTorch and Tackle NLP Text Classification
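Implementing a PyTorch Feedforward Network from Scratch: A Practical Guide to NLP Text Classification
In natural language processing, text classification is one of the most basic and most useful tasks. Sentiment analysis of e-commerce reviews, automatic news categorization, and spam detection all come down to the same thing: teaching a machine to read a text and assign it to one of a set of predefined classes. Traditional approaches rely on hand-crafted feature engineering and statistical learning, whereas neural networks can learn useful features directly from the text. This article takes a purely practical route: you will build a feedforward neural network with PyTorch yourself and end up with a runnable solution, without wading through pages of formulas.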
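1. Environment setup and data loading
1.1 Setting up the PyTorch environment
Make sure Python 3.7 or later is installed, then install the dependencies with pip (flask and tensorboard are only needed for the deployment and monitoring sections later on):
pip install torch torchtext scikit-learn pandas flask tensorboard
For GPU acceleration you need a CUDA-enabled build of PyTorch. You can check whether a GPU is available as follows: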
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"GPU available: {torch.cuda.is_available()}")
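1.2 Loading the text dataset
We use the classic AG News classification dataset, whose training split contains 120,000 news texts in four classes: World, Sports, Business, and Sci/Tech. TorchText loads it in a couple of lines: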
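from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
train_iter, test_iter = AG_NEWS(root='./data', split=('train', 'test'))
# Materialize both splits as lists so they can be traversed more than once
# (some torchtext versions return one-shot iterators)
train_iter, test_iter = list(train_iter), list(test_iter)

# Peek at one sample
for label, text in train_iter:
    print(f"Label: {label}, Text: {text[:100]}...")
    break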
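2. Text preprocessing and vectorization
2.1 Building the vocabulary and the bag-of-words representation
A feedforward network needs fixed-length input, so we use a bag-of-words (BOW) representation: every text becomes a vector with one entry per vocabulary word.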
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # out-of-vocabulary words map to <unk>
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1  # map labels 1-4 to 0-3

# Example: convert a text into a binary bag-of-words vector (1 = word present)
def bow_transform(text, vocab_size=len(vocab)):
    # dtype=torch.long keeps the index tensor valid even for empty text
    token_ids = torch.tensor(text_pipeline(text), dtype=torch.long)
    return torch.zeros(vocab_size).scatter_(0, token_ids, 1.0)

sample_vector = bow_transform("Apple launches new product")
print(f"Vector shape: {sample_vector.shape}")
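The `scatter_` call above only records whether a word is present, so a word that appears three times counts the same as one that appears once. If term frequency matters for your data, a count-based variant is a small change. The sketch below is a minimal illustration: the five-word `toy_vocab`, the whitespace tokenization, and the helper name `bow_counts` are all assumptions made for the example and stand in for the real `vocab` and `tokenizer`.

```python
import torch

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
toy_vocab = {"<unk>": 0, "apple": 1, "launches": 2, "new": 3, "product": 4}

def bow_counts(text, vocab=toy_vocab):
    """Return a count-based bag-of-words vector for `text`."""
    # Whitespace tokenization stands in for the real tokenizer.
    token_ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    ids = torch.tensor(token_ids, dtype=torch.long)
    # bincount tallies how often each index occurs; minlength pads to vocab size.
    return torch.bincount(ids, minlength=len(vocab)).float()

# "new" occurs twice and "today" is unknown, so it lands in the <unk> slot.
vector = bow_counts("Apple launches new new product today")
print(vector)
```

Whether counts or binary indicators work better depends on the dataset, so it is worth trying both.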
2.2 Creating the data loaders
For efficient training we need batching and a data pipeline:
from torch.utils.data import DataLoader

def collate_batch(batch):
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(bow_transform(_text))
    # Labels are plain ints, so build the tensor with torch.tensor rather than torch.stack
    return torch.tensor(label_list, dtype=torch.long), torch.stack(text_list)

train_loader = DataLoader(list(train_iter), batch_size=64,
                          shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(list(test_iter), batch_size=64,
                         shuffle=False, collate_fn=collate_batch)
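3. Building the feedforward neural network
3.1 Defining the architecture
We create a fully connected network with three linear layers, two of which are hidden layers: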
import torch.nn as nn
import torch.nn.functional as F

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, hidden_dim1=256, hidden_dim2=128, num_classes=4):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)

model = TextClassifier(len(vocab))
print(model)
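Because the input is a bag-of-words vector, almost all of the model's parameters sit in the first layer. The arithmetic below makes that concrete. It is a rough sketch: `assumed_vocab_size` is a made-up number for illustration (use `len(vocab)` for the real figure), and the helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def classifier_params(vocab_size, hidden_dim1=256, hidden_dim2=128, num_classes=4):
    """Total trainable parameters of the three-layer classifier."""
    return (linear_params(vocab_size, hidden_dim1)
            + linear_params(hidden_dim1, hidden_dim2)
            + linear_params(hidden_dim2, num_classes))

# Assumed vocabulary size, for illustration only.
assumed_vocab_size = 50_000
total = classifier_params(assumed_vocab_size)
first_layer_share = linear_params(assumed_vocab_size, 256) / total
print(f"total parameters: {total:,}")
print(f"share held by the first layer: {first_layer_share:.1%}")
```

If memory becomes a problem, capping the vocabulary size or shrinking the first hidden layer is therefore the most effective lever.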
3.2 Training and validation
The code below implements a complete training loop with accuracy tracking and validation:
from torch.optim import Adam
from sklearn.metrics import accuracy_score

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

def train_epoch(model, loader, optimizer, criterion):
    model.train()  # enable dropout
    total_loss, total_acc = 0, 0
    for labels, texts in loader:
        texts, labels = texts.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        total_acc += accuracy_score(labels.cpu(),
                                    outputs.argmax(dim=1).cpu())
    # Averages over batches
    return total_loss/len(loader), total_acc/len(loader)

def evaluate(model, loader, criterion):
    model.eval()  # disable dropout
    total_loss, total_acc = 0, 0
    with torch.no_grad():
        for labels, texts in loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            total_acc += accuracy_score(labels.cpu(),
                                        outputs.argmax(dim=1).cpu())
    return total_loss/len(loader), total_acc/len(loader)

# Train for 5 epochs (the test split doubles as the validation set here)
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_acc = evaluate(model, test_loader, criterion)
    print(f"Epoch {epoch+1}: Train Loss {train_loss:.4f} Acc {train_acc:.4f} | Val Loss {val_loss:.4f} Acc {val_acc:.4f}")
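One detail worth knowing: `train_epoch` and `evaluate` average the per-batch accuracies, which gives a short final batch the same weight as a full one. On a large dataset the difference is tiny, but the sketch below shows how the two ways of averaging can diverge. The `(correct, batch_size)` pairs are made up for illustration, and both function names are hypothetical.

```python
def batch_averaged_accuracy(batches):
    """Mean of per-batch accuracies (what the loops above compute)."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Correct predictions divided by the total number of samples."""
    total_correct = sum(correct for correct, _ in batches)
    total_samples = sum(size for _, size in batches)
    return total_correct / total_samples

# Made-up (correct, batch_size) pairs: two full batches and a short last one.
batches = [(60, 64), (58, 64), (1, 2)]
print(f"batch-averaged:  {batch_averaged_accuracy(batches):.4f}")
print(f"sample-weighted: {sample_weighted_accuracy(batches):.4f}")
```

To report the sample-weighted figure in the real loops, accumulate the number of correct predictions and the number of samples instead of per-batch scores.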
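4. Model tuning and debugging tips
4.1 Hyperparameter tuning
Use experiments to find a good configuration. The table gives search ranges and reasonable starting values:
| Parameter | Search range | Recommended | Effect |
|---|---|---|---|
| Hidden layer dimension | [64, 512] | 256 | Too small underfits, too large overfits |
| Learning rate | [1e-4, 1e-2] | 5e-3 | Affects convergence speed and stability |
| Dropout rate | [0.2, 0.7] | 0.5 | Helps reduce overfitting |
| Batch size | [32, 256] | 64 | Trades memory use against gradient stability |
A learning rate scheduler can further improve training: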
from torch.optim.lr_scheduler import ReduceLROnPlateau
scheduler = ReduceLROnPlateau(optimizer, 'max', patience=2, factor=0.5)
# Add inside the training loop, after evaluation
scheduler.step(val_acc)
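To see what this scheduler does, it helps to feed it a sequence of metrics by hand. The sketch below is a self-contained illustration, not part of the training code: the accuracies in `fake_val_acc` are invented, and a single dummy parameter replaces the real model. With `patience=2`, the learning rate is halved once the metric has failed to improve for more than two consecutive checks.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# A single dummy parameter is enough to drive the optimizer and scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, mode='max', patience=2, factor=0.5)

# Invented validation accuracies: one improvement, then a plateau.
fake_val_acc = [0.80, 0.85, 0.85, 0.85, 0.85]
lr_history = []
for acc in fake_val_acc:
    scheduler.step(acc)  # 'max' mode: a higher metric counts as better
    lr_history.append(optimizer.param_groups[0]['lr'])

print(lr_history)
```

The learning rate stays at 0.001 through the first four checks and drops to 0.0005 on the fifth, the third check in a row without improvement.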
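4.2 Troubleshooting common problems
When training misbehaves, the following checks are a good starting point:
- Loss does not decrease:
  - Check that the data preprocessing is correct
  - Try increasing the learning rate (or lowering it if the loss oscillates or diverges)
  - Verify that gradients are being computed (print([p.grad for p in model.parameters()]))
- Overfitting:
  - Increase the dropout rate
  - Add L2 regularization
  - Collect more training data
- Out of GPU memory:
  - Reduce the batch size
  - Use gradient accumulation
  - Simplify the model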
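Gradient accumulation, listed above as a remedy for GPU memory limits, runs several small batches before each optimizer step so that the update matches what a larger batch would have produced. The following is a minimal sketch under stated assumptions: a tiny `nn.Linear` and random tensors stand in for the classifier and the real data, and `accum_steps` is an illustrative name. It also compares the accumulated gradient with the gradient of one full batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)                  # tiny stand-in for the classifier
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(16, 8)              # synthetic features
targets = torch.randint(0, 4, (16,))     # synthetic labels

# Reference: gradient from one full batch of 16 samples.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: four micro-batches of 4 samples, one optimizer step at the end.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 4
optimizer.zero_grad()
for x, y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                              # gradients add up in .grad
accumulated_grad = model.weight.grad.clone()
optimizer.step()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Dividing each loss by `accum_steps` is what keeps the accumulated gradient equal to the full-batch average. For L2 regularization, the usual route in PyTorch is the `weight_decay` argument of the optimizer.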
Add TensorBoard logging to monitor training:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
# Add inside the training loop
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Accuracy/train', train_acc, epoch)
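5. Deployment and application examples
5.1 Saving and loading the model
After training, save the model weights together with the vocabulary and the configuration: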
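torch.save({
    'model_state_dict': model.state_dict(),
    'vocab': vocab,
    'config': {
        'hidden_dim1': 256,
        'hidden_dim2': 128
    }
}, 'text_classifier.pth')

# Load the model. The checkpoint pickles the vocab object, so on PyTorch
# versions where torch.load defaults to weights_only=True you need to pass
# weights_only=False (only for files you trust).
checkpoint = torch.load('text_classifier.pth', map_location=device)
loaded_model = TextClassifier(
    len(checkpoint['vocab']),
    checkpoint['config']['hidden_dim1'],
    checkpoint['config']['hidden_dim2']
)
loaded_model.load_state_dict(checkpoint['model_state_dict'])
loaded_model.to(device)
loaded_model.eval()  # switch off dropout for inference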
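Pickling the whole vocabulary object ties the checkpoint to the library version that created it. One alternative, sketched below, is to store the vocabulary as a plain list of tokens (torchtext's `Vocab` exposes such a list through `get_itos()`) and rebuild the token-to-index mapping after loading. Everything here is an assumption made for illustration: the tiny `nn.Linear` stands in for `TextClassifier`, `itos` is a made-up token list, and an in-memory buffer replaces the file on disk.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(5, 4)  # tiny stand-in for TextClassifier
itos = ["<unk>", "apple", "launches", "new", "product"]  # made-up token list

buffer = io.BytesIO()  # in-memory file, used here instead of a path on disk
torch.save({
    "model_state_dict": model.state_dict(),
    "itos": itos,  # index -> token, stored as a plain list of strings
    "config": {"in_features": 5, "num_classes": 4},
}, buffer)

buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")
restored = nn.Linear(checkpoint["config"]["in_features"],
                     checkpoint["config"]["num_classes"])
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()

# Rebuild the token -> index lookup from the stored list.
stoi = {token: index for index, token in enumerate(checkpoint["itos"])}
```

Because the checkpoint then contains only tensors and built-in Python types, it does not depend on any custom class being importable at load time.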
5.2 Building a prediction API
A simple Flask application can serve the classifier:
from flask import Flask, request, jsonify

app = Flask(__name__)
model.eval()  # disable dropout so predictions are deterministic

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json['text']
    vector = bow_transform(text).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model(vector)
        prob = F.softmax(output, dim=1)
    return jsonify({
        'class': int(output.argmax()),
        'confidence': float(prob.max()),
        'probabilities': prob.squeeze().tolist()
    })

if __name__ == '__main__':
    # 0.0.0.0 listens on all network interfaces
    app.run(host='0.0.0.0', port=5000)
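The endpoint returns a numeric class id, which callers then have to map back to a category. A small helper can attach the readable label to the response. The sketch below is one possible shape for it: `format_prediction` and `CLASS_NAMES` are hypothetical names, the class order assumes that labels 0 to 3 follow the World, Sports, Business, Sci/Tech order given earlier, and the logits are hand-written rather than model output.

```python
import torch
import torch.nn.functional as F

# Assumed class order, matching the 0-3 labels produced by label_pipeline.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def format_prediction(logits):
    """Turn a (1, num_classes) logit tensor into a JSON-friendly dict."""
    probs = F.softmax(logits, dim=1).squeeze(0)
    class_id = int(probs.argmax())
    return {
        "class": class_id,
        "label": CLASS_NAMES[class_id],
        "confidence": float(probs[class_id]),
        "probabilities": probs.tolist(),
    }

# Hand-written logits in place of real model output.
result = format_prediction(torch.tensor([[0.1, 2.0, -1.0, 0.3]]))
print(result["label"], round(result["confidence"], 3))
```

Inside the Flask handler, the dictionary returned by such a helper could be passed straight to `jsonify`.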
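5.3 Extensions for real-world use
When applying the model to a real business scenario, consider the following directions:
- Dynamic vocabulary updates: periodically extend the vocabulary with new data
- Ensembling: combine several models for more robust predictions
- Active learning: have humans label the samples the model is least certain about
- Domain adaptation: fine-tune the model to transfer it to a specific domain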
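# Incremental training example
def incremental_train(new_data_iter, epochs=1):
    # build_vocab_from_iterator cannot extend an existing vocab, and a larger
    # vocabulary would change the input size of fc1. This version keeps the
    # current vocab, so unseen words fall back to <unk>.
    new_data = list(new_data_iter)  # materialize so every epoch sees the data
    loader = DataLoader(new_data, batch_size=32, collate_fn=collate_batch)
    optimizer = Adam(model.parameters(), lr=1e-4)  # small LR for fine-tuning
    for epoch in range(epochs):
        train_epoch(model, loader, optimizer, criterion)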
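If the vocabulary really does grow, the first linear layer has to grow with it, because its input size equals the vocabulary size. One way to do that without discarding what the model has learned is to copy the old weights into a wider layer. The sketch below illustrates the idea with toy sizes. `expand_input_layer` is a hypothetical helper, and it assumes new tokens are appended after the existing indices so that old token ids remain valid.

```python
import torch
import torch.nn as nn

def expand_input_layer(old_layer, new_in_features):
    """Return a wider nn.Linear that reuses the weights of `old_layer`."""
    assert new_in_features >= old_layer.in_features
    new_layer = nn.Linear(new_in_features, old_layer.out_features)
    with torch.no_grad():
        new_layer.weight.zero_()  # columns for new words start at zero
        new_layer.weight[:, :old_layer.in_features] = old_layer.weight
        new_layer.bias.copy_(old_layer.bias)
    return new_layer

torch.manual_seed(0)
old_fc1 = nn.Linear(5, 3)                 # toy sizes: 5 known words, 3 hidden units
new_fc1 = expand_input_layer(old_fc1, 8)  # vocabulary grew by 3 words

old_features = torch.rand(2, 5)
new_features = torch.cat([old_features, torch.rand(2, 3)], dim=1)
# New-word columns are zero, so the output matches the old layer until training resumes.
print(torch.allclose(old_fc1(old_features), new_fc1(new_features), atol=1e-6))
```

After swapping in the wider layer (for example `model.fc1 = new_fc1`), create a fresh optimizer so that it tracks the new parameters.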