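CIFAR-10 Deep Learning Experiment Report - GPU-Accelerated Training

Experiment Overview

Item	Details
Experiment name	GPU-accelerated training of a CIFAR-10 image classifier
Experiment date	2026-03-27
Target dataset	CIFAR-10 (10 classes of color images)
Training device	NVIDIA GeForce RTX 5090 Laptop GPU
CUDA version	12.8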

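Environment Setup

Hardware Environment

Component	Configuration
GPU	NVIDIA GeForce RTX 5090 Laptop GPU
VRAM	24 GB
Driver version	595.79
CUDA version (as reported by the driver)	13.2

Software Environment

Component	Version
Python	3.13
PyTorch	2.11.0+cu128
torchvision	0.26.0+cu128
CUDA (PyTorch build)	12.8
Operating system	Windows 11 Pro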

Installing the GPU Build of PyTorch

 # Uninstall the CPU-only build
 pip uninstall torch torchvision -y
 
 # Install the GPU build (CUDA 12.8)
 pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

Verifying GPU Availability

 import torch
 print(f"CUDA available: {torch.cuda.is_available()}")  # True
 print(f"CUDA version: {torch.version.cuda}")           # 12.8
 print(f"GPU name: {torch.cuda.get_device_name(0)}")    # NVIDIA GeForce RTX 5090 Laptop GPU

Dataset Information

Property	Value
Dataset	CIFAR-10
Training set	50,000 samples
Test set	10,000 samples
Image size	32x32 color images
Number of classes	10

CIFAR-10 Classes

Index	Class
0	airplane
1	automobile
2	bird
3	cat
4	deer
5	dog
6	frog
7	horse
8	ship
9	truck

Data Preprocessing

 from torchvision import transforms
 
 # Training set: data augmentation + normalization
 train_transform = transforms.Compose([
     transforms.RandomCrop(32, padding=4),    # random 32x32 crop after 4-pixel padding
     transforms.RandomHorizontalFlip(),       # random horizontal flip
     transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
 ])
 
 # Test set: normalization only
 test_transform = transforms.Compose([
     transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
 ])
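To make the Normalize step concrete, here is a small pure-Python sketch of the per-channel arithmetic, (x - mean) / std, applied to one pixel after it has been scaled to [0, 1] as ToTensor does. The helper name normalize_pixel is made up for this illustration; torchvision applies the same formula to whole tensors.

```python
# Per-channel statistics used in the report's transforms.Normalize call.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)


def normalize_pixel(rgb_uint8, mean=CIFAR10_MEAN, std=CIFAR10_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [v / 255.0 for v in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]


if __name__ == "__main__":
    print(normalize_pixel((0, 0, 0)))        # darkest pixel: all channels negative
    print(normalize_pixel((255, 255, 255)))  # brightest pixel: all channels positive
```

After this step each channel is roughly zero-centered with unit variance over the training set, which is the usual motivation for normalizing inputs.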

GPU Device Selection Logic

Automatic Device Detection Code

 def get_device():
     """
     Detect and select the best available compute device.
     Priority: CUDA > MPS > CPU
     """
     if torch.cuda.is_available():
         device = torch.device("cuda")
         gpu_name = torch.cuda.get_device_name(0)
         gpu_count = torch.cuda.device_count()
         print(f"[GPU] Using CUDA device: {gpu_name}")
         print(f"[GPU] GPU count: {gpu_count}")
         print(f"[GPU] CUDA version: {torch.version.cuda}")
         # Report GPU memory usage in MB
         mem_allocated = torch.cuda.memory_allocated(0) / 1024**2
         mem_reserved = torch.cuda.memory_reserved(0) / 1024**2
         print(f"[GPU] Allocated memory: {mem_allocated:.2f} MB, reserved: {mem_reserved:.2f} MB")
     elif torch.backends.mps.is_available():
         device = torch.device("mps")
         print("[GPU] Using Apple MPS device")
     else:
         device = torch.device("cpu")
         print("[CPU] Using CPU device")
     return device

Device Selection Flow

 Start
   │
   ├─ CUDA available? ──yes──> use CUDA GPU
   │
   ├─ no
   │   └─ MPS available? ──yes──> use Apple MPS
   │
   └─ no
       └─ use CPU

Model Architecture

CIFAR10Net Network Structure

 CIFAR10Net(
   # First conv block: 32x32 -> 16x16
   (conv1): Conv2d(3, 32, 3x3, padding=1) + BatchNorm + ReLU
   (conv2): Conv2d(32, 32, 3x3, padding=1) + BatchNorm + ReLU
   (pool1): MaxPool2d(2x2)
 
   # Second conv block: 16x16 -> 8x8
   (conv3): Conv2d(32, 64, 3x3, padding=1) + BatchNorm + ReLU
   (conv4): Conv2d(64, 64, 3x3, padding=1) + BatchNorm + ReLU
   (pool2): MaxPool2d(2x2)
 
   # Third conv block: 8x8 -> 4x4
   (conv5): Conv2d(64, 128, 3x3, padding=1) + BatchNorm + ReLU
   (conv6): Conv2d(128, 128, 3x3, padding=1) + BatchNorm + ReLU
   (pool3): MaxPool2d(2x2)
 
   # Fully connected layers
   (fc1): Linear(2048 -> 256) + ReLU
   (dropout): Dropout(0.5)
   (fc2): Linear(256 -> 10)
 )
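The spatial sizes in the structure listing follow from the standard output-size formula for convolution and pooling, out = floor((in + 2*padding - kernel) / stride) + 1. The sketch below traces one spatial dimension through the three blocks; the function names are illustrative and not part of the report's code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_map(size=32, blocks=3):
    """Follow one spatial dimension through the three conv blocks."""
    sizes = [size]
    for _ in range(blocks):
        size = conv_out(size, kernel=3, padding=1)  # 3x3 conv, padding=1 keeps the size
        size = conv_out(size, kernel=3, padding=1)  # second conv in the block
        size = conv_out(size, kernel=2, stride=2)   # 2x2 max pool halves it
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    sizes = trace_feature_map()
    print(sizes)                  # [32, 16, 8, 4]
    print(128 * sizes[-1] ** 2)   # 2048, the input width of fc1
```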

Model Parameter Statistics

Item	Value
Total parameters	815,018
Trainable parameters	815,018
Model size	~3.2 MB
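The parameter total can be checked by hand from the layer shapes. The sketch below is a plain-Python tally, assuming every Conv2d and Linear layer has a bias and every BatchNorm2d has a learnable scale and shift (the PyTorch defaults); the helper names are made up for this example.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a Conv2d layer with a k x k kernel."""
    return c_in * c_out * k * k + c_out


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm2d layer."""
    return 2 * channels


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


def count_cifar10net_params():
    conv_channels = [(3, 32), (32, 32), (32, 64), (64, 64), (64, 128), (128, 128)]
    total = sum(conv_params(i, o) + bn_params(o) for i, o in conv_channels)
    total += linear_params(128 * 4 * 4, 256) + linear_params(256, 10)
    return total


if __name__ == "__main__":
    n = count_cifar10net_params()
    print(f"{n:,} parameters")                 # 815,018 parameters
    print(f"{n * 4 / 1e6:.2f} MB as float32")  # 3.26 MB as float32
```

About 64% of the parameters (524,544) sit in fc1 alone, so the classifier head rather than the convolutions dominates the model size.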

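Training Configuration

Parameter	Value	Notes
Batch size	128	Training
Test batch size	256	Evaluation
Epochs	15
Optimizer	Adam
Initial learning rate	0.001
LR schedule	StepLR	step_size=10, gamma=0.5
Loss function	CrossEntropyLoss	Takes raw logits; applies log-softmax internally
Dropout	0.5	Reduces overfitting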
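Because CrossEntropyLoss applies log-softmax itself, the model's last layer outputs raw logits with no softmax. As a rough illustration of what the loss computes for a single sample, here is a pure-Python sketch (not PyTorch's implementation, which also handles batching and averaging).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability of the target class for one sample."""
    return -log_softmax(logits)[target]


if __name__ == "__main__":
    uniform = [0.0] * 10
    print(round(cross_entropy(uniform, 3), 4))  # 2.3026, i.e. ln(10)
```

A 10-class model that guesses uniformly scores ln(10) ≈ 2.30, which gives a reference point for reading the loss values in the training log.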

Training Process

Training Log Summary

Epoch	Train Loss	Test Loss	Train Acc	Test Acc	Time(s)	LR
1	1.5592	1.0922	42.49%	60.74%	21.04	0.001000
2	1.1326	1.0773	59.84%	61.32%	15.44	0.001000
3	0.9534	0.7953	66.57%	71.72%	13.41	0.001000
4	0.8481	0.7147	70.93%	75.71%	13.56	0.001000
5	0.7685	0.7603	74.08%	74.45%	13.41	0.001000
6	0.7152	0.8444	75.72%	71.71%	17.20	0.001000
7	0.6648	0.5602	77.66%	80.25%	20.50	0.001000
8	0.6272	0.5824	78.96%	80.13%	20.31	0.001000
9	0.5870	0.5037	80.41%	82.52%	20.27	0.001000
10	0.5598	0.5997	81.54%	80.19%	19.25	0.000500
11	0.4864	0.4551	83.67%	84.22%	13.50	0.000500
12	0.4635	0.4479	84.86%	85.19%	13.50	0.000500
13	0.4452	0.4452	85.25%	85.18%	13.45	0.000500
14	0.4309	0.4519	85.60%	85.24%	13.58	0.000500
15	0.4189	0.4547	86.07%	85.03%	13.38	0.000500
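The LR column can be reproduced from the StepLR rule, lr = initial_lr * gamma^floor(steps / step_size). In the report's training loop, scheduler.step() runs before the log line is written, so the logged value is the rate for the next epoch; that is why the epoch-10 row already shows 0.000500 even though epoch 10 itself was trained at 0.001. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def step_lr(initial_lr, step_size, gamma, steps_taken):
    """Learning rate after `steps_taken` calls to scheduler.step()."""
    return initial_lr * gamma ** (steps_taken // step_size)


if __name__ == "__main__":
    for epoch in range(1, 16):
        used = step_lr(0.001, 10, 0.5, epoch - 1)  # rate used while training this epoch
        logged = step_lr(0.001, 10, 0.5, epoch)    # rate printed after scheduler.step()
        print(f"epoch {epoch:2d}: used {used:.6f}, logged {logged:.6f}")
```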

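Key Metrics

Metric	Value
Final test loss	0.4547
Final test accuracy	85.03%
Best test accuracy	85.24% (epoch 14)
Total training time	~272 s (~4.5 min); the logged per-epoch times sum to ~242 s
Average time per epoch	~18 s (272 s / 15); ~16 s from the logged per-epoch times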
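The summary figures can be recomputed directly from the training log. The sketch below copies the test accuracy and time columns from the log table and derives the best epoch and the timing totals; the variable and function names are made up for this example.

```python
# (epoch, test accuracy %, epoch time in s), copied from the training log summary
LOG = [
    (1, 60.74, 21.04), (2, 61.32, 15.44), (3, 71.72, 13.41), (4, 75.71, 13.56),
    (5, 74.45, 13.41), (6, 71.71, 17.20), (7, 80.25, 20.50), (8, 80.13, 20.31),
    (9, 82.52, 20.27), (10, 80.19, 19.25), (11, 84.22, 13.50), (12, 85.19, 13.50),
    (13, 85.18, 13.45), (14, 85.24, 13.58), (15, 85.03, 13.38),
]


def summarize(log):
    """Derive headline metrics from (epoch, test_acc, seconds) rows."""
    total_time = sum(t for _, _, t in log)
    best_epoch, best_acc, _ = max(log, key=lambda row: row[1])
    return {
        "total_time_s": round(total_time, 2),
        "mean_epoch_s": round(total_time / len(log), 2),
        "best_epoch": best_epoch,
        "best_acc": best_acc,
        "final_acc": log[-1][1],
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

The logged epoch times add up to about 242 s. Any wall-clock total above that would have to come from work outside the timed epoch loop, such as dataset loading or the final evaluation; that explanation is an assumption, since the log does not break it down.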

Training Curve Analysis

 Test accuracy trend:
 Epoch  1: 60.74%  ████████████████
 Epoch  5: 74.45%  ██████████████████████
 Epoch  9: 82.52%  ████████████████████████████
 Epoch 12: 85.19%  ███████████████████████████████
 Epoch 15: 85.03%  ███████████████████████████████
 
 Test loss trend:
 Epoch  1: 1.0922  ████████████████████████████████
 Epoch  5: 0.7603  ████████████████████████████
 Epoch  9: 0.5037  ███████████████
 Epoch 12: 0.4479  ██████████████
 Epoch 15: 0.4547  ██████████████

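GPU Acceleration Comparison

Performance Benchmark

 # Test conditions: same model, same data, 10 training iterations
 # Model: CIFAR10Net (815,018 parameters)
 # Batch: 256 samples
 
 CPU training time: 1.28 s
 GPU training time: 0.15 s
 Speedup: 8.41x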

Comparison Results

Device	Training time	Speedup
CPU	1.28 s	1.0x (baseline)
GPU (RTX 5090 Laptop)	0.15 s	8.41x
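The benchmark times translate into throughput as iterations x batch size / seconds. The sketch below uses the rounded times from the table, so its ratio (about 8.53x) differs slightly from the reported 8.41x, which was presumably computed from the unrounded timings.

```python
def throughput(iterations, batch_size, seconds):
    """Samples processed per second."""
    return iterations * batch_size / seconds


if __name__ == "__main__":
    cpu = throughput(10, 256, 1.28)
    gpu = throughput(10, 256, 0.15)
    print(f"CPU: {cpu:,.0f} samples/s")                   # CPU: 2,000 samples/s
    print(f"GPU: {gpu:,.0f} samples/s")                   # GPU: 17,067 samples/s
    print(f"ratio from rounded times: {gpu / cpu:.2f}x")  # 8.53x
```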

GPU Memory Usage

Metric	Value
Allocated memory	30.21 MB
Reserved memory	368.00 MB
Peak allocated	191.19 MB
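Analysis of the Speedup

Speedup: the RTX 5090 Laptop GPU ran the benchmark 8.41x faster than the CPU.
Applicability: the benefit of GPU acceleration typically grows with model and batch size, so larger networks should gain more than this small model.
Memory efficiency: peak allocation was about 191 MB of the 24 GB of VRAM, leaving ample room for a larger model or batch size.
Parallelism: the matrix operations at the core of deep learning map naturally onto the GPU's parallel architecture.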
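As a rough back-of-the-envelope check on the memory figures, the persistent training state (weights, gradients, and Adam's two moment buffers, all float32) can be estimated from the parameter count. This is only an estimate under those assumptions; activations, BatchNorm buffers, and library workspaces are not modeled, and they account for the rest of the peak.

```python
def training_state_bytes(num_params, bytes_per_value=4, adam=True):
    """Bytes for weights + gradients (+ Adam's two moment buffers)."""
    copies = 2 + (2 if adam else 0)
    return num_params * bytes_per_value * copies


if __name__ == "__main__":
    state_mb = training_state_bytes(815_018) / 1024**2
    print(f"weights + grads + Adam state: {state_mb:.2f} MB")  # 12.44 MB
    peak_fraction = 191.19 / (24 * 1024)
    print(f"peak allocation vs 24 GB: {peak_fraction:.2%}")    # 0.78%
```

Here MB means 1024**2 bytes, matching the units used by the report's memory logging.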

Core GPU Training Code

Moving Data to the GPU

 # During training
 data, target = data.to(device), target.to(device)
 
 # During inference
 output = model(data.to(device))

Complete Training Loop

 for epoch in range(1, epochs + 1):
     model.train()
     for batch_idx, (data, target) in enumerate(train_loader):
         # 1. Move the batch to the GPU
         data, target = data.to(device), target.to(device)
 
         # 2. Zero the gradients
         optimizer.zero_grad()
 
         # 3. Forward pass (computed on the GPU)
         output = model(data)
 
         # 4. Compute the loss
         loss = criterion(output, target)
 
         # 5. Backward pass (computed on the GPU)
         loss.backward()
 
         # 6. Update the parameters
         optimizer.step()

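GPU Synchronization

 import time
 import torch
 
 # CUDA kernels launch asynchronously, so synchronize before reading the clock
 torch.cuda.synchronize()
 start = time.time()
 # ... GPU operations ...
 torch.cuda.synchronize()
 end = time.time()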
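The synchronize-then-time pattern can be wrapped in a small reusable helper. The sketch below is one possible shape, not the report's code: the name timed and its sync parameter are hypothetical, and on a CUDA machine one would pass torch.cuda.synchronize as sync. It also runs a warm-up call first, since the first GPU iteration typically includes one-time startup cost.

```python
import time


def timed(fn, sync=None, repeats=10, warmup=1):
    """Average wall-clock seconds per call of `fn`.

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished. With sync=None this times plain CPU code.
    """
    for _ in range(warmup):          # exclude one-time startup cost
        fn()
    if sync is not None:
        sync()                       # make sure warm-up work is done
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()                       # wait for pending work before stopping the clock
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    per_call = timed(lambda: sum(i * i for i in range(10_000)))
    print(f"{per_call * 1e3:.3f} ms per call")
```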

Notes

Windows

 # Recommended settings on Windows
 num_workers = 0  # avoids multiprocessing issues
 pin_memory = True  # speeds up host-to-GPU transfers
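One way to keep these settings portable is to derive them from the platform at run time. The helper below is a hypothetical sketch: loader_kwargs is not part of the report's code, and the worker count used on non-Windows systems is an arbitrary illustrative value.

```python
import platform


def loader_kwargs(use_cuda, system=None):
    """Keyword arguments for DataLoader following the report's advice."""
    system = system or platform.system()
    return {
        # The report uses 0 workers on Windows; the value 2 elsewhere is an
        # arbitrary illustrative choice, not a measured optimum.
        "num_workers": 0 if system == "Windows" else 2,
        # Pinned host memory only helps when copying to a CUDA device.
        "pin_memory": bool(use_cuda),
    }


if __name__ == "__main__":
    print(loader_kwargs(use_cuda=True, system="Windows"))
```

The result could then be unpacked into the loader, for example DataLoader(train_dataset, batch_size=128, shuffle=True, **loader_kwargs(use_cuda=True)).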

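PyTorch GPU Utilities

torch.cuda.is_available() - checks whether CUDA is available
torch.cuda.device_count() - returns the number of GPUs
torch.cuda.get_device_name(0) - returns the name of GPU 0
torch.cuda.memory_allocated() - returns the GPU memory currently occupied by tensors, in bytes
torch.cuda.synchronize() - waits until all queued GPU work has finished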

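Suggestions for Further Experiments

Deeper networks: try more complex architectures such as ResNet or VGG
Data augmentation: add random rotation, color jitter, and similar transforms
LR scheduling: try CosineAnnealingLR or ReduceLROnPlateau
Regularization: tune the dropout rate, add L2 regularization (weight decay)
Batch normalization: study the role of BatchNorm in more depth
Mixed-precision training: use FP16 for a further speedup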
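On the mixed-precision suggestion: FP16 has a much narrower range than FP32, so very small gradient values can round to zero. Mixed-precision training usually counters this by multiplying the loss by a scale factor before the backward pass and dividing the gradients afterwards. The sketch below shows the effect with Python's struct module, which can round-trip IEEE 754 half-precision values; the gradient value and scale factor are illustrative only.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    grad = 1e-8                      # a small gradient value (illustrative)
    print(to_fp16(grad))             # 0.0: the value underflows in FP16
    scale = 1024.0                   # illustrative loss scale
    recovered = to_fp16(grad * scale) / scale
    print(recovered)                 # close to 1e-8 again
```

PyTorch's automatic mixed precision utilities handle this scaling automatically, so the sketch is meant only to show why it is needed.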

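Conclusion

This experiment completed the CIFAR-10 image classification task and achieved the following:

Outcome	Result
Model design	Custom CNN (815K parameters)
GPU acceleration	8.41x speedup
Test accuracy	85.24% best (85.03% final)
Training time	~4.5 minutes (GPU)

Through this experiment, beginners can learn:

GPU device selection and configuration
PyTorch CUDA programming basics
Convolutional neural network design and training
GPU performance analysis and optimization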

Output Files

File	Description
cifar10_classifier.py	Complete training code
training_log_cifar10_20260327_195423.txt	Training log
best_cifar10_model_20260327_195423.pth	Saved model
CIFAR10_GPU_实验报告.md	This report

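Complete Code

 """
 CIFAR-10 image classifier - GPU deep learning experiment
 Implemented in PyTorch; aimed at beginners learning GPU acceleration
 
 Dataset: CIFAR-10 (10 classes of color images: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
 """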
 
 import torch
 import torch.nn as nn
 import torch.optim as optim
 from torch.utils.data import DataLoader
 from torchvision import datasets, transforms
 import logging
 from datetime import datetime
 import time
 
 # ============================================================
 # 1. GPU device configuration and selection logic
 # ============================================================
 def get_device():
     """
     Detect and select the best available compute device.
     Priority: CUDA > MPS > CPU
     """
     if torch.cuda.is_available():
         device = torch.device("cuda")
         gpu_name = torch.cuda.get_device_name(0)
         gpu_count = torch.cuda.device_count()
         print(f"[GPU] Using CUDA device: {gpu_name}")
         print(f"[GPU] GPU count: {gpu_count}")
         print(f"[GPU] CUDA version: {torch.version.cuda}")
         # Report GPU memory usage in MB
         mem_allocated = torch.cuda.memory_allocated(0) / 1024**2
         mem_reserved = torch.cuda.memory_reserved(0) / 1024**2
         print(f"[GPU] Allocated memory: {mem_allocated:.2f} MB, reserved: {mem_reserved:.2f} MB")
     elif torch.backends.mps.is_available():
         device = torch.device("mps")
         print("[GPU] Using Apple MPS device")
     else:
         device = torch.device("cpu")
         print("[CPU] Using CPU device")
 
     return device
 
 
 def setup_logger(log_file):
     """Configure logging to write to both a file and the console."""
     logger = logging.getLogger("CIFAR10_Training")
     logger.setLevel(logging.INFO)
     logger.handlers.clear()
 
     formatter = logging.Formatter(
         '%(asctime)s - %(levelname)s - %(message)s',
         datefmt='%Y-%m-%d %H:%M:%S'
     )
 
     file_handler = logging.FileHandler(log_file, mode='w', encoding='utf-8')
     file_handler.setFormatter(formatter)
     console_handler = logging.StreamHandler()
     console_handler.setFormatter(formatter)
 
     logger.addHandler(file_handler)
     logger.addHandler(console_handler)
     return logger
 
 
 # ============================================================
 # 2. CNN model definition (designed for CIFAR-10)
 # ============================================================
 class CIFAR10Net(nn.Module):
     """
     Convolutional neural network for CIFAR-10.
     Structure: 3 x [(Conv-BN-ReLU) x 2 + MaxPool] -> FC -> ReLU -> Dropout -> FC
 
     Input: 3x32x32 color image
     Output: logits for 10 classes
     """
     def __init__(self):
         super().__init__()
 
         # First conv block
         self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
         self.bn1 = nn.BatchNorm2d(32)
         self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
         self.bn2 = nn.BatchNorm2d(32)
         self.pool1 = nn.MaxPool2d(2, 2)  # 32x32 -> 16x16
 
         # Second conv block
         self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
         self.bn3 = nn.BatchNorm2d(64)
         self.conv4 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
         self.bn4 = nn.BatchNorm2d(64)
         self.pool2 = nn.MaxPool2d(2, 2)  # 16x16 -> 8x8
 
         # Third conv block
         self.conv5 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
         self.bn5 = nn.BatchNorm2d(128)
         self.conv6 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
         self.bn6 = nn.BatchNorm2d(128)
         self.pool3 = nn.MaxPool2d(2, 2)  # 8x8 -> 4x4
 
         # Fully connected layers
         self.fc1 = nn.Linear(128 * 4 * 4, 256)
         self.fc2 = nn.Linear(256, 10)
 
         self.relu = nn.ReLU()
         self.dropout = nn.Dropout(0.5)
 
     def _conv_block(self, x, conv1, bn1, conv2, bn2, pool):
         """Conv block: Conv -> BN -> ReLU -> Conv -> BN -> ReLU -> Pool"""
         x = self.relu(bn1(conv1(x)))
         x = self.relu(bn2(conv2(x)))
         return pool(x)
 
     def forward(self, x):
         # Three conv blocks
         x = self._conv_block(x, self.conv1, self.bn1, self.conv2, self.bn2, self.pool1)
         x = self._conv_block(x, self.conv3, self.bn3, self.conv4, self.bn4, self.pool2)
         x = self._conv_block(x, self.conv5, self.bn5, self.conv6, self.bn6, self.pool3)
 
         # Flatten, then the fully connected layers
         x = x.view(-1, 128 * 4 * 4)
         x = self.relu(self.fc1(x))
         x = self.dropout(x)
         return self.fc2(x)
 
 
 # ============================================================
 # 3. Training functions
 # ============================================================
 def train_one_epoch(model, loader, criterion, optimizer, device):
     """Train for one epoch; return (mean batch loss, accuracy in %)."""
     model.train()
     running_loss = 0.0
     correct = 0
     total = 0
 
     for batch_idx, (data, target) in enumerate(loader):
         # --- Move the batch to the device ---
         data, target = data.to(device), target.to(device)
 
         # Zero the gradients
         optimizer.zero_grad()
 
         # Forward pass (computed on the GPU)
         output = model(data)
 
         # Compute the loss
         loss = criterion(output, target)
 
         # Backward pass (computed on the GPU)
         loss.backward()
 
         # Update the parameters
         optimizer.step()
 
         # Accumulate statistics
         running_loss += loss.item()
         _, predicted = output.max(1)
         total += target.size(0)
         correct += predicted.eq(target).sum().item()
 
     return running_loss / len(loader), 100. * correct / total
 
 
 def evaluate(model, loader, criterion, device):
     """Evaluate the model; return (mean batch loss, accuracy in %)."""
     model.eval()
     test_loss = 0.0
     correct = 0
     total = 0
 
     with torch.no_grad():
         for data, target in loader:
             # --- Inference on the device ---
             data, target = data.to(device), target.to(device)
             output = model(data)
             test_loss += criterion(output, target).item()
             _, predicted = output.max(1)
             total += target.size(0)
             correct += predicted.eq(target).sum().item()
 
     return test_loss / len(loader), 100. * correct / total
 
 
 # ============================================================
 # 4. Main training flow
 # ============================================================
 def main():
     # Create the logger
     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
     log_file = f"training_log_cifar10_{timestamp}.txt"
     logger = setup_logger(log_file)
 
     logger.info("=" * 70)
     logger.info("CIFAR-10 deep learning experiment - GPU-accelerated training")
     logger.info("=" * 70)
 
     # --------------------------------------------------
     # 4.1 GPU device selection
     # --------------------------------------------------
     device = get_device()
     logger.info(f"Device in use: {device}")
 
     # --------------------------------------------------
     # 4.2 Data loading
     # --------------------------------------------------
     logger.info("Loading the CIFAR-10 dataset...")
 
     # Data augmentation + normalization
     train_transform = transforms.Compose([
         transforms.RandomCrop(32, padding=4),
         transforms.RandomHorizontalFlip(),
         transforms.ToTensor(),
         transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
     ])
 
     test_transform = transforms.Compose([
         transforms.ToTensor(),
         transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
     ])
 
     # Load the datasets
     train_dataset = datasets.CIFAR10(
         root='./data',
         train=True,
         download=True,
         transform=train_transform
     )
     test_dataset = datasets.CIFAR10(
         root='./data',
         train=False,
         download=True,
         transform=test_transform
     )
 
     logger.info(f"Training samples: {len(train_dataset)}")
     logger.info(f"Test samples: {len(test_dataset)}")
     logger.info(f"Classes: {', '.join(train_dataset.classes)}")
 
     # DataLoaders; pinned memory only helps when copying to a CUDA device
     use_pinned = device.type == "cuda"
     train_loader = DataLoader(
         train_dataset,
         batch_size=128,
         shuffle=True,
         num_workers=0,
         pin_memory=use_pinned
     )
     test_loader = DataLoader(
         test_dataset,
         batch_size=256,
         shuffle=False,
         num_workers=0,
         pin_memory=use_pinned
     )
 
     # --------------------------------------------------
     # 4.3 Model initialization
     # --------------------------------------------------
     model = CIFAR10Net().to(device)
     logger.info(f"\nModel structure:\n{model}")
 
     total_params = sum(p.numel() for p in model.parameters())
     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
     logger.info(f"\nTotal parameters: {total_params:,}")
     logger.info(f"Trainable parameters: {trainable_params:,}")
 
     # --------------------------------------------------
     # 4.4 Loss function and optimizer
     # --------------------------------------------------
     criterion = nn.CrossEntropyLoss()
     optimizer = optim.Adam(model.parameters(), lr=0.001)
     scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
 
     # --------------------------------------------------
     # 4.5 Timed GPU training
     # --------------------------------------------------
     logger.info("\n" + "=" * 70)
     logger.info("Starting GPU training")
     logger.info("=" * 70)
 
     epochs = 15
     best_acc = 0.0
     model_path = None
 
     for epoch in range(1, epochs + 1):
         epoch_start = time.time()
 
         # Train
         train_loss, train_acc = train_one_epoch(
             model, train_loader, criterion, optimizer, device
         )
 
         # Update the learning rate
         scheduler.step()
 
         # Evaluate
         test_loss, test_acc = evaluate(model, test_loader, criterion, device)
 
         epoch_time = time.time() - epoch_start
 
         # Log the epoch (LR is read after scheduler.step(), so it is the next epoch's rate)
         logger.info(
             f"Epoch [{epoch:2d}/{epochs}] | "
             f"Loss: {train_loss:.4f}/{test_loss:.4f} | "
             f"Acc: {train_acc:.2f}%/{test_acc:.2f}% | "
             f"Time: {epoch_time:.2f}s | "
             f"LR: {optimizer.param_groups[0]['lr']:.6f}"
         )
 
         # Save the best model so far
         if test_acc > best_acc:
             best_acc = test_acc
             model_path = f"best_cifar10_model_{timestamp}.pth"
             torch.save(model.state_dict(), model_path)
             logger.info(f"  -> New best accuracy: {best_acc:.2f}%, model saved")
 
     # --------------------------------------------------
     # 4.6 Final evaluation (of the last-epoch weights)
     # --------------------------------------------------
     logger.info("\n" + "=" * 70)
     logger.info("Training finished - final evaluation")
     logger.info("=" * 70)
 
     final_loss, final_acc = evaluate(model, test_loader, criterion, device)
     logger.info(f"Test loss: {final_loss:.4f}")
     logger.info(f"Test accuracy: {final_acc:.2f}%")
     logger.info(f"Best accuracy: {best_acc:.2f}%")
 
     # --------------------------------------------------
     # 4.7 GPU memory usage statistics
     # --------------------------------------------------
     if torch.cuda.is_available():
         logger.info("\nGPU memory statistics:")
         logger.info(f"  - Allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
         logger.info(f"  - Reserved: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")
         logger.info(f"  - Peak allocated: {torch.cuda.max_memory_allocated(0) / 1024**2:.2f} MB")
 
     logger.info(f"\nModel saved to: {model_path}")
     logger.info("\n" + "=" * 70)
     logger.info("Training complete!")
     logger.info("=" * 70)
 
 
 # ============================================================
 # 5. Standalone performance benchmark
 # ============================================================
 def benchmark_device():
     """Compare CPU and GPU training speed on a fixed random batch."""
     print("\n" + "=" * 70)
     print("GPU vs CPU performance benchmark")
     print("=" * 70)
 
     device_cpu = torch.device("cpu")
     device_gpu = torch.device("cuda") if torch.cuda.is_available() else None
 
     # Create the test model and data
     model = CIFAR10Net()
     test_input = torch.randn(256, 3, 32, 32)
     test_target = torch.randint(0, 10, (256,))
 
     # CPU test
     model_cpu = model.to(device_cpu)
     criterion = nn.CrossEntropyLoss()
     optimizer = optim.Adam(model_cpu.parameters(), lr=0.001)
 
     start = time.time()
     for _ in range(10):
         optimizer.zero_grad()
         output = model_cpu(test_input)
         loss = criterion(output, test_target)
         loss.backward()
         optimizer.step()
     cpu_time = time.time() - start
     print(f"[CPU] Training time: {cpu_time:.2f}s")
 
     # GPU test (Module.to() moves the same model object in place)
     if device_gpu is not None:
         model_gpu = model.to(device_gpu)
         test_input_gpu = test_input.to(device_gpu)
         test_target_gpu = test_target.to(device_gpu)
         optimizer_gpu = optim.Adam(model_gpu.parameters(), lr=0.001)
 
         # No warm-up pass: the first GPU iteration's startup cost is included in the timing
         torch.cuda.synchronize()
         start = time.time()
         for _ in range(10):
             optimizer_gpu.zero_grad()
             output = model_gpu(test_input_gpu)
             loss = criterion(output, test_target_gpu)
             loss.backward()
             optimizer_gpu.step()
         torch.cuda.synchronize()
         gpu_time = time.time() - start
         print(f"[GPU] Training time: {gpu_time:.2f}s")
         print(f"[Speedup] GPU is {cpu_time/gpu_time:.2f}x faster than CPU")
 
 
 if __name__ == "__main__":
     main()
     benchmark_device()
