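Deep Learning: Convolution Notes

Last updated: June 15, 2023 (afternoon)

Some intuitions

A neural network is used to learn the convolution kernels.

A convolutional layer cross-correlates the input with a kernel matrix and adds a bias to produce the output.

The kernel matrix and the bias are learnable parameters.

The size of the kernel matrix is a hyperparameter (fixed before training).

A useful rule of thumb: when the height and width are halved, double the number of channels.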

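2D cross-correlation and 2D convolution

There is no essential difference between the two. 2D cross-correlation is

\(y_{i,j}=\sum_{a=1}^{h}\sum_{b=1}^{w}w_{a,b}x_{i+a,j+b}\)

2D convolution only adds a minus sign to the kernel indices. Because w is learned, the network can learn the flipped kernel just as easily, so the two are equivalent in practice:

\(y_{i,j}=\sum_{a=1}^{h}\sum_{b=1}^{w}w_{-a,-b}x_{i+a,j+b}\)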
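The relationship between the two formulas can be checked on a tiny example. The following is a minimal sketch in plain Python (nested lists instead of tensors); the helper names `corr2d`, `flip`, and `conv2d` are chosen here for illustration only.

```python
def corr2d(X, K):
    """2D cross-correlation of nested lists X (input) and K (kernel)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(h) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]


def flip(K):
    """Flip a kernel along both axes (this is what the minus signs do)."""
    return [row[::-1] for row in K[::-1]]


def conv2d(X, K):
    """True 2D convolution: cross-correlation with the flipped kernel."""
    return corr2d(X, flip(K))


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]

print(corr2d(X, K))        # [[19, 25], [37, 43]]
print(conv2d(X, flip(K)))  # same result: convolving with the flipped kernel
```

Since the kernel values are learned anyway, a layer that implements cross-correlation simply ends up with the flipped version of the kernel a true convolution would learn.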

Other dimensionalities

1D

\(y_i=\sum_{a=1}^hw_ax_{i+a}\)

Used for text, language, and time series.

3D

\(y_{i,j,k}=\sum_{a=1}^h\sum_{b=1}^w\sum_{c=1}^d w_{a,b,c}x_{i+a,j+b,k+c}\)

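Implementing a 2D convolutional layer

Cross-correlation operation

import torch
from torch import nn

def corr2d(X, K):
    # Slide the h x w kernel K over X and sum the element-wise products
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        # The kernel and the bias are the learnable parameters
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, X):
        return corr2d(X, self.weight) + self.bias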

Example

import torch
from torch import nn

conv2d = nn.Conv2d(1,1,kernel_size=(1,2),bias=False)

# Input tensor X
X = torch.ones((6, 8))
X[:, 2:6] = 0
X = X.reshape((1,1,6,8))
# Target tensor Y
Y = torch.zeros((6,7))
Y[:,1] = 1
Y[:,-2]= -1
Y = Y.reshape((1,1,6,7))

for i in range(100):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Gradient descent step on the kernel
    conv2d.weight.data[:] -= 3e-2 * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i+1}, loss {l.sum():.3f}')

print(conv2d.weight.data.reshape((1,2)))
#tensor([[ 1.0000, -1.0000]])
# So the unknown kernel is [1, -1]: cross-correlating X with it gives Y

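Padding and stride

Motivation

Each convolution without padding shrinks the output.

Input: 32x32, kernel_size: 5x5
- With this kernel size, each convolution removes 4 rows and 4 columns
- First convolution: 32x32 → 28x28
- ...
- After the seventh convolution: 4x4
In general, each convolution shrinks the shape from \(n_h \times n_w\) to \((n_h-k_h+1)\times(n_w-k_w+1)\)
- where k_h is the kernel height and k_w is the kernel width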
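The shrinking sequence above can be reproduced in a few lines. This is a small sketch; the function name `output_size` is made up for this example, and it assumes no padding and a stride of 1.

```python
def output_size(n, k):
    """Side length after one convolution with kernel size k (no padding, stride 1)."""
    return n - k + 1


n, k = 32, 5
sizes = [n]
for _ in range(7):
    n = output_size(n, k)
    sizes.append(n)

print(sizes)  # [32, 28, 24, 20, 16, 12, 8, 4]
```

After seven 5x5 convolutions only a 4x4 map is left, which is why deep networks need padding.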

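Padding

Add extra rows and columns around the input.

Zeros are added on the top, bottom, left, and right
- With \(p_h\) padded rows and \(p_w\) padded columns in total, the output shape is \((n_h-k_h+p_h+1) \times(n_w-k_w+p_w+1)\)
- Usually \(p_h=k_h-1\) and \(p_w=k_w-1\), so the output keeps the shape of the input
  - If \(k_h\) is odd, pad \(p_h/2\) rows on both the top and the bottom
  - If \(k_h\) is even, pad \(\lceil p_h/2\rceil\) rows on the top and \(\lfloor p_h/2\rfloor\) rows on the bottom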

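Stride

The stride controls how far the kernel moves at each step; it can move more than one element at a time.

When the input is large and the kernel is small, a larger stride reaches a small output quickly and reduces the number of intermediate convolutional layers.

Given a vertical stride \(s_h\) and a horizontal stride \(s_w\), the output shape is \(\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor\times\lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor\)
- If \(p_h=k_h-1\) and \(p_w=k_w-1\), the output shape is \(\lfloor(n_h+s_h-1)/s_h\rfloor\times\lfloor(n_w+s_w-1)/s_w\rfloor\)
- If, in addition, the input height and width are divisible by the strides, the output shape is \((n_h/s_h)\times(n_w/s_w)\)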
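The output-shape formula can be turned into a small calculator. This is a sketch with made-up names (`conv_output_size`, `conv_output_shape`). One assumption to note: the formula above uses the total padding \(p_h\), while the `padding` argument here is per side (as in `nn.Conv2d`), so it is doubled inside the function.

```python
def conv_output_size(n, k, pad_per_side=0, stride=1):
    """floor((n - k + p + s) / s), where p = 2 * pad_per_side is the total padding."""
    total_pad = 2 * pad_per_side
    return (n - k + total_pad + stride) // stride


def conv_output_shape(in_shape, kernel, padding=(0, 0), stride=(1, 1)):
    """Apply the 1D formula to the height and the width independently."""
    return tuple(conv_output_size(n, k, p, s)
                 for n, k, p, s in zip(in_shape, kernel, padding, stride))


print(conv_output_shape((8, 8), (3, 3), (1, 1)))          # (8, 8)
print(conv_output_shape((8, 8), (5, 3), (2, 1)))          # (8, 8)
print(conv_output_shape((8, 8), (3, 3), (1, 1), (2, 2)))  # (4, 4)
print(conv_output_shape((8, 8), (5, 3), (0, 1), (3, 4)))  # (2, 2)
```

The four calls use an 8x8 input with typical kernel, padding, and stride combinations, so the results can be compared with what `nn.Conv2d` reports for the same settings.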

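Summary

Padding and stride are hyperparameters of a convolutional layer; they are set when the network is declared
Padding adds extra rows/columns around the input to control how much the output shrinks; it is commonly set to p = k - 1 (kernel size minus 1)
Stride is the row/column step of the sliding kernel window; it can shrink the output shape by an integer factor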

Example

import torch
from torch import nn
import torch.nn.functional as F

# Helper: add the batch and channel dimensions, apply the layer, then strip them again
def comp_conv2d(conv2d,X):
    # (batch size, channels, height, width)
    X = X.reshape((1,1)+X.shape)

    Y = conv2d(X)
    return Y.reshape(Y.shape[2:])

# A convolutional layer with 1 input channel, 1 output channel, kernel size 3, and 1 pixel of padding on every side
conv2d = nn.Conv2d(1,1,kernel_size=3,padding=1)
X = torch.rand(size=(8,8))
print(comp_conv2d(conv2d,X).shape)
# 8 x 8

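# When the kernel height and width differ, set the padding per dimension
# Height: 8 - 5 + 2x2 (top and bottom padding) + 1 = 8
# Width: 8 - 3 + 1x2 (left and right padding) + 1 = 8
# Padding per side = (kernel_size - 1) / 2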
conv2d = nn.Conv2d(1,1,kernel_size=(5,3),padding=(2,1))
print(comp_conv2d(conv2d,X).shape)
# 8 x 8

# With a stride of 2
conv2d = nn.Conv2d(1,1,kernel_size=3,padding=1,stride=2)
print(comp_conv2d(conv2d,X).shape)
# 4 x 4

conv2d = nn.Conv2d(1,1,kernel_size=(5,3),padding=(0,1),stride=(3,4))
print(comp_conv2d(conv2d,X).shape)
# 2 x 2
# Height: floor((8 - 5 + 0x2 + 3) / 3) = 2
# Width: floor((8 - 3 + 1x2 + 4) / 4) = 2

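Input and output channels

Each channel has its own kernel

\(c_i\): number of input channels (channel_input)

\(c_o\): number of output channels (channel_output)

Multiple input channels

Each input channel is cross-correlated with its own 2D kernel, and the results are summed into a single output channel

Input X: \(c_i\times n_h\times n_w\)
Kernel W: \(c_i\times k_h\times k_w\)
Output Y: \(m_h\times m_w\)

Multiple output channels

Use several kernels; each kernel produces one output channel

Each output channel can be used to recognize a different specific pattern

Input X: \(c_i\times n_h\times n_w\)
Kernel W: \(c_o\times c_i\times k_h\times k_w\)
Output Y: \(c_o\times m_h\times m_w\)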

# Multiple input channels
import torch
from d2l import torch as d2l

def corr2d_multi_in(X,K):
    # zip pairs each input channel with its kernel; the per-channel results are summed
    return sum(d2l.corr2d(x,k) for x,k in zip(X,K))


X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

print(corr2d_multi_in(X, K))

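# Multiple output channels
import torch
from d2l import torch as d2l

def corr2d_multi_in_out(X,K):
    # Compute one output channel per 3D kernel in K and stack them along dim 0
    return torch.stack([corr2d_multi_in(X,k) for k in K],0)

# Build a kernel with 3 output channels by stacking K, K+1 and K+2
K = torch.stack((K,K+1,K+2),0)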
print(K.shape)

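`1x1` convolutional layer

Reference: 一文读懂卷积神经网络中的1x1卷积核 (qq.com)

It does not recognize spatial patterns (no neighboring pixels take part); it only fuses the values of the different channels at the same pixel position

\(k_h=k_w=1\)

Equivalent to a fully connected layer with input of shape \(n_hn_w\times c_i\) and weights of shape \(c_o\times c_i\)

\(z=W\cdot x^T+b\)
- \(W: \space c_o \times c_i\)
- \(x:\space n_hn_w \times c_i\)
At each position, the channel values are summed with the kernel values as weights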

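def corr2d_multi_in_out_1x1(X,K):
    # Input channels, height and width of the input
    c_i,h,w = X.shape
    # Number of kernels, i.e. the number of output channels
    c_o = K.shape[0]
    # Flatten X into a c_i x (h*w) matrix
    X = X.reshape((c_i,h*w))
    # Turn the 1x1 kernels into a c_o x c_i weight matrix W
    K= K.reshape((c_o,c_i))
    # Matrix multiplication, as in a fully connected layer
    Y = torch.matmul(K,X)
    # Restore the spatial layout
    return Y.reshape((c_o,h,w))

# Input: 3 channels, 3x3
X = torch.normal(0, 1, (3, 3, 3))
# Kernel: 2 output channels, 3 input channels, size 1x1
K = torch.normal(0, 1, (2, 3, 1, 1))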

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6

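Summary

The number of output channels is a hyperparameter of the convolutional layer
Each input channel has its own 2D kernel; the results over all input channels are summed to give one output channel
Each output channel has its own 3D kernel
A kernel tensor of shape [3,2,2,2] means 3 output channels, 2 input channels, kernel height 2, and kernel width 2
- There are three kernels; each kernel is 3D and consists of two 2x2 slices, one per input channel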

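Pooling

2D max pooling

Returns the maximum value in each sliding window

It has padding and stride, like a convolutional layer
It has no learnable parameters; the only operation is taking the maximum
Pooling is applied to each input channel separately to produce the corresponding output channel
Number of input channels = number of output channels

2D average pooling

Same as max pooling, except that the operation is the mean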

# Hand-written version
def pool2d(X,pool_size,mode='max'):
    p_h,p_w = pool_size
    Y = torch.zeros((X.shape[0]-p_h+1,X.shape[1]-p_w+1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i,j] = X[i:i+p_h,j:j+p_w].max()
            elif mode =='avg':
                Y[i,j] = X[i:i+p_h,j:j+p_w].mean()

    return Y

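# Create a max pooling layer with a 3x3 window
# By default the framework sets the stride equal to the window size, so the stride is also 3
pool2d = nn.MaxPool2d(3)
# padding and stride can also be set explicitly
pool2d = nn.MaxPool2d(3,padding=1,stride=2)
# The window can also be rectangular
# (PyTorch requires the padding to be at most half the window size, hence padding=(0,1))
pool2d = nn.MaxPool2d((2,3),padding=(0,1),stride=(2,3))

# With multiple channels, the pooling layer processes each channel separately
# Assumed input (not defined in the original notes): 1 x 1 x 4 x 4
X = torch.arange(16,dtype=torch.float32).reshape((1,1,4,4))
# Concatenate along the channel dimension: 1 x 2 x 4 x 4
X = torch.cat((X,X+1),1)

pool2d = nn.MaxPool2d(3,padding=1,stride=2)
# Output: 1 x 2 x 2 x 2
pool2d(X)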

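Summary

A pooling layer returns the maximum or the mean of each window, depending on the operation
Pooling layers reduce the sensitivity of convolutional layers to position
Pooling layers also have window size, padding, and stride as hyperparameters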
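To see why pooling reduces position sensitivity, it helps to pool a feature map and a copy shifted by one pixel. The following is a plain-Python sketch (stride 1, no padding); the function name `max_pool2d` and the toy feature maps are invented for this illustration.

```python
def max_pool2d(X, pool_h, pool_w):
    """Max pooling over nested lists with stride 1 and no padding."""
    out_h, out_w = len(X) - pool_h + 1, len(X[0]) - pool_w + 1
    return [[max(X[i + a][j + b] for a in range(pool_h) for b in range(pool_w))
             for j in range(out_w)]
            for i in range(out_h)]


# A feature map with a single activation, and the same map shifted one column right
original = [[0, 0, 0, 0],
            [0, 9, 0, 0],
            [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0],
           [0, 0, 9, 0],
           [0, 0, 0, 0]]

pooled_original = max_pool2d(original, 2, 2)
pooled_shifted = max_pool2d(shifted, 2, 2)
print(pooled_original)  # [[9, 9, 0], [9, 9, 0]]
print(pooled_shifted)   # [[0, 9, 9], [0, 9, 9]]
```

The middle output column is active in both cases, so a later layer reading that column still sees the feature even though it moved by one pixel.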

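Counting the parameters of convolutional and fully connected layers

Reference: CNN卷积层、全连接层的参数量、计算量

Convolutional layer parameters: the total number of kernel elements
- Formula: parameters = (filter size * number of channels in the previous layer's feature map) * number of filters in the current layer
Fully connected layer parameters:
- Formula: see the reference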
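As a worked example, the counts can be computed for LeNet-sized layers. This is a sketch with hypothetical helper names (`conv2d_params`, `linear_params`). The convolution count follows the formula above; the bias terms and the fully connected count (inputs times outputs, plus one bias per output) are the standard definitions and are not taken from the referenced article.

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """(filter size * input channels) * number of filters, plus one bias per filter."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """One weight per (input, output) pair, plus one bias per output."""
    return in_features * out_features + (out_features if bias else 0)


print(conv2d_params(1, 6, 5, 5, bias=False))  # 150
print(conv2d_params(1, 6, 5, 5))              # 156
print(conv2d_params(6, 16, 5, 5))             # 2416
print(linear_params(16 * 5 * 5, 120))         # 48120
```

Even in this small network, the first fully connected layer holds far more parameters than both convolutional layers together.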

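Batch normalization

Problem

Bottom: the layers close to the data

Top: the layers close to the loss

During training, the distribution (mean and variance) of each layer's inputs keeps shifting and differs from layer to layer

As training proceeds, the top layers converge quickly while the bottom layers change slowly.

But every time the bottom layers change, the features they produce change, and the top layers have to be retrained.

This slows down convergence.

Cause

The loss sits at the end (top) of the network, so the later layers train faster
The data enters at the very bottom of the network
- The bottom layers train slowly
- When the bottom layers change, the inputs to every later layer change
- The top layers have to relearn many times
- Convergence slows down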

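Solution

Keep the mean and variance of the distribution fixed from layer to layer.

In practice, fix the mean and variance within each mini-batch: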

\(\begin{split}\begin{aligned} \hat{\boldsymbol{\mu}}_\mathcal{B} &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x},\\ \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon.\end{aligned}\end{split}\)

\(\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.\)

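\(\gamma\) and \(\beta\) are learnable parameters. If normalizing to zero mean and unit variance is not a good fit, the network can learn a more suitable scale and shift from the data.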
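A quick numeric check of the two formulas, as a sketch in plain Python on a batch of scalars. The function name `batch_norm_1d` and the sample values are invented here; `eps` plays the role of \(\epsilon\).

```python
import math


def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of scalars with the batch mean and variance, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    std = math.sqrt(var + eps)  # eps is added to the variance, as in the formula
    return [gamma * (x - mean) / std + beta for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm_1d(batch)
scaled = batch_norm_1d(batch, gamma=2.0, beta=10.0)

print(normalized)                 # zero mean, variance close to 1
print(sum(scaled) / len(scaled))  # mean moved to beta
```

With the defaults the output has zero mean and roughly unit variance; `gamma` and `beta` then let the layer choose a different scale and mean.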

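Batch normalization layer

The learnable parameters are \(\gamma\) and \(\beta\)
It is applied
- to the input of a fully connected or convolutional layer, or
- to the output of a fully connected or convolutional layer, before the activation function
  - Subtract the mean from the output, divide by the standard deviation, apply the learnable \(\gamma\) and \(\beta\), and then apply the activation function
For a fully connected layer, it acts on the feature dimension
- The input of a fully connected layer is 2D
- Rows are samples and columns are features.
- Acting on the feature dimension means computing a mean and variance for each feature (column)
For a convolutional layer, it acts on the channel dimension
- The 2D (samples x features) view is extended to 4D inputs
- Batch size x Channels x Height x Width
  - Number of samples: Batch size x Height x Width pixels
  - Each pixel: Channels values (features)
- For each channel, compute the mean and variance over all of these pixels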
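The channel-wise statistics for the convolutional case can be illustrated on a tiny 4D input. This is a plain-Python sketch using nested lists in place of a tensor; `channel_stats` and the sample values are made up for the example.

```python
def channel_stats(X):
    """Per-channel mean and variance of X with shape (batch, channels, height, width)."""
    num_channels = len(X[0])
    means, variances = [], []
    for c in range(num_channels):
        # Gather every pixel of channel c across the batch, height and width
        values = [v for sample in X for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        variances.append(var)
    return means, variances


# Shape (2, 2, 1, 2): two samples, two channels, 1x2 images
X = [
    [[[1.0, 3.0]], [[10.0, 10.0]]],
    [[[5.0, 7.0]], [[20.0, 20.0]]],
]
means, variances = channel_stats(X)
print(means)      # [4.0, 15.0]
print(variances)  # [5.0, 25.0]
```

Each channel gets exactly one mean and one variance, no matter how many samples or pixels contribute to it.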

Code

import torch
from torch import nn
from d2l import torch as d2l

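# moving_mean, moving_var: running estimates of the mean and variance over the whole dataset
# eps: a small constant that avoids division by zero
# momentum: controls how fast moving_mean and moving_var are updated
def batch_norm(X,gamma,beta,moving_mean,moving_var,eps,momentum):
    # Gradients disabled means we are doing inference
    if not torch.is_grad_enabled():
        # Why use the global mean and variance here?
        # At inference time the input is often a single sample, which has no meaningful batch statistics
        X_hat = (X-moving_mean)/torch.sqrt(moving_var+eps)
    else:
        # Only fully connected (2D) and convolutional (4D) inputs are supported
        assert len(X.shape) in (2,4)
        if len(X.shape)==2:
            # Shape: (batch size, features)
            # Average each column over all samples (rows)
            # dim=0 reduces the row dimension
            # The result is a row vector with one value per feature
            mean = X.mean(dim=0)
            var = ((X-mean)**2).mean(dim=0)
        else:
            # Shape: (batch size, channels, height, width) = dims 0, 1, 2, 3
            # dim=(0,2,3) averages over batch, height and width for each channel
            # keepdim=True gives a result of shape 1 x n x 1 x 1
            mean = X.mean(dim=(0,2,3),keepdim=True)
            var = ((X-mean)**2).mean(dim=(0,2,3),keepdim=True)
        X_hat = (X-mean) / torch.sqrt(var+eps)
        moving_mean = momentum * moving_mean +(1.0-momentum)*mean
        moving_var = momentum * moving_var + (1.0-momentum)*var
    Y = gamma * X_hat +beta

    return Y,moving_mean.data,moving_var.data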

class BatchNorm(nn.Module):
    def __init__(self,num_features,num_dims):
        super().__init__()
        if num_dims==2:
            shape = (1,num_features)
        else:
            shape=(1,num_features,1,1)
        # nn.Parameter() registers the tensor as a parameter of the module
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta  = nn.Parameter(torch.zeros(shape))
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)
    
    def forward(self,X):
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        
        Y,self.moving_mean,self.moving_var = batch_norm(X,self.gamma,self.beta,self.moving_mean,self.moving_var,eps=1e-5,momentum=0.9)
        
        return Y

net = nn.Sequential(
    # 1 input channel, 6 output channels, kernel size 5
    nn.Conv2d(1,6,kernel_size=5),BatchNorm(6,num_dims=4),nn.Sigmoid(),nn.MaxPool2d(kernel_size=2,stride=2),
    nn.Conv2d(6,16,kernel_size=5),BatchNorm(16,num_dims=4),nn.Sigmoid(),nn.MaxPool2d(kernel_size=2,stride=2),
    nn.Flatten(),nn.Linear(16*4*4,120),
    BatchNorm(120,num_dims=2),nn.Sigmoid(),
    nn.Linear(120,84),BatchNorm(84,num_dims=2),
    nn.Sigmoid(),nn.Linear(84,10)
)

# Concise version using the built-in nn.BatchNorm2d() and nn.BatchNorm1d()
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(256, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10))


lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

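Summary

Batch normalization fixes the mean and variance within each mini-batch and then learns a suitable shift and scale
It speeds up convergence but usually does not change the final accuracy
It allows training with a larger learning rate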

LeNet: a classic convolutional neural network

import torch
from torch import nn
from d2l import torch as d2l

class Reshape(torch.nn.Module):
    def forward(self,x):
        # view() reshapes the tensor, here to (batch, 1, 28, 28)
        return x.view(-1,1,28,28)

net = nn.Sequential(
    Reshape(),
    # 1 input channel, 6 output channels, 5x5 kernel; padding=2 pads 28x28 to 32x32
    # nn.Sigmoid() is the activation function; it adds the nonlinearity after each layer
    nn.Conv2d(1,6,kernel_size=5,padding=2),nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2,stride=2),
    nn.Conv2d(6,16,kernel_size=5),nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2,stride=2),
    # The convolutional part outputs 4D data: 1 x 16 x 5 x 5 (the leading 1 is the batch size)
    # Flatten it so it can be fed into the multilayer perceptron
    nn.Flatten(),
    nn.Linear(16*5*5,120),nn.Sigmoid(),
    nn.Linear(120,84),nn.Sigmoid(),
    nn.Linear(84,10)
)

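# Print the output shape of each layer for a single 28x28 input
X = torch.rand(size=(1,1,28,28))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape: \t',X.shape)

# Reshape: batch 1, 1 channel, height and width 28
Reshape output shape: 	 torch.Size([1, 1, 28, 28])
# After padding the height and width are 32; batch 1, 6 output channels, kernel size 5; 32-5+1 = 28, so the output is still 28x28
Conv2d output shape: 	 torch.Size([1, 6, 28, 28])
# The activation function does not change the shape
Sigmoid output shape: 	 torch.Size([1, 6, 28, 28])
# Average pooling with window size 2 and stride 2
AvgPool2d output shape: 	 torch.Size([1, 6, 14, 14])
# Batch 1, 16 output channels, output height and width 10 (16 kernels; each takes a weighted sum over the 6 input channels)
Conv2d output shape: 	 torch.Size([1, 16, 10, 10])
# The activation function does not change the shape
Sigmoid output shape: 	 torch.Size([1, 16, 10, 10])
# Batch 1, 16 output channels, output height and width 5
AvgPool2d output shape: 	 torch.Size([1, 16, 5, 5])
# Batch 1; flatten the 1 x 16 x 5 x 5 4D data into one vector per sample
Flatten output shape: 	 torch.Size([1, 400])
# MLP hidden layer 400->120
Linear output shape: 	 torch.Size([1, 120])
# The activation function does not change the shape
Sigmoid output shape: 	 torch.Size([1, 120])
# MLP hidden layer 120->84
Linear output shape: 	 torch.Size([1, 84])
# The activation function does not change the shape
Sigmoid output shape: 	 torch.Size([1, 84])
# MLP output layer 84->10
Linear output shape: 	 torch.Size([1, 10])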


batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
    """Compute the accuracy of the model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval()  # Switch to evaluation mode
        if not device:
            # Use the device that holds the model parameters
            device = next(iter(net.parameters())).device
    # Number of correct predictions, total number of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Needed for BERT fine-tuning (introduced later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]

#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    """Train a model with a GPU (defined in Chapter 6)."""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
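            # Fully connected and convolutional layers use xavier_uniform initialization
            # It scales the random weights by the input and output sizes so that the input and output variances are about the same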
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    # SGD optimizer
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    # Cross-entropy loss for classification
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, number of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

AlexNet convolutional neural network

Characteristics

A larger and deeper LeNet

Main improvements

Dropout
ReLU
Max pooling

Code

import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
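    # Use a larger 11*11 window here to capture objects.
    # The stride of 4 reduces the height and width of the output.
    # The number of output channels is also much larger than in LeNet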
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
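    # Use a smaller window with padding 2 so that the input and output have the same height and width, and increase the number of output channels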
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
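    # Use three consecutive convolutional layers with smaller windows.
    # Except for the last one, the number of output channels increases further.
    # No pooling layer follows the first two of these layers, so the height and width are not reduced there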
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # The fully connected layers have several times more outputs than in LeNet. Dropout layers reduce overfitting
    nn.Linear(6400, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    # Output layer. Fashion-MNIST is used here, so there are 10 classes instead of the 1000 in the paper
    nn.Linear(4096, 10))

# Check how the data is transformed as it passes through the network
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X=layer(X)
    print(layer.__class__.__name__,'output shape:\t',X.shape)

#nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1)
Conv2d output shape:	 torch.Size([1, 96, 54, 54])
ReLU output shape:	 torch.Size([1, 96, 54, 54])
# nn.MaxPool2d(kernel_size=3, stride=2)
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
# nn.Conv2d(96, 256, kernel_size=5, padding=2)
Conv2d output shape:	 torch.Size([1, 256, 26, 26])
ReLU output shape:	 torch.Size([1, 256, 26, 26])
# nn.MaxPool2d(kernel_size=3, stride=2)
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
# nn.Conv2d(256, 384, kernel_size=3, padding=1)
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
# nn.Conv2d(384, 384, kernel_size=3, padding=1)
Conv2d output shape:	 torch.Size([1, 384, 12, 12])
ReLU output shape:	 torch.Size([1, 384, 12, 12])
# nn.Conv2d(384, 256, kernel_size=3, padding=1)
Conv2d output shape:	 torch.Size([1, 256, 12, 12])
ReLU output shape:	 torch.Size([1, 256, 12, 12])
# nn.MaxPool2d(kernel_size=3, stride=2)
MaxPool2d output shape:	 torch.Size([1, 256, 5, 5])
Flatten output shape:	 torch.Size([1, 6400])
# nn.Linear(6400, 4096)
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
# nn.Linear(4096, 4096)
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
# nn.Linear(4096, 10)
Linear output shape:	 torch.Size([1, 10])
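The printed shapes can be reproduced by hand with the output-size formula from the padding and stride section. Below is a small plain-Python sketch; the helper `out_size` and the layer table are written for this example and mirror the kernel, stride, and padding values of the AlexNet definition in this post.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer (padding is per side)."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, out_channels); None keeps the channel count (pooling)
layers = [
    ("conv1", 11, 4, 1, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 224, 1
trace = []
for name, kernel, stride, padding, out_channels in layers:
    size = out_size(size, kernel, stride, padding)
    channels = out_channels or channels
    trace.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

flattened = channels * size * size
print(flattened)  # 6400, the input size of the first nn.Linear
```

The last line explains where the 6400 in `nn.Linear(6400, 4096)` comes from: 256 channels of 5 x 5 feature maps.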

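VGG: networks using blocks

Characteristics

Deeper and narrower

Stack more 3 x 3 convolution kernels

VGG block

n convolutional layers with m channels each
followed by a 2 x 2 max pooling layer

VGG architecture

The whole convolutional part of AlexNet is replaced by n VGG blocks, giving variants such as VGG-16 and VGG-19
Blocks are reusable; different numbers of blocks and different hyperparameters give variants of different complexity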

Code

import torch
from torch import nn
from d2l import torch as d2l


def vgg_block(num_convs,in_channels,out_channels):
    layers = []
    # The loop variable is not needed; we only repeat num_convs times
    for _ in range(num_convs):
        layers.append(nn.Conv2d(
            in_channels,out_channels,kernel_size=3,padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    # nn.Sequential() accepts an OrderedDict or individual Modules
    # layers is a list, so *layers unpacks it
    # and passes the modules to Sequential one by one
    return nn.Sequential(*layers)

conv_arch = ((1,64),(1,128),(2,256),(2,512),(2,512))

def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    # Convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        # Fully connected part
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 10))

net = vgg(conv_arch)

X = torch.randn(size=(1,1,224,224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__,'output shape:\t',X.shape)

# Effect of each Sequential (VGG block): the height and width are halved, and the number of channels doubles until it reaches 512
Sequential output shape:	 torch.Size([1, 64, 112, 112])
Sequential output shape:	 torch.Size([1, 128, 56, 56])
Sequential output shape:	 torch.Size([1, 256, 28, 28])
Sequential output shape:	 torch.Size([1, 512, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
Flatten output shape:	 torch.Size([1, 25088])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])

# Shrink the network: divide the number of channels in every block by ratio
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)


lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

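NiN (Network in Network)

Motivation

Parameters in a convolutional layer: \(c_i \times c_o \times k^2\)
Parameters in the first fully connected layer after the convolutional layers:
- It is equivalent to a convolution with output_channels kernels, each as large as the whole feature map
- so the count is \(c_i \times c_o \times m_h \times m_w\), where \(m_h \times m_w\) is the feature map size (for example 6400 x 4096 in the AlexNet above), far more than a convolutional layer needs

Block architecture

One convolutional layer followed by two 1 x 1 convolutional layers
- The 1 x 1 layers use stride 1 and no padding, so their output shape is the same as that of the convolutional layer
- They act as fully connected layers applied at each pixel

Characteristics

No fully connected layers
NiN blocks alternate with max pooling layers of stride 2
- This gradually reduces the height and width (halving) and increases the number of channels (doubling)
Global average pooling at the end produces the output
- Its number of input channels equals the number of classes
- Its pooling window covers the full height and width of its input feature map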
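Global average pooling is easy to picture on a toy example: each channel's feature map is averaged into one class score. The sketch below uses plain Python lists; the function name `global_avg_pool` and the numbers (3 classes, 2x2 maps) are hypothetical and only illustrate the idea.

```python
def global_avg_pool(feature_maps):
    """Average each channel's feature map down to a single number."""
    return [sum(v for row in channel for v in row) / (len(channel) * len(channel[0]))
            for channel in feature_maps]


# Hypothetical output of the last NiN block for one image: 3 classes, 2x2 maps
feature_maps = [
    [[1.0, 2.0], [3.0, 2.0]],
    [[0.0, 0.0], [4.0, 0.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
scores = global_avg_pool(feature_maps)
predicted = scores.index(max(scores))

print(scores)     # [2.0, 1.0, 5.0]
print(predicted)  # 2
```

Because there is one channel per class, the pooled values can be used directly as class scores, and no fully connected layer is needed.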

Code

import torch
import torch.nn as nn
from d2l import torch as d2l

# Define the NiN block
def nin_block(in_channels,out_channels,kernel_size,stride,padding):
    return nn.Sequential(
    nn.Conv2d(in_channels,out_channels,kernel_size,stride,padding),nn.ReLU(),
    nn.Conv2d(out_channels,out_channels,kernel_size=1),nn.ReLU(),
    nn.Conv2d(out_channels,out_channels,kernel_size=1),nn.ReLU(),
    )

# Build the network
net  = nn.Sequential(
    nin_block(1,96,kernel_size=11,stride=4,padding=0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(96,256,kernel_size=5,stride=1,padding=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(256,384,kernel_size=3,stride=1,padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Dropout(0.5),
    nin_block(384,10,kernel_size=3,stride=1,padding=1),
    nn.AdaptiveAvgPool2d((1,1)),
    # Flatten the output to batch_size x 10
    nn.Flatten()
)

# One sample of shape 1 x 224 x 224
X = torch.rand(size=(1,1,224,224))
# Print the output shape of each layer
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output:\t',X.shape)

# Training
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

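GoogLeNet

Characteristics

Introduces the Inception block

An Inception block consists of four parallel paths. The first three paths use convolutional layers with window sizes of 1×1, 3×3, and 5×5 to extract information at different spatial scales. The middle two paths first apply a 1×1 convolution to the input to reduce the number of channels and thus the model complexity. The fourth path uses a 3×3 max pooling layer followed by a 1×1 convolutional layer to change the number of channels. All four paths use appropriate padding so that the output has the same height and width as the input. Finally, the outputs of the paths are concatenated along the channel dimension to form the output of the Inception block. The hyperparameters usually tuned in an Inception block are the numbers of output channels of each layer.
The network is built from Inception blocks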

Code

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

# Define the Inception block
class Inception(nn.Module):
    # c1--c4 are the numbers of output channels of the four paths
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Path 1: a single 1x1 convolutional layer
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Path 2: a 1x1 convolutional layer followed by a 3x3 convolutional layer
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Path 3: a 1x1 convolutional layer followed by a 5x5 convolutional layer
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Path 4: a 3x3 max pooling layer followed by a 1x1 convolutional layer
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # Concatenate the outputs along the channel dimension
        return torch.cat((p1, p2, p3, p4), dim=1)

# Define the stages of the network
b1 = nn.Sequential(
    nn.Conv2d(1,64,kernel_size=7,stride=2,padding=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3,stride=2,padding=1)
)

b2 = nn.Sequential(nn.Conv2d(64, 64, kernel_size=1),
                   nn.ReLU(),
                   nn.Conv2d(64, 192, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b3 = nn.Sequential(Inception(192, 64, (96, 128), (16, 32), 32),
                   Inception(256, 128, (128, 192), (32, 96), 64),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b4 = nn.Sequential(Inception(480, 192, (96, 208), (16, 48), 64),
                   Inception(512, 160, (112, 224), (24, 64), 64),
                   Inception(512, 128, (128, 256), (24, 64), 64),
                   Inception(512, 112, (144, 288), (32, 64), 64),
                   Inception(528, 256, (160, 320), (32, 128), 128),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b5 = nn.Sequential(Inception(832, 256, (160, 320), (32, 128), 128),
                   Inception(832, 384, (192, 384), (48, 128), 128),
                   nn.AdaptiveAvgPool2d((1,1)),
                   nn.Flatten())

net = nn.Sequential(b1, b2, b3, b4, b5, nn.Linear(1024, 10))

# Print the output shape of each stage using one small 96x96 input
X = torch.rand(size=(1, 1, 96, 96))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

Sequential output shape:	 torch.Size([1, 64, 24, 24])
Sequential output shape:	 torch.Size([1, 192, 12, 12])
Sequential output shape:	 torch.Size([1, 480, 6, 6])
Sequential output shape:	 torch.Size([1, 832, 3, 3])
Sequential output shape:	 torch.Size([1, 1024])
Linear output shape:	 torch.Size([1, 10])
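The channel counts in the shapes above follow directly from the concatenation: the output of an Inception block has as many channels as its four paths produce together. A short arithmetic sketch follows; the helper name `inception_out_channels` is made up, and the arguments are copied from the `Inception(...)` calls in this post.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four paths; c2 and c3 are (reduce, out) pairs."""
    return c1 + c2[1] + c3[1] + c4


b3_first = inception_out_channels(64, (96, 128), (16, 32), 32)
b3_second = inception_out_channels(128, (128, 192), (32, 96), 64)
b5_last = inception_out_channels(384, (192, 384), (48, 128), 128)

print(b3_first, b3_second, b5_last)  # 256 480 1024
```

This is why the second block in `b3` takes 256 input channels, `b4` starts with 480, and the final `nn.Linear` takes 1024 inputs.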

# Training
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

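ResNet (residual network)

Characteristics

Guarantees that adding new layers at least does not make the result worse

Residual block

Stacking one more layer changes the function class; the goal is to enlarge the function class so that it still contains the previous one
A residual block adds a shortcut connection, giving the structure \(f(x) = x + g(x)\)

The input is passed along the shortcut unchanged, or through a 1x1 convolutional layer that changes the number of channels so that it can still be added to \(g(x)\)

Even if the block in the middle learns nothing, the result of the earlier layers still propagates onward
The overall architecture is similar to VGG and GoogLeNet, but with ResNet blocks instead

Types

A ResNet block that halves the height and width (stride 2)
followed by several ResNet blocks that keep the height and width unchanged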
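The claim that a block which has learned nothing still passes its input along can be shown with a toy residual block. This is only a sketch in plain Python: `g` is a single bias-free linear layer with all-zero weights standing in for an untrained branch, and all names are invented for the illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Toy fully connected layer without bias; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, weights):
    """f(x) = relu(x + g(x)), where g is a single toy linear layer."""
    g = linear(x, weights)
    return relu([xi + gi for xi, gi in zip(x, g)])


x = [1.0, 2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]  # g(x) = 0: the branch learned nothing

with_shortcut = residual_block(x, zero_weights)
without_shortcut = relu(linear(x, zero_weights))
print(with_shortcut)     # [1.0, 2.0, 3.0]
print(without_shortcut)  # [0.0, 0.0, 0.0]
```

With the shortcut the (non-negative) input survives unchanged, while the same layer without a shortcut wipes out the signal.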

Code

from torch.nn import functional as F
from d2l import torch as d2l
import torch.nn as nn


# Define the Residual block
class Residual(nn.Module):
    def __init__(self,input_channels,num_channels,use_1x1conv=False,strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels,num_channels,kernel_size=3,padding=1,stride=strides)
        self.conv2 = nn.Conv2d(num_channels,num_channels,kernel_size=3,padding=1)
        
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels,num_channels,kernel_size=1,stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)
        # inplace=True lets ReLU overwrite its input instead of allocating new memory
        # (in-place update)
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self,X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        
        return F.relu(Y)


# Group the residual blocks into modules; the modules make up ResNet

b1 = nn.Sequential(
    nn.Conv2d(1,64,kernel_size=7,stride=2),
    nn.BatchNorm2d(64),nn.ReLU(),
    nn.MaxPool2d(kernel_size=3,stride=2,padding=1)
)

def resnet_block(input_channels,num_channels,num_residuals,first_block=False):
    blk = []
    for i in range(num_residuals):
        if i==0 and not first_block:
            blk.append(Residual(input_channels,num_channels,use_1x1conv=True,strides=2))
        else:
            blk.append(Residual(num_channels,num_channels))
    return blk

# * unpacks the list returned by resnet_block
b2 = nn.Sequential(*resnet_block(64,64,2,first_block=True))
b3 = nn.Sequential(*resnet_block(64,128,2))
b4 = nn.Sequential(*resnet_block(128,256,2))
b5 = nn.Sequential(*resnet_block(256,512,2))

net = nn.Sequential(b1,b2,b3,b4,b5,
                    nn.AdaptiveAvgPool2d((1,1)),
                    nn.Flatten(),nn.Linear(512,10))


lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

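Summary

Residual blocks make very deep networks easier to train
Most later networks use the idea of residual blocks in one form or another
Thanks to the skip connections, the shallow layers at the bottom can be trained well first, and then the deeper layers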

Notes

#DeepLearning

Deep Learning Convolution Notes

https://anonymouslosty.ink/2023/06/03/深度学习卷积笔记/

Author: Ling yi

Published: June 3, 2023

Updated: June 15, 2023
