Deep Learning: Computer Vision Notes

Last updated: early morning of July 5, 2023

Bounding boxes

Implementation

import torch
from d2l import torch as d2l

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = torch.stack((x1, y1, x2, y2), axis=-1)
    return boxes

# Draw a bounding box on the figure
#@save
def bbox_to_rect(bbox, color):
    # Convert a box in (upper-left x, upper-left y, lower-right x, lower-right y) format
    # into matplotlib's ((upper-left x, upper-left y), width, height) format
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

Usage

# bbox is short for bounding box; img is the image being annotated (its loading code is not part of these notes)
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

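Anchor boxes

Principles

Anchor boxes vs. bounding boxes

Bounding box (bbox): the ground truth
Anchor box: a candidate region used for prediction

Characteristics

Anchor boxes underlie one family of object detection algorithms
The algorithm proposes many regions called anchor boxes
It predicts whether each anchor box contains an object of interest
If so, it predicts the offset from this anchor box to the ground-truth bounding box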

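IoU (intersection over union)

Measures the similarity between two boxes
For two sets A and B, the Jaccard index is
- \[J(A,B)=\frac{|\mathbf A\cap \mathbf B|}{|\mathbf A\cup \mathbf B|}\]
- 0 means no overlap; 1 means the two sets coincide exactly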
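Below is a minimal, dependency-free sketch of the formula above for two axis-aligned boxes. It assumes the (upper-left x, upper-left y, lower-right x, lower-right y) corner format from the bounding-box section. `iou_pair` is a hypothetical helper name chosen for these notes.

```python
def iou_pair(a, b):
    """IoU (Jaccard index) of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clipped at 0 when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou_pair((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
```

Identical boxes give 1.0 and disjoint boxes give 0.0, which matches the two extremes listed above.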

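Labeling anchor boxes

Each anchor box is a training example
Each anchor box is either labeled as background or associated with a ground-truth bounding box
Drawback
- A large number of anchor boxes may be generated, most of which become negative (background) examples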
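The labeling rule can be sketched as follows. This is a simplified version under stated assumptions. First, each ground-truth box claims the unclaimed anchor with which it has the highest IoU. Then every remaining anchor is assigned to its best-matching ground-truth box only if that IoU reaches a threshold (0.5 here, an assumed value); otherwise it is labeled background (-1). The example also shows why most anchors end up as negative examples. `assign_anchors` and `iou_pair` are hypothetical names.

```python
def iou_pair(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter = max(0.0, min(a[2], b[2]) - max(a[0], b[0])) * \
            max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, iou_threshold=0.5):
    """Per anchor: index of its assigned ground-truth box, or -1 for background."""
    assignment = [-1] * len(anchors)
    # Step 1: every ground-truth box claims its best still-unclaimed anchor
    for j, gt in enumerate(gt_boxes):
        free = [i for i in range(len(anchors)) if assignment[i] == -1]
        if free:
            best = max(free, key=lambda i: iou_pair(anchors[i], gt))
            assignment[best] = j
    # Step 2: remaining anchors take their best ground-truth box if the IoU is high enough
    for i, anchor in enumerate(anchors):
        if assignment[i] != -1 or not gt_boxes:
            continue
        ious = [iou_pair(anchor, gt) for gt in gt_boxes]
        best_j = max(range(len(gt_boxes)), key=lambda j: ious[j])
        if ious[best_j] >= iou_threshold:
            assignment[i] = best_j
    return assignment


anchors = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12), (0, 0, 2, 2.2)]
print(assign_anchors(anchors, [(0, 0, 2, 2)]))  # [0, -1, -1, 0]
```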

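Suppressing outputs with non-maximum suppression (NMS)

Each anchor box predicts one bounding box
NMS merges similar predictions:
- Select the non-background prediction with the highest confidence
- Remove all other predictions whose IoU with it is greater than \(\theta\)
- Repeat until every prediction has been either kept or removed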
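The three NMS steps can be sketched in plain Python. This sketch makes several assumptions: all boxes are already non-background predictions of a single class in corner format, `scores` holds their confidences, and `nms` and `iou_pair` are hypothetical helper names. Comparing with `<= theta` keeps a box whose IoU equals θ exactly. The notes above do not specify that choice.

```python
def iou_pair(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter = max(0.0, min(a[2], b[2]) - max(a[0], b[0])) * \
            max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, theta):
    """Greedy NMS: indices of the kept boxes, highest score first."""
    # Candidates sorted by descending confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # the highest remaining score is kept
        keep.append(best)
        # Drop every remaining prediction that overlaps it by more than theta
        order = [i for i in order if iou_pair(boxes[best], boxes[i]) <= theta]
    return keep


boxes = [(0, 0, 2, 2), (0, 0, 2, 2.1), (5, 5, 7, 7)]
print(nms(boxes, [0.9, 0.8, 0.7], theta=0.5))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.95, so it is suppressed. The third box does not overlap the first and survives.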

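How anchor boxes are generated

Center anchor boxes on every pixel
Generate anchor boxes with different heights and widths
- The width and height of an anchor box are \(ws\sqrt{r}\) and \(hs/\sqrt{r}\), respectively
Instead of using every combination of the sizes s and ratios r, only the combinations that contain \(s_1\) or \(r_1\) are used
- This gives n+m-1 anchor boxes per pixel (number of sizes + number of ratios - 1)
- \((s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).\)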
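Here is a short sketch of the size and ratio rule for the anchors centered on one pixel. The 100 x 100 image and the size/ratio lists are illustrative values. `anchor_shapes` is a hypothetical name.

```python
import math


def anchor_shapes(w, h, sizes, ratios):
    """(width, height) of the n + m - 1 anchor boxes centered on one pixel."""
    # Only the pairs that contain sizes[0] or ratios[0]
    pairs = [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]
    # Width w*s*sqrt(r), height h*s/sqrt(r)
    return [(w * s * math.sqrt(r), h * s / math.sqrt(r)) for s, r in pairs]


shapes = anchor_shapes(100, 100, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(len(shapes))  # 3 + 3 - 1 = 5
print(shapes[0])    # (75.0, 75.0)
```

With a square image, the width-to-height ratio of each anchor equals its r. For example, the fourth shape, (s_1, r_2 = 2), is twice as wide as it is tall.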

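Summary

One family of object detection algorithms makes predictions based on anchor boxes
These algorithms first generate many anchor boxes and label them, treating each anchor box as a training example
At prediction time, NMS removes redundant predictions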

Code

Import libraries and set the print precision

%matplotlib inline
import torch
from d2l import torch as d2l

torch.set_printoptions(2)  # print tensors with 2 decimal places

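The width and height of an anchor box are \(ws\sqrt{r}\) and \(hs/\sqrt{r}\), respectively, where

w: image width; h: image height
s: scale, the anchor box's side length as a fraction of the image's
r: ratio, the anchor box's width-to-height ratio (relative to the image's own aspect ratio)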
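Combining these definitions, the sketch below places the n+m-1 anchor shapes on every pixel of an fh x fw feature map. It uses coordinates normalized by the image width and height, so the anchor width becomes s√r and the height s/√r. This is an illustration with a hypothetical name, not d2l's `multibox_prior`. It does not clip boxes to the image, so coordinates may fall outside [0, 1].

```python
import math


def generate_anchors(fh, fw, sizes, ratios):
    """All anchors of an fh x fw feature map as normalized (x1, y1, x2, y2)."""
    pairs = [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]
    anchors = []
    for i in range(fh):
        for j in range(fw):
            # Pixel center, scaled to [0, 1]
            cx, cy = (j + 0.5) / fw, (i + 0.5) / fh
            for s, r in pairs:
                # (w*s*sqrt(r)) / w and (h*s/sqrt(r)) / h, halved
                half_w, half_h = s * math.sqrt(r) / 2, s / math.sqrt(r) / 2
                anchors.append((cx - half_w, cy - half_h, cx + half_w, cy + half_h))
    return anchors


anchors = generate_anchors(2, 3, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
print(len(anchors))  # 2 * 3 * (3 + 3 - 1) = 30
```

The count fh * fw * (n + m - 1) is the same bookkeeping that produces the "32 x 32 x (3+2-1)" figure in the SSD section.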

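Object detection algorithms

R-CNN

Region-based CNN

Characteristics

Uses the heuristic selective search algorithm to choose anchor boxes (region proposals)
Uses a pretrained model to extract features from each anchor box
Trains SVMs to classify each box into a category
Trains a linear regression model to predict bounding box offsets

Each image turns into roughly a thousand small crops, and each crop passes through the CNN separately, so the computation is very expensive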

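RoI pooling (region of interest pooling layer)

Given an anchor box, split it evenly into n x m blocks and output the maximum value in each block
- Anchor boxes differ in size, but their features must all have the same size before they can be stacked into a batch for classification
No matter how large the anchor box is, the output always has nm values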
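Below is a plain-Python sketch of RoI max pooling on a single-channel feature map. It makes two assumptions. First, the region is given in feature-map cells as (row_start, col_start, row_end, col_end) with exclusive ends. Second, block boundaries use a floor start and a ceiling end, so every block is non-empty even when the region does not divide evenly; neighboring blocks may then share a row or column. Other implementations may round differently. `roi_max_pool` is a hypothetical name.

```python
def roi_max_pool(feature, region, n, m):
    """Max-pool one region of a 2D feature map into an n x m grid."""
    r0, c0, r1, c1 = region
    h, w = r1 - r0, c1 - c0
    out = []
    for i in range(n):
        # Rows of block i: floor(i*h/n) up to ceil((i+1)*h/n), offset by r0
        rs, re = r0 + (i * h) // n, r0 + ((i + 1) * h + n - 1) // n
        row = []
        for j in range(m):
            cs, ce = c0 + (j * w) // m, c0 + ((j + 1) * w + m - 1) // m
            row.append(max(feature[a][b] for a in range(rs, re) for b in range(cs, ce)))
        out.append(row)
    return out


feature = [[4 * i + j for j in range(4)] for i in range(4)]  # values 0..15
print(roi_max_pool(feature, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
print(roi_max_pool(feature, (0, 0, 3, 3), 2, 2))  # [[5, 6], [9, 10]]
```

Both regions have different sizes, but each produces exactly n x m = 4 values. This fixed output size is what allows the regions to be batched.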

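Procedure

Use selective search on the input image to select multiple high-quality proposed regions (Uijlings et al., 2013). These proposed regions are usually selected at multiple scales and have different shapes and sizes. Each proposed region is labeled with a class and a ground-truth bounding box;
Choose a pretrained convolutional neural network and truncate it before the output layer. Resize (warp) each proposed region to the input size the network requires, and extract the region's features through forward propagation (Fast R-CNN later replaced this warping step with RoI pooling);
Take the features of each proposed region together with its labeled class as an example. Train multiple support vector machines to classify objects, where each SVM decides whether an example belongs to one particular class;
Take the features of each proposed region together with its labeled bounding box as an example, and train a linear regression model to predict the ground-truth bounding box.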

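Fast R-CNN

Characteristics

Runs the CNN once on the whole image to extract features, instead of once per proposed region
Uses an RoI pooling layer to produce a fixed-length feature for each anchor box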

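SSD (Single Shot MultiBox Detector)

Architecture

Base network block

2 x 3 x 256 x 256 --> 2 x 64 x 32 x 32. Purpose: extract features and produce the feature map Y (2 x 64 x 32 x 32)

Anchor box generation

multibox_prior(data, sizes, ratios). Purpose: generate anchor boxes from the height and width of data and the given sizes and ratios: 32 x 32 x (3+2-1) anchor boxes

Class prediction

cls_predictor(Y). Purpose: from the feature map, predict the class of every anchor box centered on each pixel: 32 x 32 x (3+2-1) x (1+1)

Bounding box prediction

bbox_predictor(Y). Purpose: from the feature map, predict the offsets of every anchor box: 32 x 32 x (3+2-1) x 4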

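Understanding the predictors

Feature map: 2 x 64 x 32 x 32
cls_predictor, i.e. Conv2d(64, 8, kernel_size=3, padding=1), extracts 8 channels from the feature map as new features (4 anchors x 2 classes)
- 8 output channels, i.e. 8 filters
- Each filter has a 3 x 3 x 64 kernel
- These filters are dedicated to class features; during training their kernel parameters gradually learn to classify
bbox_predictor, i.e. Conv2d(64, 16, kernel_size=3, padding=1), extracts 16 offset values (4 anchors x 4 coordinates) from the feature map as new features
- 16 output channels, i.e. 16 filters
- Each filter has a 3 x 3 x 64 kernel
- These filters are dedicated to offset features; during training their kernel parameters gradually learn to predict the offsets
Regardless of the number of input channels, each filter produces a single-channel feature map.
A convolutional layer can contain many filters. Each filter convolves all feature maps (channels) coming from the previous layer and sums the results, producing one new feature map; different filters extract different features.
Therefore, the number of filters in a layer equals the number of feature maps (output channels) it passes to the next layer.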
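The channel counts above follow from simple arithmetic. This sketch assumes the TinySSD configuration from the code below: 2 sizes and 3 ratios per pixel (so 4 anchors), 1 object class, and feature maps of 32x32, 16x16, 8x8, 4x4 and 1x1 for a 256 x 256 input. Both function names are hypothetical.

```python
def predictor_channels(num_anchors, num_classes):
    """Output channels of the class and bbox predictor convolutions."""
    cls_channels = num_anchors * (num_classes + 1)  # +1 for background
    bbox_channels = num_anchors * 4                 # 4 offsets per anchor
    return cls_channels, bbox_channels


def total_anchors(feature_sizes, num_anchors):
    """Every pixel of every stage's feature map gets num_anchors anchor boxes."""
    return sum(h * w * num_anchors for h, w in feature_sizes)


num_sizes, num_ratios = 2, 3
num_anchors = num_sizes + num_ratios - 1  # 4
print(predictor_channels(num_anchors, num_classes=1))  # (8, 16)
print(total_anchors([(32, 32), (16, 16), (8, 8), (4, 4), (1, 1)], num_anchors))  # 5444
```

The result, 5444, is the anchor count printed by the TinySSD example: (1024 + 256 + 64 + 16 + 1) x 4.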

Code

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

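# num_anchors: number of anchor boxes generated per pixel
# num_classes: number of object classes (background excluded)
# num_anchors * (num_classes + 1) output channels: one score per class, plus background, for every anchor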
def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)

# Predict, for each anchor box, 4 offsets to its ground-truth bounding box
def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, 
                     kernel_size=3, padding=1)

# Move the channel dimension to the end, then flatten
# start_dim=1 flattens the last three dimensions into one vector per example
def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

# After flattening, concatenate the predictions of all stages along dimension 1
def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)

# Downsampling block: changes the number of channels and halves the height and width
def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             # padding=1 keeps height and width unchanged
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    # Halve the height and width with max pooling
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)

# Base network block
def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    return nn.Sequential(*blk)

# Channels: 3 -> 16 -> 32 -> 64; height/width: 256 / 2^3 = 32
base_net()(torch.zeros((2, 3, 256, 256))).shape  # torch.Size([2, 64, 32, 32])

# Blocks of the full model, one per stage
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        # Last stage: reduce the feature map to 1x1
        blk = nn.AdaptiveMaxPool2d((1,1))
    else:
        blk = down_sample_blk(128, 128)
    return blk

# Forward pass of one stage, including anchor box generation
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    # Compute this stage's feature map
    Y = blk(X)
    # Generate anchor boxes
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
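    # Return this stage's feature map,
    # the anchor boxes on it,
    # the class predictions for each anchor box,
    # and the predicted offsets from each anchor box to the ground-truth bounding box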
    return (Y, anchors, cls_preds, bbox_preds)

# Hyperparameters
sizes = [[0.2, 0.272], # stage 1
         [0.37, 0.447], 
         [0.54, 0.619], 
         [0.71, 0.79], 
         [0.88, 0.961]]  # stage 5: the largest anchors span about 96% of the image side
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1

# Full network
class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Define the blk, cls and bbox predictors of each stage,
            # equivalent to the assignment self.blk_i = get_blk(i)
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))
    # Full forward pass
    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        # 5 stages
        for i in range(5):
            # Collect the outputs of blk_forward() for each stage;
            # getattr(self, f'blk_{i}') accesses self.blk_i
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        # Return the anchors, class predictions and bbox predictions of all stages
        return anchors, cls_preds, bbox_preds

net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)

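# 5444 anchor boxes in total, each described by 4 values
#output anchors: torch.Size([1, 5444, 4])
# batch of 32, 5444 anchor boxes, 2 classes (object + background) per anchor box
#output class preds: torch.Size([32, 5444, 2])
# batch of 32, 5444 anchor boxes x 4 offsets to the ground-truth bbox = 21776 values per example
#output bbox preds: torch.Size([32, 21776])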

# Training
batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)

device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)

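# Loss functions
# The loss combines a class-prediction loss and a bounding-box offset loss

cls_loss = nn.CrossEntropyLoss(reduction='none')
# L1 loss (absolute difference): compared with squared error, anchor boxes whose
# offsets are predicted very badly do not produce disproportionately large losses
bbox_loss = nn.L1Loss(reduction='none')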

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    # The mask zeroes out the offset loss of background anchors, which have no offset to predict
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox

def cls_eval(cls_preds, cls_labels):
    # Class predictions are in the last dimension, so argmax takes dim=-1
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

# Train the model
num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Number of correct class predictions, number of class predictions,
    # sum of absolute bbox offset errors, number of bbox offset values
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
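        # Y: ground-truth bounding boxes of the objects
        # Y is not predicted directly; we predict each anchor's class and its offset to the ground-truth bbox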
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict the class and offsets of each
        anchors, cls_preds, bbox_preds = net(X)
        # Label the class and offsets of each anchor box
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Compute the loss from the predicted and labeled classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')

X = torchvision.io.read_image('../data/banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()
# Prediction
def predict(X):
    net.eval()  # evaluation mode
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    # Keep only the rows not marked -1 (background or removed by NMS)
    return output[0, idx]

output = predict(X)

def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)

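Semantic segmentation

Transposed convolution

Characteristics

Ordinary convolution does not increase the height and width of its input; they usually stay the same or are halved
Transposed convolution can be used to increase the height and width of the input

Motivation

After a stack of ordinary convolutions, the spatial resolution is compressed to a small size
Semantic segmentation, however, must output a class for every pixel, so the feature maps need to be upsampled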

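Concepts

Transposed convolution is a particular computation pattern
Convolution usually downsamples; transposed convolution usually upsamples
Its significance:
- If a convolution maps an input of shape \((h,w)\) to \((h',w')\)
- then a transposed convolution with the same hyperparameters maps an input of shape \((h',w')\) back to \((h,w)\) (it restores the shape, not the values)
Padding 0, stride 1
- Method 1
  - Pad the input with k-1 rows and columns on each side
  - Flip the kernel vertically and horizontally
  - Apply an ordinary convolution (padding 0, stride 1)
- Method 2
  - Compute the transposed convolution directly: each input element multiplies the whole kernel, and the products are summed into the output at that element's position
Padding p, stride 1
- Method 1
  - Pad the input with k-p-1 rows and columns on each side
  - Flip the kernel vertically and horizontally
  - Then apply an ordinary convolution (padding 0, stride 1)
- Method 2
  - Compute the transposed convolution directly
  - Remove p rows and columns from each border of the result (for padding=1, the outermost ring)
Padding p, stride s
- Method 1
  - Insert s-1 rows of zeros between adjacent rows and s-1 columns of zeros between adjacent columns
  - Pad the input with k-p-1 rows and columns on each side
  - Flip the kernel vertically and horizontally
  - Then apply an ordinary convolution (padding 0, stride 1)
- Method 2
  - Compute the transposed convolution directly, placing the kernel copies of adjacent input elements s positions apart
  - Remove p rows and columns from each border of the result
Shape arithmetic
- Input height (width) n, kernel k, padding p, stride s
- Transposed convolution: \(n'=sn+k-2p-s\)
- Convolution: \(n'=\lfloor (n-k+2p+s)/s \rfloor\)
- To scale the height and width by exactly s (\(n'=sn\))
  - set \(k=2p+s\)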
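To make the two computation methods concrete, here is a dependency-free sketch on nested lists. It assumes a single channel and a square kernel. `trans_conv2d` follows method 2 (scatter, then crop p). `trans_conv_via_conv` follows method 1 (insert s-1 zeros, pad k-p-1, flip, ordinary convolution). Both names are hypothetical. The loop at the end checks that the two methods agree and that the output size matches n' = sn + k - 2p - s.

```python
def trans_conv2d(X, K, stride=1, padding=0):
    """Method 2: scatter X[i][j] * K into the output, then crop `padding` from every border."""
    n_h, n_w, k_h, k_w = len(X), len(X[0]), len(K), len(K[0])
    full_h, full_w = (n_h - 1) * stride + k_h, (n_w - 1) * stride + k_w
    Y = [[0] * full_w for _ in range(full_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Input element (i, j) adds X[i][j] * K, anchored at (i*stride, j*stride)
            for a in range(k_h):
                for b in range(k_w):
                    Y[i * stride + a][j * stride + b] += X[i][j] * K[a][b]
    # Padding p removes p rows/columns from each side of the result
    return [row[padding:full_w - padding] for row in Y[padding:full_h - padding]]


def conv2d_valid(X, K):
    """Ordinary convolution (cross-correlation) with padding 0 and stride 1."""
    k_h, k_w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(k_h) for b in range(k_w))
             for j in range(len(X[0]) - k_w + 1)]
            for i in range(len(X) - k_h + 1)]


def trans_conv_via_conv(X, K, stride=1, padding=0):
    """Method 1: insert stride-1 zeros, pad by k-p-1, flip K, then convolve normally."""
    k = len(K)               # square kernel assumed
    pad = k - padding - 1    # requires padding <= k - 1
    n_h, n_w = len(X), len(X[0])
    z_h, z_w = (n_h - 1) * stride + 1 + 2 * pad, (n_w - 1) * stride + 1 + 2 * pad
    Z = [[0] * z_w for _ in range(z_h)]
    for i in range(n_h):
        for j in range(n_w):
            Z[pad + i * stride][pad + j * stride] = X[i][j]
    flipped = [row[::-1] for row in K[::-1]]  # flip vertically and horizontally
    return conv2d_valid(Z, flipped)


X_demo = [[0, 1], [2, 3]]
K_demo = [[0, 1], [2, 3]]
print(trans_conv2d(X_demo, K_demo))             # [[0, 0, 1], [0, 4, 6], [4, 12, 9]]
print(trans_conv2d(X_demo, K_demo, padding=1))  # [[4]]
for s, p in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    Y_demo = trans_conv2d(X_demo, K_demo, stride=s, padding=p)
    assert Y_demo == trans_conv_via_conv(X_demo, K_demo, stride=s, padding=p)
    assert len(Y_demo) == s * len(X_demo) + len(K_demo) - 2 * p - s  # n' = sn + k - 2p - s
```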

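Summary

Transposed convolution is a convolution with a transformed input and kernel, used to upsample
It is not the same as deconvolution in the mathematical sense (it does not invert a convolution)

Examples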


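# Hand-written version
def trans_conv(X, K):
    h, w = K.shape
    # Output shape: (n_h + k_h - 1) x (n_w + k_w - 1)
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input element scales the kernel and is added into the output
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y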


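# API version
# nn.ConvTranspose2d expects 4D tensors: (batch, channels, height, width)
X, K = X.reshape(1, 1, *X.shape), K.reshape(1, 1, *K.shape)
# Define the transposed convolution layer
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
# Set its weights (assigning a plain tensor to tconv.weight itself raises an error)
tconv.weight.data = K
# Compute
tconv(X)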

Padding

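Unlike ordinary convolution, transposed convolution applies padding to the output (ordinary convolution applies padding to the input). For example, when the padding on both sides of the height and width is 1, the first and last rows and columns are removed from the transposed convolution output.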


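# Padding
# padding=1
# Corresponding ordinary convolution: pad the transposed convolution's "output" on each side, then convolve to get back the "input"
# Transposed convolution: compute the transposed convolution of the input, then remove the padded border
# padding=1 strips one row/column from each border of the unpadded result
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False)
tconv.weight.data = K
tconv(X)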


# Stride

# The output-size formula for stride is the inverse of the one for convolution
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
tconv.weight.data = K
tconv(X)

# Multiple input and output channels: the transposed convolution restores the shape of X
X = torch.rand(size=(1, 10, 16, 16))
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape

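# True here; this holds whenever the convolution's output size needs no rounding down ((16 + 2*2 - 5) is divisible by 3)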

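Fully convolutional network (FCN)

Definition

A fully convolutional network transforms the height and width of intermediate feature maps back to the size of the input image. The channel dimension of the final output holds the class prediction for the pixel at the same spatial position.

Structure

A fully convolutional network first uses a convolutional neural network to extract image features. A 1 x 1 convolutional layer then transforms the number of channels into the number of classes, and finally a transposed convolutional layer transforms the height and width of the feature maps back to the size of the input image. As a result, the model output has the same height and width as the input image, and the output channels at each spatial position contain the class predictions for the pixel there.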
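The shape formulas from the transposed convolution section can be applied to the FCN built in the code below. Assume the ResNet-18 backbone shrinks height and width by a factor of 32 (a stride-2 stem convolution, a stride-2 max pool and three stride-2 stages). Then a 320 x 480 crop becomes a 10 x 15 feature map. The 1x1 convolution keeps that size. The transposed convolution with kernel 64, padding 16 and stride 32 satisfies k = 2p + s, so it scales the size back up by exactly 32. Both function names are hypothetical.

```python
def conv_out(n, k, p, s):
    """Output size of an ordinary convolution: floor((n - k + 2p + s) / s)."""
    return (n - k + 2 * p + s) // s


def trans_conv_out(n, k, p, s):
    """Output size of a transposed convolution: sn + k - 2p - s."""
    return s * n + k - 2 * p - s


h, w = 320, 480
# Assumption: the backbone downsamples by a total factor of 32
fh, fw = h // 32, w // 32                                              # 10, 15
print(conv_out(fh, 1, 0, 1), conv_out(fw, 1, 0, 1))                    # 10 15
print(trans_conv_out(fh, 64, 16, 32), trans_conv_out(fw, 64, 16, 32))  # 320 480
```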

Characteristics

Bilinear interpolation is commonly used to initialize the transposed convolutional layer

Code

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

# Load a pretrained ResNet-18
pretrained_net = torchvision.models.resnet18(pretrained=True)
# Drop ResNet-18's final global average pooling layer and fully connected layer
net = nn.Sequential(*list(pretrained_net.children())[:-2])

# Number of classes: 20 Pascal VOC object classes + background
num_classes = 21

# Append two final layers:
# a 1x1 convolution that reduces the channels from 512 to 21 (the number of classes)
# a transposed convolution (21 -> 21 channels) that restores height and width to the input size, e.g. 320 x 480
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                                    kernel_size=64, padding=16, stride=32))


# Bilinear upsampling kernel, used to initialize the transposed convolution's weights
def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight


# Initialize the transposed convolution layer with the bilinear kernel
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);

# Training
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

def loss(inputs, targets):
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

# Prediction
# Normalize the image as in training, classify every pixel, and keep the most likely class
# (assumes test_iter.dataset provides normalize_image, as d2l's VOC dataset does)
def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

# Map predicted class indices to VOC colors (pixel labels)
def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]
# Test on the first n test images
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

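Style transfer

Idea

The same pretrained CNN is used three times (conceptually three copies of one network): on the content image, the style image, and the synthesized image
The pixel values of the synthesized image are what training ultimately optimizes
Loss computation
- Style loss between the synthesized image and the style image
- Content loss between the synthesized image and the content image
- Total variation (TV) loss on the synthesized image itself, which reduces high-frequency noise
- The optimization objective is the weighted sum of the three losses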
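Below is a dependency-free sketch of two of the loss ingredients (the Gram matrix and the TV loss) and of the weighted sum, for one image without a batch dimension. It mirrors the `gram` and `tv_loss` formulas in the code below. The default weights copy `content_weight`, `style_weight` and `tv_weight` from that code. The list-based function signatures are assumptions made for illustration.

```python
def gram(features):
    """Gram matrix of c channels, each flattened to n values, normalized by c * n."""
    c, n = len(features), len(features[0])
    return [[sum(features[i][k] * features[j][k] for k in range(n)) / (c * n)
             for j in range(c)] for i in range(c)]


def tv_loss(img):
    """Half the sum of the mean absolute vertical and horizontal neighbor differences."""
    h, w = len(img), len(img[0])
    dv = sum(abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w)) / ((h - 1) * w)
    dh = sum(abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1)) / (h * (w - 1))
    return 0.5 * (dv + dh)


def total_loss(content_l, style_l, tv_l, content_weight=1, style_weight=1e4, tv_weight=10):
    """Weighted sum minimized with respect to the synthesized image's pixels."""
    return content_weight * content_l + style_weight * style_l + tv_weight * tv_l


print(gram([[1, 2], [3, 4]]))     # [[1.25, 2.75], [2.75, 6.25]]
print(tv_loss([[0, 1], [1, 0]]))  # 1.0
```

A constant image has zero TV loss, which is why this term pushes neighboring pixels toward similar values.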

Code

%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

d2l.set_figsize()
# Load images
# Content image
content_img = d2l.Image.open('../data/banana-detection/bananas_train/images/753.png')
d2l.plt.imshow(content_img);
# Style image
style_img = d2l.Image.open('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img);

rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])
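# Image preprocessing
# preprocess: convert a PIL image into a normalized tensor with a batch dimension
# Compose chains Resize, ToTensor and Normalize, much like nn.Sequential
# ToTensor already divides pixel values by 255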
def preprocess(img, image_shape):
    transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])
    return transforms(img).unsqueeze(0)

# postprocess
# Convert a tensor back into an image
# Clamp values to [0, 1]: values above 1 become 1, values below 0 become 0
def postprocess(img):
    img = img[0].to(rgb_std.device)
    img = torch.clamp(img.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
    return torchvision.transforms.ToPILImage()(img.permute(2, 0, 1))

# Load the pretrained network
pretrained_net = torchvision.models.vgg19(pretrained=True)

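# Pick the layers we need from the pretrained network
# VGG-19 layers [0, 5, 10, 19, 28] serve as style_layers and extract style features
# VGG-19 layer [25] serves as content_layers and extracts content features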
style_layers, content_layers = [0, 5, 10, 19, 28], [25]

# Build the network: keep VGG-19 layers up to the deepest one we need
net = nn.Sequential(*[pretrained_net.features[i] for i in
                      range(max(content_layers + style_layers) + 1)])

# Feature extraction
# Run the layers one by one and keep the outputs of the style and content layers
def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles

# Extract the features of the content and style images
def get_contents(image_shape, device):
    # Preprocess the content image into a tensor
    content_X = preprocess(content_img, image_shape).to(device)
    # Extract its content features
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, device):
    # Preprocess the style image into a tensor
    style_X = preprocess(style_img, image_shape).to(device)
    # Extract its style features
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y

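# Loss functions

# Content loss
def content_loss(Y_hat, Y):
    # Y_hat: content features of the synthesized image
    # Y: content features of the content image (contents_Y)
    # Y needs no gradient; Y.detach() could also be applied outside this function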
    return torch.square(Y_hat - Y.detach()).mean()

# Style loss
# The Gram matrix represents the style of a style layer's output
def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)
# Style loss function
def style_loss(Y_hat, gram_Y):
    return torch.square(gram(Y_hat) - gram_Y.detach()).mean()

# Total variation (TV) loss, as used in total variation denoising
# It makes neighboring pixel values similar

def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())

# Combine the three losses with weights

content_weight, style_weight, tv_weight = 1, 1e4, 10

def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Compute the content, style and total variation losses
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Sum all losses
    l = sum(styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l

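# Initialize the synthesized image

# The synthesized image is the only variable updated during training, so it is treated as the model's parameter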

class SynthesizedImage(nn.Module):
    def __init__(self, img_shape, **kwargs):
        super(SynthesizedImage, self).__init__(**kwargs)
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight

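# Initialization: create the synthesized image from X and its optimizer,
# and precompute the Gram matrices of the style layers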
def get_inits(X, device, lr, styles_Y):
    gen_img = SynthesizedImage(X.shape).to(device)
    gen_img.weight.data.copy_(X.data)
    trainer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer

# Training function
def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_decay_epoch, 0.8)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        trainer.zero_grad()
        contents_Y_hat, styles_Y_hat = extract_features(
            X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
            X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            animator.axes[1].imshow(postprocess(X))
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X

# Start training
device, image_shape = d2l.try_gpu(), (300, 450)
net = net.to(device)
content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.3, 500, 50)


https://anonymouslosty.ink/2023/06/18/深度学习计算机视觉笔记/

Author: Ling yi

Published: June 18, 2023

Updated: July 5, 2023
