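Deep Learning: Object Detection

Object detection is a computer vision task: given an image, the model outputs the location of every object in the image together with its class name. It is a typical multi-task problem, combining localization and classification.

Deep-learning-based object detection has advanced rapidly in recent years. From the original RCNN through Fast RCNN and Faster RCNN to later algorithms such as SSD and YOLO, successive methods have kept improving both detection accuracy and running speed.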

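The Object Detection Trilogy

RCNN

Although RCNN has long been surpassed and is no longer used in practice, it was the pioneering work that brought CNNs into deep-learning-based object detection, so its algorithm is still worth understanding. The structure of RCNN is described below.

Paper: https://arxiv.org/pdf/1311.2524.pdf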

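Pipeline

Data preparation

The input is the image to run detection on, together with the ground-truth boxes of the objects in it and their labels.

Region proposal generation

Regions of interest (ROIs) are extracted from the image, usually with the selective search (SS) algorithm. One image yields roughly 2k ROIs.

Feature extraction

Each ROI is passed through a CNN for feature extraction. Every candidate region is first warped to the same size, $227\times227$. The network is pretrained on ImageNet: it produces a $4096$-dimensional feature, followed by a $1000$-way fully connected classification layer, and RCNN removes that final classification layer.

Classification

The CNN features are fed to SVMs for classification. Each SVM is a binary classifier for one class (it decides whether the ROI belongs to that class or not), so a dataset with n object classes needs n SVMs. Because negative samples vastly outnumber positive ones among the ROIs, hard negative mining is used.
- Positive samples: ROIs that coincide with the ground-truth box of the class
- Negative samples: every candidate box whose IOU with the ground-truth boxes of the class is below $0.3$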
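The IOU test used to pick negative samples can be sketched in a few lines. This is a minimal illustration, not code from RCNN itself: the function names `iou` and `is_negative` and the corner format `(x1, y1, x2, y2)` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_negative(roi, gt_boxes, threshold=0.3):
    """True if the ROI overlaps every ground-truth box of the class by less than threshold."""
    return all(iou(roi, gt) < threshold for gt in gt_boxes)


# Two 2x2 boxes that share a 1x1 corner: IOU = 1 / (4 + 4 - 1)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
negative = is_negative((0, 0, 2, 2), [(1, 1, 3, 3)])
```

With these made-up boxes the overlap is 1/7, which is below the 0.3 threshold, so the ROI would count as a negative for that class.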
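
Bounding-box refinement

Detection is evaluated on two things: the IOU between the predicted box and the ground truth, and whether the predicted class is correct. The ROIs produced by SS usually differ considerably from the ground truth, so the boxes need to be refined.

A regressor is trained to improve the accuracy of the boxes. Like the SVMs, it takes the CNN features of the ROI as input; its output is the scaling and the offset of the box in the x and y directions. As with classification, one regressor is trained per class.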

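As the pioneering work in deep-learning object detection, RCNN achieved accuracy that was a breakthrough compared with traditional methods based on hand-crafted histogram features, and later detection algorithms were built on top of it.

Drawbacks of RCNN

ROIs are extracted with the SS algorithm and every region is pushed through the CNN separately, so overlapping regions are computed repeatedly, which slows down training.
Classification and box regression rely on SVMs and separate regressors, which does not fit the end-to-end style of deep learning. Each class also needs its own classifier and regressor, which requires a sufficiently large training set.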

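Fast RCNN

Fast RCNN improves on RCNN by addressing the drawbacks listed above.

The whole image is normalized and fed directly into the deep network. Proposal information is added only at a late stage, so that just the last few layers process each proposal separately.
During training, an image is fed into the network together with the proposals extracted from it, so the features of the early layers do not have to be recomputed for every proposal.
Classification and box refinement are both done inside the deep network, so no extra feature storage is needed.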

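Pipeline

To allow batched computation, images are normalized to $224 \times224$.

Feature extraction

A convolutional network: CNN + ReLU + pooling.

ROI_pooling_layer

SS is still used to propose about 2k ROIs.

Each ROI is mapped to its corresponding window on the feature map (by scaling its coordinates), and max pooling reduces every window to a fixed size.

Forward pass

Suppose the window on the feature map that corresponds to a proposal has size $h \times w$. It is divided into $H \times W$ sub-windows, each of size $h/H \times w/W$, and max pooling is applied to every sub-window, keeping only its maximum. The window is thus pooled to a fixed size of $H \times W$ (each channel is handled the same way), which turns windows of different sizes into inputs of uniform size for the next layer.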
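The forward pass of ROI max pooling can be illustrated on a single-channel feature map stored as a list of lists. This is a sketch under stated assumptions: the name `roi_max_pool`, the end-exclusive window format `(y0, x0, y1, x1)` and the floor/ceil rule for sub-window borders are choices made for this example, and real implementations may round differently.

```python
import math


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the window roi=(y0, x0, y1, x1), end-exclusive, of a 2D
    feature map (list of lists) to a fixed out_h x out_w grid."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of sub-window i; floor/ceil keeps every bin non-empty.
        ys = y0 + math.floor(i * h / out_h)
        ye = y0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            xs = x0 + math.floor(j * w / out_w)
            xe = x0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 6x6 feature map holding the values 0..35 in row-major order.
feature = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled_a = roi_max_pool(feature, (0, 0, 4, 6), 2, 2)  # 4x6 window
pooled_b = roi_max_pool(feature, (1, 1, 6, 4), 2, 2)  # 5x3 window
```

Both windows have different sizes, yet each is reduced to the same 2x2 output, which is the property the fully connected layers rely on.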

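Backward pass

For an ordinary max pooling layer, let $x_{i}$ be the $i$-th node of the input layer and $y_{j}$ the $j$-th node of the output layer. Then
$$
\begin{align*}
\frac{\partial L}{\partial x_{i}}=
\begin{cases}
0 & \text{if } \delta(i,j)=\text{false} \\
\frac{\partial L}{\partial y_{j}} & \text{if } \delta(i,j)=\text{true}
\end{cases}
\end{align*}
$$
The indicator function $\delta(i,j)$ states whether input node $i$ was selected as the maximum by output node $j$.

For ROI max pooling, one input node may be connected to several output nodes. Let $x_{i}$ be a node of the input layer and $y_{rj}$ the $j$-th output node of the $r$-th proposal. Then
$$
\begin{align*}
\frac{\partial L}{\partial x_{i}}=\sum_{r}\sum_{j}\delta(i,r,j)\frac{\partial L}{\partial y_{rj}}
\end{align*}
$$
The indicator function $\delta(i,r,j)$ states whether node $i$ was selected as the maximum by the $j$-th output node of the $r$-th ROI.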
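
Fully connected layers

The pooled features have a fixed shape $[r,H,W,Dims]$. They are flattened and passed through fully connected layers as fixed-length feature vectors of shape $[r,H \times W \times Dims]$, one for each of the $r$ ROIs.

Classification + box prediction

cls_score is the class prediction produced after the fully connected layers. It has $k+1$ dimensions, the probabilities of the $k$ object classes plus the background class, computed with $softmax$.

bbox_predict is the box prediction produced after the fully connected layers. It has $4 \times k$ dimensions: one predicted box per object class.

bbox_targets is the regression target, i.e. the ground-truth box. It has dimension $4 \times 1$ and gives the box of the object contained in the current ROI.

bbox_loss_weights marks which class each predicted box belongs to; the loss is computed only for the box whose class matches the ground-truth class.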

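Classification and its loss function

Suppose the dataset has $K$ classes. The input is the fixed-length vector $[r,H \times W \times Dims]$ from the fully connected layers, and the output is $[r,K+1]$: scores for $K+1$ classes, the $K$ object classes plus $1$ background class, computed by a fully connected layer.

Loss function

The loss is computed from the true class label $label$ and the predicted scores ${pred}$ after a softmax. With $u$ the true class and $label$ its one-hot encoding:
$$
\begin{align*}
L_{cls}(label,{pred})&=-\log(\mathrm{softmax}({pred})_{u}) \\
&=-\sum_{k}label_{k}\log(\mathrm{softmax}({pred})_{k})
\end{align*}
$$
Explanation: suppose the dataset contains $100$ object classes. For a single ROI the classifier output has shape $[1, (100+1)]$, covering the $100$ object classes and $1$ background class. If the ROI contains one object, its label is a one-hot vector such as $[1,0,0,…]$: the entry at the true class is $1$ and all other entries are $0$. The $softmax$ function maps the predicted scores to probabilities in $[0,1]$ that sum to $1$.

The $softmax$ function is
$$
\sigma(x_{j})=\frac{e^{x_{j}}}{\sum_{k=1}^{K}e^{x_{k}}}
$$
where the sum runs over all entries of the output vector. Classification thus computes the probability of the prediction over all classes. The loss is obtained by taking the logarithm of these probabilities, multiplying by the one-hot true $label$, and negating the sum.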
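As a small worked example of the classification loss, the sketch below computes a softmax and the cross-entropy against a one-hot label using only the standard library. The function names and the three scores are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, scores):
    """-sum_k label_k * log(softmax(scores)_k)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


scores = [2.0, 1.0, 0.1]  # raw scores for 3 classes (illustrative values)
label = [0, 1, 0]         # one-hot label: the true class is index 1
probs = softmax(scores)
loss = cross_entropy(label, scores)
```

Because the label is one-hot, only the log-probability of the true class survives the sum, so `loss` equals `-log(probs[1])`.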

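Box prediction

The label of every box consists of four coordinates, $box=[x,y,h,w]$ (the $x,y$ coordinates of the top-left corner plus the height and width of the ground-truth box). One box is predicted per object class, so the output has shape $[r,K \times 4]$.

Loss function

Let $v=[v_{x},v_{y},v_{w},v_{h}]$ be the ground-truth box for the true class $u$, and $t^{u}=[t^{u}_{x},t^{u}_{y},t^{u}_{w},t^{u}_{h}]$ the box predicted for class $u$. The localization loss is
$$
L_{loc}=\sum_{i \in\lbrace x,y,w,h\rbrace}smooth_{L_{1}}(t_{i}^{u}-v_{i})
$$
where
$$
\begin{align*}
smooth_{L_{1}}(x)=
\begin{cases}
0.5x^{2} & \text{if } |x|<1 \\
|x|-0.5 & \text{otherwise}
\end{cases}
\end{align*}
$$
This loss is applied to the box offsets. Compared with the L2 loss it is less sensitive to outliers, and it bounds the magnitude of the gradient so that training is less likely to diverge. In the accompanying figure, for the same input x, the orange curve is the L2 loss and the blue curve is the smooth L1 loss.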
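The difference between the two losses is easy to see numerically. The following sketch implements the smooth L1 function and the localization loss as a plain sum over the four box coordinates; the names `smooth_l1`, `l2` and `loc_loss` and the sample values are illustrative assumptions.

```python
def smooth_l1(x):
    """0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def l2(x):
    """Plain squared error, for comparison."""
    return x * x


def loc_loss(pred_box, true_box):
    """Sum of smooth L1 over the four box coordinates."""
    return sum(smooth_l1(t - v) for t, v in zip(pred_box, true_box))


small_error = smooth_l1(0.5)   # quadratic regime
large_error = smooth_l1(-3.0)  # linear regime
example = loc_loss([0.5, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

For an error of 3 the squared loss is 9 while smooth L1 is only 2.5, which is why a single outlier box influences the gradient far less.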

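One point needs clarification. For a single ROI the regression output is a vector of shape $[1,4 \times K]$ holding one predicted box for every class, whereas the label of the ROI contains only the class of its object and the position of that object. The box loss therefore uses only the part of the output that belongs to the true class of the ROI. For example, if the ROI contains one object of class 1 and the dataset has $100$ classes, the predicted output has shape $[1,4 \times 100]$, but only the $4$ values that belong to class 1 enter the localization loss; the rest of the output is ignored.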

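Code implementation

import tensorflow as tf  # written against the TensorFlow 1.x API
import tensorflow.contrib.slim as slim

# drop7 is the output of the last shared fully connected layer and class_num
# is the number of classes including background; both are defined earlier
# in the network.
rois = tf.placeholder(tf.int32, [None, 5], name='rois')
# Each label row holds class_num one-hot entries followed by 4 box values for
# each of the class_num-1 object classes, i.e. class_num*5-4 values in total.
y_true = tf.placeholder(tf.float32, [None, class_num*5-4], name='labels')
logits = slim.fully_connected(drop7, class_num, activation_fn=tf.nn.softmax, scope='fc_8')
bbox = slim.fully_connected(drop7, (class_num-1)*4,
                            activation_fn=None, scope='fc_9')
cls_pred = logits
# Label layout: [cls_1,cls_2,...,cls_nums,box_1_1,box_1_2,box_1_3,....]
# The class label comes first, followed by the ground-truth box of each class.
cls_true = y_true[:, :class_num]
bbox_pred = bbox
bbox_true = y_true[:, class_num:]

# Classification loss: renormalize, clip for numerical safety, then cross-entropy.
cls_pred /= tf.reduce_sum(cls_pred,
                          reduction_indices=len(cls_pred.get_shape()) - 1,
                          keep_dims=True)
cls_pred = tf.clip_by_value(cls_pred, tf.cast(1e-10, dtype=tf.float32), tf.cast(1. - 1e-10, dtype=tf.float32))
cross_entropy = -tf.reduce_sum(cls_true * tf.log(cls_pred), reduction_indices=len(cls_pred.get_shape()) - 1)
cls_loss = tf.reduce_mean(cross_entropy)
tf.losses.add_loss(cls_loss)
tf.summary.scalar('class-loss', cls_loss)

# Box loss: the mask repeats each object-class indicator 4 times so that only
# the 4 box values of the true class contribute. Index 0 (background) is skipped.
# Note that this snippet uses a squared error, not the smooth L1 loss described above.
mask = tf.tile(tf.reshape(cls_true[:, 1], [-1, 1]), [1, 4])
for cls_idx in range(2, class_num):
    mask = tf.concat([mask, tf.tile(tf.reshape(cls_true[:, int(cls_idx)], [-1, 1]), [1, 4])], 1)
bbox_sub = tf.square(mask * (bbox_pred - bbox_true))
bbox_loss = tf.reduce_mean(tf.reduce_sum(bbox_sub, 1))
tf.losses.add_loss(bbox_loss)
tf.summary.scalar('bbox-loss', bbox_loss)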

Code links:

https://github.com/Liu-Yicheng/Fast-RCNN

https://github.com/rbgirshick/fast-rcnn/blob/master/tools/train_net.py

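Tricks

Parameter initialization

A model pretrained on ImageNet is used, with its last fully connected layer removed.

Hierarchical sampling

During fine-tuning, each mini-batch first takes $N$ full images and then $R$ proposals sampled from those $N$ images. The $R$ proposals share the features computed by the first 5 stages of the network on the $N$ images.
In practice $N=2$ and $R=128$.

Training data augmentation

Each of the $N$ full images is flipped horizontally with probability $50\%$.

Faster fully connected layers

For an ordinary fully connected layer with input $x$, output $y$ and a weight matrix $W$ of size $u \times v$, the computational cost is $u \times v$:
$$
y=Wx
$$
Apply SVD to the weight matrix $W$ and keep the $t$ largest singular values:
$$
W\approx U \Sigma_{t}V^{T}
$$
The sizes are $U: u \times t$, $\Sigma_{t}: t\times t$ and $V: v \times t$, so the fully connected computation becomes
$$
y=Wx\approx U(\Sigma_{t}V^{T})x
$$
The cost is now $u \times t +v \times t$. One fully connected layer has been split into two lower-dimensional ones, which saves computation and speeds up the layer whenever $t$ is much smaller than $\min(u,v)$.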
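The saving from the truncated SVD is simple arithmetic, sketched below. The layer sizes are illustrative values chosen for this example and are not taken from the text; the function names are likewise hypothetical.

```python
def fc_cost(u, v):
    """Multiply-adds of a dense layer y = W x with W of size u x v."""
    return u * v


def truncated_svd_fc_cost(u, v, t):
    """Multiply-adds after W ~ U (Sigma_t V^T): a v -> t layer followed by a t -> u layer."""
    return t * v + u * t


u, v, t = 4096, 4096, 256  # illustrative sizes only
dense = fc_cost(u, v)
factored = truncated_svd_fc_cost(u, v, t)
speedup = dense / factored
```

With these numbers the factored layer needs one eighth of the multiply-adds. The gain shrinks as $t$ grows and disappears once $t(u+v)$ reaches $uv$.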

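Faster RCNN

Faster RCNN inherits from Fast RCNN and improves mainly on ROI generation. Fast RCNN generates ROIs with selective search, whereas Faster RCNN introduces a convolutional network for that purpose, the region proposal network (RPN), which turns the whole detection process into a single end-to-end network that can be viewed as RPN + Fast RCNN combined. The RPN also introduced the concept of anchors, an idea later borrowed by SSD and YOLO.

#### Pipeline

The overall computation can be summarized as follows.

The image is normalized and resized to a fixed size of $800 \times 600$.
Features are extracted from the image with a CNN. This part uses CNN_3X3_SAME+MAX_POOL_2_2_VALID: every convolution kernel is $3 \times 3$ with padding=SAME, so a convolution does not change the height and width of its input, and the depth of the output feature map is determined by the number of kernels. Max pooling uses a $2 \times 2$ window with stride $2$ along height and width and $1$ along depth. The feature extractor contains $4$ max pooling layers, so for an input of shape $[N,H,W,C]$ the downsampled feature map has shape $[N, {H}\div {2^{4}}, {W} \div {2^{4}},D]$.
The feature map obtained above is then used for ROI extraction. Faster RCNN uses an RPN for this; see the detailed RPN section below.
After the RPN has extracted candidate regions, it performs classification and box regression on them.

Classification
- The RPN only decides whether a region contains an object or not, i.e. a binary foreground/background classification.

Box regression
- Computes the offset between the region proposed by the RPN and the ground-truth box.

Because the RPN proposes a large number of ROIs, the foreground regions are filtered before they go through the subsequent second round of classification and regression.
The second-stage classification and regression work as in Fast RCNN.

Classification
- Predicts the object class of each extracted region.

Regression
- Regresses the box once more to refine the detection further.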

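RPN

The network structure is shown in the figure rpn_structure.

In rpn_structure, the input image has shape $[N,H,W,C]$. After feature extraction with a convolutional network (VGG16), the feature map has shape $[N,H/16,W/16,D]$ with $D=512$, the number of kernels of the last convolution layer. From this feature map two branches compute classification and box regression.

Classification
- RPN classification scores rpn_cls_score, $size=[N,H,W,2A]$ (here $H$ and $W$ are the height and width of the feature map)
  
  For every cell of the feature map, predicts the probability that it contains an object.
  
  The feature map goes through a $1 \times 1$ convolution with $2A$ kernels, where $A$ is the number of anchor boxes generated per cell, giving an output of shape $[N,H,W,2A]$ (rpn_cls_score in the figure). This convolution can be seen as the prediction step itself: with $2A$ kernels, each anchor gets two scores, which later become its foreground and background probabilities.
- Anchor box generation
  
  Every cell of the feature map generates $A$ anchor boxes. $A=9$, corresponding to three scales times three aspect ratios, which is a form of multi-scale prediction.
  - Classification targets rpn_labels, $size=[H\times W\times A,1]$
    
    The label stating whether each anchor box generated on the feature map is an object.
    
    The anchor boxes are filtered by their IOU with the input ground-truth boxes: an anchor with IOU below $0.3$ is treated as background and one with IOU above $0.7$ as foreground. In rpn_labels, background is set to $0$ and foreground to $1$.
  - Regression targets rpn_bbox_targets, $size=[H \times W \times A,4]$
    
    Records the offset between each anchor box and its ground truth; this offset is what the regression loss is computed against.
    
    Let the anchor box have coordinates $[x_{al},y_{al},x_{ar},y_{ar}]$ (its top-left and bottom-right corners) and the ground-truth box $[x_{l},y_{l},x_{r},y_{r}]$. The width, height and center of a box are
    $$
    \begin{align}
    width&=x_{r}-x_{l}+1.0 \\
    height&=y_{r}-y_{l}+1.0 \\
    center_{x}&=x_{l}+0.5 \times width \\
    center_{y}&=y_{l}+0.5 \times height
    \end{align}
    $$
    The offsets between an anchor box and its ground truth are then
    $$
    \begin{align}
    target_{x}&=(groundtruth_{center_{x}}-anchor_{center_{x}}) / anchor_{width}\\
    target_{y}&=(groundtruth_{center_{y}}-anchor_{center_{y}}) / anchor_{height}\\
    target_{w}&=\log(\frac{groundtruth_{width}}{anchor_{width}})\\
    target_{h}&=\log(\frac{groundtruth_{height}}{anchor_{height}})
    \end{align}
    $$

These offsets serve as the regression targets for the loss.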
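The target computation can be sketched as follows. The example normalizes the center offsets by the anchor size, which is the convention under which the proposal decoding formulas later in these notes invert the encoding exactly; treat that choice, the function names and the sample boxes as assumptions of this sketch.

```python
import math


def corner_to_center(box):
    """(x_l, y_l, x_r, y_r) -> (center_x, center_y, width, height), with the +1.0 pixel convention."""
    width = box[2] - box[0] + 1.0
    height = box[3] - box[1] + 1.0
    return box[0] + 0.5 * width, box[1] + 0.5 * height, width, height


def encode(anchor, gt):
    """Regression targets (t_x, t_y, t_w, t_h) of a ground-truth box relative to an anchor."""
    ax, ay, aw, ah = corner_to_center(anchor)
    gx, gy, gw, gh = corner_to_center(gt)
    return ((gx - ax) / aw,
            (gy - ay) / ah,
            math.log(gw / aw),
            math.log(gh / ah))


anchor = (0.0, 0.0, 99.0, 99.0)   # a 100 x 100 anchor
gt = (10.0, 20.0, 209.0, 119.0)   # a 200 x 100 ground-truth box
targets = encode(anchor, gt)
```

Here the ground truth is shifted by 60 and 20 pixels and is twice as wide as the anchor, so the targets come out as roughly (0.6, 0.2, log 2, 0).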

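Regression loss weights

rpn_bbox_inside_weights: the weight of each anchor, $1$ for foreground and $0$ for background.

rpn_bbox_outside_weights: background anchors far outnumber foreground ones, and this variable balances their contributions. Foreground anchors get weight $1$; background anchors get weight $0$ and do not count toward the regression loss.

Regression

For every cell of the feature map, predicts the box coordinates of its anchor boxes.

As in the RPN classification branch, a $1 \times 1$ convolution is used, this time with $4A$ kernels. Each cell of the feature map generates $A$ anchors and each anchor gets $4$ values, which are the predicted box coordinates of the object.

This completes the classification and regression terms for the RPN anchor boxes. The names used in the figure are:
- rpn_label: $1$ if the anchor is foreground, $0$ if it is background
- rpn_cls_score: the predicted foreground and background probabilities of the anchor; the classification loss is computed with softmax_logit
- rpn_box_target: the offset between the anchor and its ground truth
- rpn_box_inside_weight, rpn_bbox_outside_weights: the weight variables
- rpn_box_pred: the predicted box position of the anchor; the box offset loss is computed with $Smooth_{L1}$, whose formula is given in the Fast RCNN section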
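How rpn_label could be filled from the IOU thresholds described above (below 0.3 background, above 0.7 foreground) is sketched here. The label `-1` for anchors that fall between the two thresholds is an assumption of this example, as are the function names and the sample boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, bg_thresh=0.3, fg_thresh=0.7):
    """1 = foreground, 0 = background, -1 = in between (assumed to be ignored)."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > fg_thresh:
            labels.append(1)
        elif best < bg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


anchors = [(0, 0, 10, 10), (0, 0, 10, 12), (50, 50, 60, 60), (0, 0, 10, 20)]
gt_boxes = [(0, 0, 10, 10)]
labels = label_anchors(anchors, gt_boxes)
```

The four anchors have IOUs of 1.0, about 0.83, 0.0 and 0.5 with the single ground-truth box, so they are labelled foreground, foreground, background and in-between respectively.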

ROI_POOLING

Obtaining anchor boxes from the RPN

The input of this block is the set of anchor boxes from the RPN. Because there are very many proposals and a large share of them are negatives, proposal boxes are built from the RPN output. The boxes $[t_x,t_y,t_w,t_h]$ predicted by the trained RPN are offsets (the RPN is trained against the offsets between anchor boxes and ground truth), and they are combined with the anchor boxes $[x_{al},y_{al},x_{ar},y_{ar}]$ to produce proposal boxes $[x_p,y_p,w_p,h_p]$: each anchor box is corrected by its predicted offset, and the corrected anchor box is the proposal box.

An original anchor box stores $[x_{al},y_{al},x_{ar},y_{ar}]$, from which we get
$$
\begin{align}
anchor_{width}&=x_{ar}-x_{al}+1.0 \\
anchor_{height}&=y_{ar}-y_{al}+1.0 \\
center_{x}&=x_{al}+0.5 \times anchor_{width} \\
center_{y}&=y_{al}+0.5 \times anchor_{height}
\end{align}
$$
Adding the offsets predicted by the RPN gives the proposal box
$$
\begin{align}
xl_{proposal}&=t_{x} \times anchor_{width}+center_{x}-0.5 \times e^{t_{w}} \times anchor_{width} \\
yl_{proposal}&=t_{y} \times anchor_{height}+center_{y}-0.5 \times e^{t_{h}} \times anchor_{height} \\
xr_{proposal}&=t_{x} \times anchor_{width}+center_{x}+0.5 \times e^{t_{w}} \times anchor_{width}\\
yr_{proposal}&=t_{y} \times anchor_{height}+center_{y}+0.5 \times e^{t_{h}} \times anchor_{height}
\end{align}
$$
After this transformation we have the corrected anchor boxes, which serve as the initial proposals.
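The decoding step can be written directly from the four formulas. The sketch below is illustrative only; the function name `decode` and the sample anchor and offsets are assumptions.

```python
import math


def decode(anchor, offsets):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor (x_l, y_l, x_r, y_r)."""
    width = anchor[2] - anchor[0] + 1.0
    height = anchor[3] - anchor[1] + 1.0
    center_x = anchor[0] + 0.5 * width
    center_y = anchor[1] + 0.5 * height
    t_x, t_y, t_w, t_h = offsets
    # Shift the center by a fraction of the anchor size, scale the size exponentially.
    new_cx = t_x * width + center_x
    new_cy = t_y * height + center_y
    new_w = math.exp(t_w) * width
    new_h = math.exp(t_h) * height
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


anchor = (0.0, 0.0, 99.0, 99.0)  # a 100 x 100 anchor centered at (50, 50)
proposal = decode(anchor, (0.6, 0.2, math.log(2.0), 0.0))
```

The offsets move the center to (110, 70) and double the width, giving a proposal of approximately (10, 20, 210, 120). Zero offsets leave the center and size of the anchor unchanged.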

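The number of anchor boxes is $H \times W \times A$ (the height and width of the CNN feature map times the number of anchors per cell). To reduce the amount of computation, several filters are applied:
- Sort by the foreground probability from the RPN and keep the top pre_nms_topN proposal boxes
- Discard boxes below a minimum area
- Apply non-maximum suppression (NMS) with an IOU threshold of $0.7$, dropping every box that overlaps a higher-scoring box by more than that

After these filters, each image is left with a fixed number of proposal boxes, say n.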
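A greedy NMS pass over scored boxes could look like the sketch below. It is a plain-Python illustration with assumed names (`iou`, `nms`) and made-up boxes, not the implementation used in the referenced repository.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

The first two boxes overlap with an IOU of about 0.82, so the lower-scoring one is suppressed, while the third box is far away and survives.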
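
Fixing the proposal box size

Proposal boxes come in different sizes, but the subsequent classification and regression use fully connected layers. ROI pooling therefore resizes every proposal box to the same size, exactly as in the Fast RCNN model.

Classification + regression

The pooled proposals go through fully connected layers for classification and regression; classification uses softmax and regression uses smooth L1.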

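Summary

Faster RCNN uses the RPN for a first round of ROI extraction, in which it performs both classification and box regression:

Classification predicts the probability that an anchor box is foreground or background.
Box regression predicts the offset between an anchor box and its ground truth.

Training the RPN yields a first set of predicted boxes together with their foreground and background probabilities.

Using the anchor box offsets and foreground probabilities from the RPN, the proposals fed to ROI pooling are selected by foreground probability, NMS, minimum area and similar criteria.

ROI pooling brings the proposals to a uniform size, after which fully connected layers predict the object class and refine the boxes a second time.

Because boxes are extracted and classified twice, Faster RCNN is very accurate, but this also costs a lot of computation, so it is not very fast.

Code reference: https://github.com/xiaofengShi/CV/tree/master/Faster-RCNN_TF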

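SSD Object Detection

SSD (Single Shot MultiBox Detector) combines the multi-scale anchor idea of Faster RCNN with YOLO's approach of detecting with a single deep network. Instead of generating anchor boxes on one fixed feature map as Faster RCNN does, it generates predicted boxes on several downsampled feature maps, merges the class predictions and box regressions of all these feature maps, and computes one overall loss for the whole model.

The core idea of SSD: low-level feature maps have a small receptive field and are used to detect small objects, while high-level feature maps have a large receptive field and are used to detect large objects. In practice this has a weakness: low-level convolutional features carry little semantic information and help the subsequent classification only slightly, so the improvement on small objects is limited.

Paper: https://arxiv.org/abs/1512.02325

Code: https://github.com/xiaofengShi/CV/tree/master/SDC-vehicle-dection

The network structure is shown in the figure.

SSD network parameter settings, with code from https://github.com/xiaofengShi/CV/blob/master/SDC-vehicle-dection/nets/ssd_vgg_300.py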

"""
Implementation of the SSD VGG-based 300 network.
The default features layers with 300x300 image input are:
    conv4 ==> 38 x 38
    conv7 ==> 19 x 19
    conv8 ==> 10 x 10
    conv9 ==> 5 x 5
    conv10 ==> 3 x 3
    conv11 ==> 1 x 1
The default image size used to train this network is 300x300.
"""
default_params = SSDParams(
            img_shape=(300, 300), # image size
            num_classes=8, # number of classes to predict
            no_annotation_label=9, # label index used for "no annotation"
            # feature layers used to generate anchors and to predict
            feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
            # feature map shapes
            feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
            # relative size of the default boxes on the lowest and the highest layer
            anchor_size_bounds=[0.15, 0.90],
            # anchor box sizes (min_size, max_size)
            anchor_sizes=[(21., 45.),
                          (45., 99.),
                          (99., 153.),
                          (153., 207.),
                          (207., 261.),
                          (261., 315.)],
            # anchor box ratios for each feature layer
            anchor_ratios=[[2, .5],
                           [2, .5, 3, 1./3],
                           [2, .5, 3, 1./3],
                           [2, .5, 3, 1./3],
                           [2, .5],
                           [2, .5]],
            # anchor_steps: input image size divided by the feature map size
            anchor_steps=[8, 16, 32, 64, 100, 300],
            anchor_offset=0.5,
            # normalize a feature map if its value is bigger than 0, otherwise not
            normalizations=[20, -1, -1, -1, -1, -1],
            prior_scaling=[0.1, 0.1, 0.2, 0.2])

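As the SSD model structure shows, VGG16 is used as the backbone that produces the feature layers; the code below speaks for itself. The network is built with slim, a lightweight library bundled with TensorFlow that speeds up development. Once you are comfortable with TensorFlow the slim API is worth using, although its documentation is incomplete. For representative convolutional models such as VGG16, TensorFlow models already ships ready-made, high-quality implementations: https://github.com/tensorflow/models/tree/master/research/slim/nets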

Aside: convolution output size formulas

With padding=SAME:
$$
\begin{align}
height_{out}&=ceil[height_{input} \div height_{stride}] \\
width_{out}&=ceil[width_{input} \div width_{stride}]
\end{align}
$$
With padding=VALID:
$$
\begin{align}
height_{out}&=ceil[(height_{input}-height_{kernel}+1) \div height_{stride}] \\
width_{out}&=ceil[(width_{input}-width_{kernel}+1) \div width_{stride}]
\end{align}
$$
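These formulas are enough to trace the spatial size of the SSD300 feature maps. The sketch below assumes SAME padding for the four stride-2 pooling layers and for the stride-2 convolutions of block8 and block9, which is what the feature sizes 38, 19, 10 and 5 quoted in these notes imply; the function name `conv_out_size` is hypothetical.

```python
import math


def conv_out_size(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


size = 300
trace = []
for _ in range(4):   # pool1..pool4: 2x2 window, stride 2, SAME (assumed)
    size = conv_out_size(size, 2, 2, "SAME")
    trace.append(size)
for _ in range(2):   # block8, block9: 3x3 conv, stride 2, SAME
    size = conv_out_size(size, 3, 2, "SAME")
    trace.append(size)
for _ in range(2):   # block10, block11: 3x3 conv, stride 1, VALID
    size = conv_out_size(size, 3, 1, "VALID")
    trace.append(size)
```

The trace ends in 38, 19, 10, 5, 3 and 1, matching feat_shapes. With VALID pooling the third pooling layer would give 37 instead of 38, which is why SAME is assumed here.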

def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """
    SSD net definition.
    """
    # End_points collect relevant activations for external use.
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        # Original VGG-16 blocks.
        # two conv_3x3_1_64 and one max_pool_2x2_2
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        # feature size [n,150,150,64] after maxpool
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        
        # Block 2.
        # two conv_3x3_1_128 and one max_pool_2x2_2
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        # feature size [n,75,75,128] after maxpool
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        
        # Block 3.
        # three conv_3x3_1_256 and one max_pool_2_2
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        # feature size [n,38,38,256] aft maxpool
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        
        # Block 4.
        # three conv_3x3_1_512 and one max_pool_2_2
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net # predict feature layer [n,38,38,512]
        # feature size [n,19,19,512] aft maxpool
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        
        # Block 5.
        # three conv_3x3_1_512 and one max_pool_3_1 
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net 
        # feature size [n,19,19,512] aft maxpool
        net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')

        # Additional SSD blocks.
        # Block 6: dilated (atrous) convolution.
        # one conv_3x3_1_1024, rate is the dilation rate to use for atrous convolution
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net  # feature size is [n,19,19,1024]
        # Block 7: 1x1 convolution.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net # feature size is [n,19,19,1024]

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        # input feature size is [n,19,19,1024]
        # out feature size is [n,10,10,512]
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3')
        end_points[end_point] = net
        end_point = 'block9'
        # input feature size is [n,10,10,512]
        # out feature size is [n,5,5,256]
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3')
        end_points[end_point] = net
        end_point = 'block10'
        # input feature size is [n,5,5,256]
        # out feature size is [n,3,3,256] 
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        # input feature size is [n,3,3,256]
        # out feature size is [n,1,1,256]
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # copy parameters from above code
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        # feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
        # anchor_sizes=[(21., 45.),(45., 99.),(99., 153.),(153., 207.),(207., 261.),(261., 315.)],
        # anchor_ratios=[[2, .5],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5],[2, .5]],
        # normalizations=[20, -1, -1, -1, -1, -1]
        # num_classes=8
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                # predict the class and location for each feature layer
                # p is the class prediction
                # l is the location prediction
                # for each cell in a feature layer, use conv to generate
                #   cls: k_anchor_percell * num_classes
                #   loc: k_anchor_percell * 4
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
            predictions.append(prediction_fn(p))
            logits.append(p) # predict class 
            localisations.append(l) # predict location

        return predictions, localisations, logits, end_points

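conv6 in the code uses a dilated (atrous) convolution, which enlarges the receptive field of a convolution without adding parameters: net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6') has a $3 \times 3$ kernel and a dilation rate of $6$. In the figure, (a) is an ordinary convolution with a $3 \times 3$ receptive field; (b) has dilation rate $2$ (one hole between kernel taps) and, applied on top of (a), a $7 \times 7$ receptive field; (c) has dilation rate $4$ (three holes between taps) and, applied on top of (b), a $15 \times 15$ receptive field.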
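The receptive field figures can be checked with a little arithmetic. The sketch assumes the usual convention in which a dilation rate of 1 means an ordinary convolution and stacked stride-1 layers each add their effective kernel size minus one; the function names are hypothetical.

```python
def effective_kernel(kernel, rate):
    """Span covered by a dilated kernel: rate - 1 holes are inserted between taps."""
    return kernel + (kernel - 1) * (rate - 1)


def stacked_receptive_field(kernel, rates):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for rate in rates:
        field += effective_kernel(kernel, rate) - 1
    return field


conv6_span = effective_kernel(3, 6)                 # the conv6 layer of SSD
field_a = stacked_receptive_field(3, [1])           # ordinary convolution
field_b = stacked_receptive_field(3, [1, 2])        # plus a rate-2 layer
field_c = stacked_receptive_field(3, [1, 2, 4])     # plus a rate-4 layer
```

A $3 \times 3$ kernel with rate 6 spans 13 pixels per side while still having only 9 weights per input/output channel pair, which is the sense in which the receptive field grows without extra parameters.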

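Anchor_box_generator

SSD predicts from several feature layers and generates default boxes on each of them. The generation resembles Faster RCNN: anchor boxes are centered on every cell of every feature map. Because feature layers of several sizes are used for multi-scale prediction, the paper specifies how the default boxes are generated.

Number of default boxes

With the SSD architecture above, $6$ feature layers of different sizes are used for prediction. A feature layer has $h \times w$ center points (h and w being its height and width), and each center point produces k default boxes, with $k=[4,6,6,6,4,4]$ for the $6$ layers. The total number of default boxes is $8732$:

$$
\begin{align}
TotalBoxNums=&38\times38\times4+19\times19\times6+10\times10\times6+5\times5\times6+3\times3\times4+1\times1\times4 \\
=&8732
\end{align}
$$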
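The total can be reproduced from the model parameters. The sketch below uses the feat_shapes and anchor_ratios values quoted in these notes and assumes two square boxes per cell in addition to one box per aspect ratio.

```python
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_ratios = [[2, .5],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5],
                 [2, .5]]

# Two square boxes (min_size and sqrt(min_size * max_size)) plus one box per ratio.
boxes_per_cell = [2 + len(ratios) for ratios in anchor_ratios]
boxes_per_layer = [h * w * k for (h, w), k in zip(feat_shapes, boxes_per_cell)]
total_boxes = sum(boxes_per_layer)
```

Most of the default boxes come from the first layer: its $38 \times 38$ grid alone contributes 5776 of the 8732 boxes.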

# feature layer infos to generate anchors and predict 
feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
# feature shape 
feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
anchor_ratios=[[2, .5],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5],[2, .5]],

The center points are computed as
$$
\begin{align}
center_{x}&=\frac{i+0.5}{|feature_{k}|} \\
center_{y}&=\frac{j+0.5}{|feature_{k}|} \\
&\text{where } i,j \in[0,|feature_{k}|)
\end{align}
$$
where $|feature_{k}|$ is the size of the $k$-th feature layer.

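Computing min_size and max_size for each feature layer

These correspond to anchor_sizes in the SSD model parameters, which stores the size parameters of the default boxes. Each layer's entry holds two values, min_size and max_size.

anchor_sizes=[(21., 45.),(45., 99.),(99., 153.),(153., 207.),(207., 261.),(261., 315.)],

According to the paper, the scale of the default boxes grows linearly:
$$
s_{k}=s_{min}+\frac{s_{max}-s_{min}}{m-1}(k-1), \quad k \in[1,m]
$$
Here $m$ is the number of feature maps. In this SSD setup $m=5$, because the first layer, block4, is configured separately. $s_{k}$ is the size of the default box relative to the input image, and $s_{min}$ and $s_{max}$ are the smallest and largest ratios, given in the SSD parameters as anchor_size_bounds=[0.15, 0.90]=[s_min,s_max]. For the first feature map the paper generally sets the ratio to $s_{min}/2=0.1$ (with $s_{min}=0.2$), but the code uses $0.07$. For the remaining feature maps the default boxes grow linearly following the formula above. The step between consecutive feature layers is $floor(\frac{s_{max}-s_{min}}{m-1}\times100)=18$, so the remaining feature layers have ratios of $15,33,51,69,87$ percent of the image size. For a $300 \times 300$ input this gives default box sizes of $45,99,153,207,261$, which are the min_size values in anchor_sizes. Each max_size is the corresponding min_size plus one step. The computation in code: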

import math

# image size
min_dim = 300
# feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
mbox_source_layers = ['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
# in percent %
# Initial values of s_min and s_max (the paper uses 0.2 and 0.9, this code
# uses 0.15 and 0.9); min_sizes and max_sizes follow from them.
min_ratio = 15
max_ratio = 90
# step between consecutive layers; block4 is handled separately, hence the -2
step = int(math.floor((max_ratio - min_ratio) / (len(mbox_source_layers) - 2)))
min_sizes = []
max_sizes = []
# ratio takes every step=18 value from min_ratio up to max_ratio: 15, 33, 51, 69, 87
for ratio in range(min_ratio, max_ratio + 1, step):
    min_sizes.append(min_dim * ratio / 100.)
    max_sizes.append(min_dim * (ratio + step) / 100.)
# prepend the sizes of the first layer
min_sizes = [min_dim * 7 / 100.] + min_sizes
max_sizes = [min_dim * 15 / 100.] + max_sizes
aspect_size = list(zip(min_sizes, max_sizes))

Running the code above gives

min_sizes=[21.0, 45.0, 99.0, 153.0, 207.0, 261.0]
max_sizes=[45.0, 99.0, 153.0, 207.0, 261.0, 315.0]
aspect_size=[(21.0, 45.0),(45.0, 99.0),(99.0, 153.0),(153.0, 207.0),(207.0, 261.0),(261.0, 315.0)]

This is the aspect_size given in the SSD model parameters. Code usually hard-codes the precomputed aspect_size without showing where it comes from, which is why the computation is worked through in detail here.

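Computing the default box sizes for each prediction feature layer

The boxes are computed from the aspect_size obtained above and the configured default box ratios anchor_ratios.

# (min_size, max_size) of each feature layer
aspect_size=[(21.0, 45.0),(45.0, 99.0),(99.0, 153.0),(153.0, 207.0),(207.0, 261.0),(261.0, 315.0)]
# anchor box ratios for each feature layer
anchor_ratios=[[2, .5],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5],[2, .5]],

Each center point generates two square default boxes from min_size and max_size, with side lengths

Small square: min_size
Large square: sqrt(min_size*max_size)

For every configured aspect ratio, one rectangular box is generated, with height and width

Rectangle height: min_size/sqrt(aspect_ratio)
Rectangle width: min_size*sqrt(aspect_ratio)

Anchor box generation is illustrated in the figure.

Given aspect_size and anchor_ratios, the number of anchor boxes per cell on a feature layer is the number of entries in that layer's aspect_size pair plus the number of its anchor_ratios, i.e. 2 + len(anchor_ratios[i]). The code that generates the anchor boxes for one layer is: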

import math
import numpy as np

def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Compute the anchor centers (y, x) and relative sizes (h, w) of one feature layer."""
    # Anchor centers of the current feature layer, relative to the image size.
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]

    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)

    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor box with ratio=1 (small square, side min_size).
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    # Large square with side sqrt(min_size * max_size).
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    # Rectangles, one per aspect ratio.
    for i, r in enumerate(ratios):
        h[i + di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i + di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return y, x, h, w

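Ground truth processing

During training the label information (ground_truth_box, ground_truth_category) is preprocessed and mapped onto the default boxes generated above. Default boxes are selected by their jaccard overlap with the ground_truth_box: in the code, default boxes with a jaccard overlap above $0.5$ are positives and the rest are negatives. The strategy is:

Rule 1: for every ground_truth_box, the default box with the highest IOU is a positive and is matched to it;
Rule 2: among the remaining default boxes, any box whose IOU with some ground_truth_box exceeds 0.5 is also a positive;

One ground truth can match several positive default boxes, but not the other way round: a default box matches only one ground truth. If several ground truths have an IOU above the threshold with a default box, the box is matched to the ground truth with the highest IOU.

Special case: if the highest IOU of some ground truth is below the threshold, and the default box it would be matched to has an IOU above the threshold with another ground truth, then that default box should still be matched to the former, which guarantees that every ground truth has at least one matching default box. This situation is unlikely, because the generated anchor boxes cover all regions of the image, so the code implements only the second rule.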
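Both matching rules, including the special case, fit in a short plain-Python sketch. Note that this differs from the TensorFlow code that follows in these notes, which implements only the second rule; the function names, the `-1` marker for unmatched boxes and the sample boxes are assumptions of the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, the index of its matched ground truth, or -1 (negative).
    Assumes at least one ground-truth box."""
    overlaps = [[iou(d, g) for g in gt_boxes] for d in default_boxes]
    matches = [-1] * len(default_boxes)
    # Rule 2: match a box to its best ground truth if the IOU exceeds the threshold.
    for d, row in enumerate(overlaps):
        best_g = max(range(len(gt_boxes)), key=lambda g: row[g])
        if row[best_g] > threshold:
            matches[d] = best_g
    # Rule 1: every ground truth claims its best default box, even below the threshold.
    for g in range(len(gt_boxes)):
        best_d = max(range(len(default_boxes)), key=lambda d: overlaps[d][g])
        matches[best_d] = g
    return matches


default_boxes = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30), (40, 40, 44, 44)]
gt_boxes = [(0, 0, 10, 10), (38, 38, 50, 50)]
matches = match(default_boxes, gt_boxes)
```

The second ground truth overlaps its best default box with an IOU of only about 0.11, yet rule 1 still assigns that box to it, so no ground truth is left without a positive.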

# Encode labels and bboxes against the anchors of one feature layer
def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """
    Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

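    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      num_classes: Number of classes; labels >= num_classes are skipped;
      no_annotation_label: Label marking boxes without annotation;
      ignore_threshold: Overlap above which anchors covering a
        no-annotation box get a score of -1;
      prior_scaling: Scaling of encoded coordinates.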

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
      feat_localizations:shape is [h,w,k_box_percell,4]
      feat_labels: shape is [h,w,k_box_percell]
      feat_scores: shape is [h,w,k_box_percell]
    """
    # Anchors coordinates and volume.
    # anchors_layer is the output of ssd_anchor_one_layer above
    yref, xref, href, wref = anchors_layer
    ymin = yref - href / 2. # top-left y
    xmin = xref - wref / 2. # top-left x
    ymax = yref + href / 2. # bottom-right y
    xmax = xref + wref / 2. # bottom-right x
    vol_anchors = (xmax - xmin) * (ymax - ymin) # area

    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size) # (h, w, num_anchors)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)
    # each of shape (h, w, num_anchors)
    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    # Jaccard overlap (IOU) between one box and all anchors
    def jaccard_with_anchors(bbox):
        """
        Compute jaccard score a box and the anchors.
        """
        # Intersection bbox and volume.
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)

        # Volumes.
        inter_vol = h * w
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)
        return jaccard
    
    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)
        return scores
    # Loop condition
    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        #tf.less函数 Returns the truth value of (x < y) element-wise.
        r = tf.less(i, tf.shape(labels))
        return r[0]
    # Loop body
    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]
        bbox = bboxes[i]
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels
        feat_scores = tf.where(mask, jaccard, feat_scores)

        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        interscts = intersection_with_anchors(bbox)
        mask = tf.logical_and(interscts > ignore_threshold,
                              label == no_annotation_label)
        # Replace scores by -1.
        feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]
    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])   
    # Transform to center / size.
    # Center and size of the matched ground-truth boxes
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features. 
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    # feat_localizations shape is [h,w,k_box_percell,4]
    # feat_labels shape is [h,w,k_box_percell]
    # feat_scores shape is [h,w,k_box_percell]
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    return feat_labels, feat_localizations, feat_scores


# Ground truth encoding function
def tf_ssd_bboxes_encode(labels,  # ground truth labels, 1D tensor
                         bboxes,  # N×4 Tensor (float)
                         anchors,  # anchors, as a list
                         matching_threshold=0.5,  # matching threshold
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],  # scaling factors
                         dtype=tf.float32,
                         scope='ssd_bboxes_encode'):
    """Encode groundtruth labels and bounding boxes using SSD net anchors.
    Encoding boxes for all feature layers.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors: List of Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores):
        Each element is a list of target Tensors.
    """
    with tf.name_scope(scope):
        target_labels = []
        target_localizations = []
        target_scores = []
        for i, anchors_layer in enumerate(anchors):
            with tf.name_scope('bboxes_encode_block_%i' % i):
                # Encode the labels and bboxes for this layer.
                t_labels, t_loc, t_scores = \
                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,
                                               matching_threshold, prior_scaling, dtype)
                target_labels.append(t_labels)
                target_localizations.append(t_loc)
                target_scores.append(t_scores)
        return target_labels, target_localizations, target_scores

hard_negative_mining

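Although one ground truth box can be matched to several prior boxes, ground truth boxes are still far fewer than prior boxes, so negative samples greatly outnumber positive ones. To keep the two roughly balanced, SSD uses hard negative mining: it sorts the negatives in descending order of confidence loss (the lower the predicted background confidence, the larger the loss) and keeps only the top-k with the largest loss as training negatives, so that the ratio of positives to negatives stays close to $1:3$. This is reported to improve results by about $4\%$.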
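The selection step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the SSD reference implementation: the function name `hard_negative_mask` and its arguments are hypothetical, and the loss of a negative is assumed to be $-\log$ of its predicted background probability.

```python
import numpy as np

def hard_negative_mask(background_prob, positive_mask, neg_pos_ratio=3):
    """Pick the hardest negatives so that negatives : positives is at most neg_pos_ratio : 1.

    background_prob: (N,) predicted background probability for each default box.
    positive_mask:   (N,) boolean, True for boxes matched to a ground truth.
    """
    background_prob = np.asarray(background_prob, dtype=float)
    positive_mask = np.asarray(positive_mask, dtype=bool)
    negative_idx = np.flatnonzero(~positive_mask)
    n_neg = min(neg_pos_ratio * int(positive_mask.sum()), negative_idx.size)
    # A low background probability on a negative box means a large loss (a hard example).
    loss = -np.log(np.clip(background_prob[negative_idx], 1e-12, 1.0))
    hardest = negative_idx[np.argsort(-loss)[:n_neg]]
    mask = np.zeros_like(positive_mask)
    mask[hardest] = True
    return mask

probs = np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.5, 0.3])
pos = np.array([True, False, False, False, False, False, False])
selected = hard_negative_mask(probs, pos)  # keeps the 3 negatives with the lowest background probability
```

With one positive box, only the three negatives that the model is least sure are background take part in the loss; the remaining negatives are ignored.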

Data_augmentation

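To make the model more robust to different input sizes and object shapes, the authors randomly sample each training image in one of the following ways:

Use the entire image
Sample a patch whose IoU with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9 (each patch covers [0.1, 1] of the original image size, with an aspect ratio in [1/2, 2])
Randomly sample a patch
When the center of a ground truth box lies inside the sampled patch, the overlapping part of that box is kept. After these sampling steps, each patch is resized to a fixed size and flipped horizontally with probability $0.5$. Experiments show that this data augmentation raises mAP by $8.8\%$.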

LOSS

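For every matched default box on every feature layer, a classification loss and a box regression loss are computed. The objective function is
$$
L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))
$$
where $N$ is the number of matched default boxes (positive samples); if $N=0$, the loss is set to $0$. $L_{conf}$ is the classification (confidence) loss, $L_{loc}$ is the Smooth L1 loss between the predicted box $l$ and the ground truth $g$, and $\alpha$ is a balancing weight.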

The Smooth L1 function is
$$
\begin{align*}
smooth_{L_{1}}(x)=
\begin{cases}
0.5|x|^{2} & \text{if } |x|<1 \\
|x|-0.5 & \text{otherwise}
\end{cases}
\end{align*}
$$
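As a quick check of the piecewise definition, here is a minimal NumPy version. The function name `smooth_l1` is only illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise; the two pieces meet at |x| = 1."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

values = smooth_l1([0.5, -2.0, 1.0])  # 0.125, 1.5, 0.5
```

The quadratic region keeps gradients small near zero, while the linear region limits the influence of large regression errors.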

$$
\begin{align*}
L_{loc}(x,l,g)=\sum_{i \in Pos} \sum_{m \in \lbrace cx,cy,w,h \rbrace}x_{ij}^{k}\,smooth_{L1}(l_{i}^{m}-\hat{g}_{j}^{m})
\end{align*}
$$

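Here $\hat{g}$ is the ground truth encoded relative to its matched default box, $l$ is the predicted box, $cx, cy$ are the center of the default box $d$ after offset compensation, and $w, h$ are the width and height of the default box.
$$
\begin{align}
\hat{g}_{j}^{cx}&=(g_{j}^{cx}-d_{i}^{cx})/d_{i}^{w} & \hat{g}_{j}^{cy}&=(g_{j}^{cy}-d_{i}^{cy})/d_{i}^{h} \\
\hat{g}_{j}^{w}&=\log\left(\frac{g_{j}^{w}}{d_{i}^{w}}\right) & \hat{g}_{j}^{h}&=\log\left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)
\end{align}
$$
The loss therefore shrinks the gap between the encoded ground truth and the prediction, so at inference time the network output must be decoded back to obtain the position of the box in the image.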
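The encoding and its inverse can be written out directly. The sketch below follows the standard SSD center/size encoding and leaves out the `prior_scaling` factors used in the TensorFlow code; the names `encode_box` and `decode_box` are hypothetical.

```python
import numpy as np

def encode_box(g, d):
    """Encode a ground truth box g relative to a default box d, both given as (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return np.array([(g_cx - d_cx) / d_w,
                     (g_cy - d_cy) / d_h,
                     np.log(g_w / d_w),
                     np.log(g_h / d_h)])

def decode_box(t, d):
    """Invert encode_box: turn encoded offsets t back into a (cx, cy, w, h) box."""
    t_cx, t_cy, t_w, t_h = t
    d_cx, d_cy, d_w, d_h = d
    return np.array([t_cx * d_w + d_cx,
                     t_cy * d_h + d_cy,
                     d_w * np.exp(t_w),
                     d_h * np.exp(t_h)])

default_box = (0.5, 0.5, 0.2, 0.4)
ground_truth = (0.55, 0.45, 0.3, 0.2)
target = encode_box(ground_truth, default_box)
restored = decode_box(target, default_box)  # equals ground_truth up to rounding
```

At training time the network regresses towards `target`; at inference time `decode_box` is applied to the network output with the same default box.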

The classification loss is usually the softmax cross-entropy loss:
$$
L_{conf}(x,c)=-\sum_{i \in Pos}^{N}x_{ij}^{p}\log(\hat{c}_{i}^{p})-\sum_{i \in Neg}\log(\hat{c}_{i}^{0}) \quad \text{where} \quad \hat{c}_{i}^{p}=\frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}
$$
The corresponding code:

def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               scope='ssd_losses'):
    """Loss functions for training the SSD 300 VGG network.

    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.

    Arguments:
      logits: (list of) predictions logits Tensors;
      localisations: (list of) localisations Tensors;
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope):
        l_cross_pos = []
        l_cross_neg = []
        l_loc = []
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold
                fpmask = tf.cast(pmask, dtype)
                n_positives = tf.reduce_sum(fpmask)

                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)

                # Negative mask.
                no_classes = tf.cast(pmask, tf.int32)
                predictions = slim.softmax(logits[i])
                nmask = tf.logical_and(tf.logical_not(pmask),
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)
                nvalues = tf.where(nmask,
                                   predictions[:, :, :, :, 0],
                                   1. - fnmask)
                nvalues_flat = tf.reshape(nvalues, [-1])
                # Number of negative entries to select.
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)

                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
                minval = val[-1]
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)
                fnmask = tf.cast(nmask, dtype)

                # Add cross-entropy loss.
                with tf.name_scope('cross_entropy_pos'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    loss = tf.contrib.losses.compute_weighted_loss(loss, fpmask)
                    l_cross_pos.append(loss)

                with tf.name_scope('cross_entropy_neg'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.contrib.losses.compute_weighted_loss(loss, fnmask)
                    l_cross_neg.append(loss)

                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = alpha * fpmask
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
                    loss = tf.contrib.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)

        # Total losses in summaries...
        with tf.name_scope('total'):
            tf.summary.scalar('cross_entropy_pos', tf.add_n(l_cross_pos))
            tf.summary.scalar('cross_entropy_neg', tf.add_n(l_cross_neg))
            tf.summary.scalar('cross_entropy', tf.add_n(l_cross_pos + l_cross_neg))
            tf.summary.scalar('localization', tf.add_n(l_loc))

FPN

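In a convolutional network, different depths correspond to different levels of features: shallow layers have high resolution and mostly learn fine detail, while deep layers have low resolution and mostly learn semantics. SSD predicts objects of a different scale on each layer, but its low-level feature maps carry little semantic information and give limited help for small objects. The semantic information of the different layers is not shared, yet every layer is forced to learn the same semantics.

The main challenges of multi-scale object detection are:

How to learn multi-scale feature representations with strong semantic information
How to design a general feature representation that serves several sub-problems of object detection, such as proposal, box localization and instance segmentation
How to compute multi-scale feature representations efficiently

FPN proposes a structure that first downsamples and then upsamples, adds lateral (skip) connections between downsampled and upsampled feature maps of the same size, and runs detection on the fused features.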

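Paper: https://arxiv.org/abs/1612.03144

The network architecture is shown below.

As the architecture diagram shows, the model first downsamples along a bottom-up pathway; the paper uses ResNet as the backbone. Consecutive layers whose feature maps have the same size are grouped into one stage, and the output of the last layer of each stage is taken as that stage's feature, so ResNet yields 4 feature layers of different sizes.

The top-down pathway starts from the last bottom-up feature layer and upsamples it to enlarge its height and width. Lateral connections merge the bottom-up and top-down feature layers of the same size. To reduce the aliasing effect introduced by upsampling, a conv3x3 is applied to each merged result, producing the final feature layers used for detection.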
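The top-down merge can be illustrated on plain arrays. This sketch makes several simplifying assumptions: nearest-neighbour upsampling, element-wise addition as the merge, all maps already having the same channel count, and no 1x1 lateral or 3x3 smoothing convolutions. The function names are hypothetical.

```python
import numpy as np

def upsample2x(feature):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)

def top_down_merge(bottom_up):
    """bottom_up: list of (H, W, C) maps ordered from high to low resolution.

    Returns the merged pyramid in the same order.
    """
    merged = [bottom_up[-1]]  # start from the coarsest map
    for lateral in reversed(bottom_up[:-1]):
        # Upsample the previous merged level and add the same-size bottom-up map.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]

c2 = np.full((8, 8, 4), 1.0)
c3 = np.full((4, 4, 4), 2.0)
c4 = np.full((2, 2, 4), 3.0)
pyramid = top_down_merge([c2, c3, c4])  # shapes (8, 8, 4), (4, 4, 4), (2, 2, 4)
```

Each output level keeps the resolution of its bottom-up input while also carrying information propagated down from every coarser level.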

FPN-RPN

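FPN still uses an RPN to generate regions. In Faster R-CNN the RPN takes a single feature map from one convolutional layer of the backbone, so it sees only one scale. In FPN the RPN is applied to every prediction layer, and each scale level is assigned its own anchor size: $32^2$, $64^2$, $128^2$, $256^2$ and $512^2$. Every level also uses three aspect ratios, $1:2$, $1:1$ and $2:1$, which gives $15$ different anchors in total.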
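The 15 anchor shapes can be enumerated as follows. This is a sketch under one assumption: the aspect ratio is read as height divided by width, with the anchor area held at the square of the level size. `fpn_anchor_shapes` is a hypothetical helper.

```python
import numpy as np

def fpn_anchor_shapes(sizes=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return {size: [(w, h), ...]} where w * h == size**2 and h / w == ratio."""
    shapes = {}
    for s in sizes:
        shapes[s] = [(s / np.sqrt(r), s * np.sqrt(r)) for r in ratios]
    return shapes

anchor_shapes = fpn_anchor_shapes()
n_anchors = sum(len(v) for v in anchor_shapes.values())  # 5 levels x 3 ratios
```

Each pyramid level uses only the three shapes for its own size, so the number of anchors per location stays at three.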

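Positive and negative samples are assigned much as in Faster R-CNN: an anchor is positive if it has the highest IoU with some ground truth box or its IoU with any ground truth box exceeds $0.7$, and negative if its IoU with every ground truth box is below $0.3$.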
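The assignment rule can be sketched with a small IoU matrix. Boxes are assumed to be `(x1, y1, x2, y2)`, and anchors that are neither positive nor negative are marked `-1` (ignored), which is a common convention rather than something stated above. The helper names are hypothetical.

```python
import numpy as np

def iou_matrix(anchors, gts):
    """Pairwise IoU between anchors and ground truth boxes, shape (num_anchors, num_gts)."""
    a = np.asarray(anchors, dtype=float)[:, None, :]
    g = np.asarray(gts, dtype=float)[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Return 1 for positive, 0 for negative and -1 for ignored anchors."""
    iou = iou_matrix(anchors, gts)
    best = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best < neg_thr] = 0
    labels[best > pos_thr] = 1
    labels[iou.argmax(axis=0)] = 1  # the best anchor of each ground truth is always positive
    return labels

anchors = [[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30], [40, 40, 50, 46]]
gts = [[0, 0, 10, 10], [40, 40, 50, 50]]
labels = label_anchors(anchors, gts)
```

The last anchor has an IoU of only 0.6 with the second ground truth, yet it is still positive because it is that ground truth's best match.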

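Summary

The paper compares several design variants of FPN and confirms that this structure improves detection accuracy. FPN combines the high resolution of low-level features with the strong semantics of high-level features by fusing features from different layers, and it predicts on each fused layer separately, so it can be viewed as an improvement over SSD.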

YOLOV1~V3

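As a tribute to the author, here are his homepage and TED talk.

YOLO is now in its third version, and I personally consider it the best object detection algorithm at the moment. I should have written this up long ago, but procrastination kept me from starting until now.

Since its birth, YOLO has carried two labels:

Fast
Weak at detecting small objects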

YOLOv1

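The network architecture is shown below.

YOLOv1 predicts only from the output of the last convolutional layer, through fully connected layers. This runs into the familiar problem of convolutional networks: as layers are stacked, the feature layer carries more and more semantic information, but small objects leave almost no trace in the high-level layers and are therefore hard to recognize.

The feature map of the last layer has size $S\times S$, which can be seen as dividing it into an $S\times S$ grid. If the center of an object's ground truth box falls into a grid cell, that cell is responsible for detecting the object. Each cell predicts B bounding boxes, each with a confidence score, plus C class probabilities. A bounding box is $(x,y,w,h)$: the offset of the object center relative to the cell, and the width and height, all normalized. The confidence reflects both whether the box contains an object and how accurate its position is when it does. It is defined as $P_{r}(Object) \times IOU_{pred}^{truth}$, so for a box that contains an object it equals the IoU between the predicted box and the ground truth.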
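A small sketch of the grid assignment and of the per-cell output size. The defaults `S=7`, `B=2` and `C=20` are assumptions taken from the usual Pascal VOC configuration of YOLOv1; the text above does not state them. Both function names are hypothetical.

```python
def responsible_cell(cx, cy, S=7):
    """Map a normalized object center (cx, cy in [0, 1)) to its grid cell.

    Returns ((row, col), (x_offset, y_offset)) with offsets measured inside the cell.
    """
    col, row = int(cx * S), int(cy * S)
    return (row, col), (cx * S - col, cy * S - row)

def output_depth(B=2, C=20):
    """Each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities."""
    return B * 5 + C

cell, offset = responsible_cell(0.5, 0.25)  # cell (1, 3), offsets (0.5, 0.75)
depth = output_depth()                      # 30 values per cell under these assumptions
```

Only the cell that contains the object center is trained to predict that object, which is one reason nearby small objects compete for the same cell.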

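The whole network has $24$ convolutional layers and $2$ fully connected layers. Compared with Faster R-CNN from the same period it is far faster, although its accuracy is also far behind. Still, this design made real-time object detection possible, and YOLOv1 can fairly be called the ancestor of real-time detectors. Because its accuracy is too low, deployments in autonomous driving generally use the FPN described above instead.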

YOLOv2

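The year after YOLOv1, the author applied a series of clever tricks to create YOLOv2, also known as YOLO9000.

Dimension Clusters
- YOLOv2 introduces anchors. The bounding boxes of the VOC and COCO datasets are clustered, and the usual set of anchors with 3 sizes and 3 aspect ratios is pruned to the 5 most frequent anchor shapes.
**Direct location prediction**

YOLOv2 replaces the location regression used by Faster R-CNN with a new way of computing position offsets.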

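In Faster R-CNN, location regression works as follows:

The actual center $(x,y)$ of a bounding box is computed from the predicted offsets $(t_x,t_y)$, the size of the prior box $(w_a,h_a)$ and its center $(x_a,y_a)$:
$$
\begin{align}
x&=(t_x \times w_a)+x_a \\
y&=(t_y \times h_a)+y_a
\end{align}
$$
These formulas are unconstrained, so a predicted box can shift in any direction: with $t_x=1$ the box moves right by one prior box width, and with $t_x=-1$ it moves left by one prior box width. A box predicted at any location can therefore end up anywhere in the image, which makes the model unstable and means it takes a long time to converge to correct offsets.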

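YOLOv2 does not use this scheme. It keeps the YOLOv1 approach of predicting the offset of the box center relative to the top-left corner of its cell. To keep the center inside the current cell, and since each cell can be treated as having width and height 1, the offsets are passed through a sigmoid so that they lie in $(0,1)$. Given the values $(t_x,t_y,t_w,t_h)$ predicted for an anchor, the actual position and size of the bounding box are:
$$
\begin{align}
b_x&=\sigma (t_x)+c_x \\
b_y&=\sigma(t_y)+c_y \\
b_w&=p_w \exp(t_w) \\
b_h&=p_h \exp (t_h)
\end{align}
$$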
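Here $(c_x,c_y)$ is the top-left corner of the current cell and $\sigma$ is the sigmoid function. Every cell has size 1 in this computation, so in the figure YOLOv2_loca the top-left corner of the current cell is $(1,1)$. Because of the sigmoid, the box center is constrained to stay inside the current cell; this is the point $(b_x,b_y)$ in the figure. $p_w,p_h$ are the width and height of the prior box, scaled relative to the feature map, in which every cell has width and height 1. Write the feature map size as $(W,H)$ (in the paper the input image is $416 \times 416$ and is downsampled by a factor of $32$, giving a $13 \times 13$ feature map). A small change to formulas (38–41) then gives the position and size of the box relative to the whole image (all four values lie between $0$ and $1$):
$$
\begin{align}
b_x&=(\sigma (t_x)+c_x)/W \\
b_y&=(\sigma(t_y)+c_y)/H \\
b_w&=(p_w \exp(t_w))/W \\
b_h&=(p_h \exp (t_h))/H
\end{align}
$$
Multiplying these $4$ values by the image width and height gives the final position and size of the bounding box. This is the YOLOv2 decoding process.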
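The decoding formulas translate directly into code. In this sketch the function name and argument layout are hypothetical; the prior size is given in cell units and the grid defaults to $13 \times 13$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_yolov2(t, cell, prior, grid=(13, 13)):
    """Decode one prediction into a box normalized to the image.

    t:     (tx, ty, tw, th) raw network outputs
    cell:  (cx, cy) top-left corner of the cell, in cell units
    prior: (pw, ph) prior box size, in cell units
    grid:  (W, H) feature map size
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    W, H = grid
    bx = (sigmoid(tx) + cx) / W   # center stays inside the cell
    by = (sigmoid(ty) + cy) / H
    bw = pw * np.exp(tw) / W      # size scales the prior box
    bh = ph * np.exp(th) / H
    return bx, by, bw, bh

box = decode_yolov2((0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(2.0, 3.0))
```

With all raw outputs at zero, the box is centered in the middle of cell (6, 6), which is the image center, and keeps the size of the prior box.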
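A new base network: DarkNet19
- This mainly addresses the heavy computation of the VGG base network, 30.69 billion operations. The new network needs 5.58 billion operations without lowering accuracy.

Fine-Grained Features: small object detection
- To make up for YOLOv1's low accuracy on small objects, YOLOv2 changes the design. The input image is $416\times 416$ and DarkNet downsamples it to a $13 \times 13$ feature map. That size is enough for larger objects, but small objects are better detected from shallow features, whose higher resolution suits them. YOLOv2 therefore adds a passthrough layer, similar to the shortcut in ResNet or the skip connection in FPN, which connects the $26 \times 26 \times 512$ feature map to the final $13 \times 13 \times 1024$ feature map. The passthrough layer samples every other point of the $26 \times 26\times 512$ feature map along rows and columns, producing 4 feature maps of $13 \times 13 \times 512$, and concatenates them along the channel dimension. This turns $26 \times 26 \times 512$ into $13 \times 13 \times 2048$: the spatial size shrinks by a factor of $4$ and the number of channels grows by a factor of $4$. The result can then be concatenated with the final $13 \times 13 \times 1024$ feature map, giving a $13 \times 13 \times 3072$ feature map on which the convolutional prediction is made.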
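The passthrough operation is a space-to-depth rearrangement and can be checked with array slicing. The order in which the four sub-maps are concatenated is an assumption here and may differ from the Darknet implementation; `passthrough` is a hypothetical name.

```python
import numpy as np

def passthrough(x):
    """Rearrange (H, W, C) into (H/2, W/2, 4C) by sampling every other row and column."""
    return np.concatenate(
        [x[0::2, 0::2], x[0::2, 1::2], x[1::2, 0::2], x[1::2, 1::2]], axis=-1)

fine = np.zeros((26, 26, 512), dtype=np.float32)     # higher-resolution feature map
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # final feature map
fused = np.concatenate([passthrough(fine), coarse], axis=-1)  # shape (13, 13, 3072)
```

No values are discarded: every element of the $26 \times 26$ map ends up in one of the four channel groups.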

YOLOv3

Reference

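Finally we reach YOLOv3, which clearly improves the recognition of small objects. A borrowed figure comparing YOLOv2 and YOLOv3 shows this improvement on small objects plainly.

Detecting overlapping objects is hard in object detection, and here too YOLOv3 improves clearly over YOLOv2.

Following the paper and my own understanding, YOLOv3 tries the following tricks on top of YOLOv2:

To handle overlapping objects, multi-label classification replaces the earlier single-label softmax;
The backbone uses a more effective residual network and is also deeper;
Multi-scale features follow the idea of FPN;
The anchors are clustered into 9 groups.

The YOLOv3 model structure is shown below.

The corresponding model parameters are as follows.

Each point is explained below: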

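Multi-label classification

Overlapping objects are unavoidable in detection, and one region may contain several different objects. Earlier detection networks take the ground truth with the largest IoU with an anchor as its class and use softmax as the activation, which assumes that each region contains a single object. That assumption does not hold in many common detection scenes.

To address this, YOLOv3 introduces multi-label classification: the softmax function is replaced by the logistic function, so each class gets its own sigmoid classifier. With softmax, every class score lies in $[0,1]$ and the scores sum to 1. With sigmoid, every class score still lies in $[0,1]$, but the scores no longer sum to 1.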
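The difference is easy to see on a toy logit vector. The values below are made up for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.5, -3.0])   # hypothetical scores for three classes
single_label = softmax(logits)        # scores compete and sum to 1
multi_label = sigmoid(logits)         # each score is independent
active = multi_label > 0.5            # the first two classes pass the threshold together
```

Under softmax the two strong classes share the probability mass, whereas with independent sigmoids both can be reported for the same box.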

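Although YOLOv3 changes the activation of the output layer, it still matches anchors to ground truth the way YOLOv1 does: each ground truth is matched to exactly one anchor, the one with the largest IoU. At output time, however, the class probabilities no longer sum to 1, so an anchor is emitted as a detection whenever its confidence exceeds the threshold.

The multi-label model in YOLOv3 is very effective on images with heavy overlap. As the figures above show for overlapping objects, YOLOv3 is not only more accurate but also detects more of the overlapping people.
basenet

YOLOv3 uses a fully convolutional backbone built from ResNet-style residual blocks, with 53 layers in total. The network structure is shown below.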

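Multi-scale detection

YOLOv2 uses a passthrough layer to detect small objects. YOLOv3, inspired by FPN, predicts at multiple scales: it uses the feature layers of the last three sizes, and each cell predicts three bounding boxes.

The input image is $416 \times 416$ and the feature layers used for prediction have sizes $[13 \times 13, 26 \times 26, 52 \times 52]$. Prediction first runs on the $13 \times 13$ layer with a conv1x1, with 3 bounding boxes per cell. The $13 \times 13$ layer is then upsampled to $26 \times 26$ and merged with the $26 \times 26$ feature layer from the downsampling path; the merged layer is used for prediction, again with a conv1x1. That layer is in turn upsampled and merged with the $52 \times 52$ feature layer from the downsampling path before the last prediction. The total number of bounding boxes in the network is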
$$
(13 \times 13+26 \times 26+52 \times 52) \times 3=10647
$$

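The output at each position of a prediction layer has size $1 \times 1 \times (B \times (5 +C))$, where $B$ is the number of predicted boxes per cell, "5" stands for the 4 box coordinates plus 1 objectness score, and $C$ is the number of object classes in the detection task. Within each cell's output the values are ordered as follows.

Each bounding box is described by the vector $[center_{x},center_{y},width,height,object_{score},cls_{1},cls_{2},\dots,cls_{C}]$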
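The box count and output depth can be reproduced with simple arithmetic. The default `C=80` is an assumption (the COCO class count); the text above keeps $C$ generic. `yolov3_output_shapes` is a hypothetical helper.

```python
def yolov3_output_shapes(input_size=416, strides=(32, 16, 8), B=3, C=80):
    """Return the (H, W, depth) of each prediction layer and the total number of boxes."""
    shapes = [(input_size // s, input_size // s, B * (5 + C)) for s in strides]
    total_boxes = sum(h * w * B for h, w, _ in shapes)
    return shapes, total_boxes

shapes, total_boxes = yolov3_output_shapes()
# shapes: (13, 13, 255), (26, 26, 255), (52, 52, 255); total_boxes: 10647
```

The total depends only on the input size, the strides and $B$, so changing the number of classes alters the depth of each layer but not the number of boxes.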

Anchor clustering

YOLOv2 obtains 5 anchors by clustering, while YOLOv3 uses 9 anchors. With a $416 \times 416$ input, the image is downsampled by factors of $32$, $16$ and $8$, giving feature maps of $13\times 13$, $26 \times 26$ and $52 \times 52$, with the following anchor sizes:
- $13 \times 13$: $116 \times 90, \quad 156 \times 198, \quad 373 \times 326$
- $26 \times 26$: $30 \times 61, \quad 62 \times 45, \quad 59 \times 119$
- $52 \times 52$: $10 \times 13, \quad 16 \times 30, \quad 33 \times 23$
The larger feature layers are assigned the smaller anchors, which are used to detect small objects.

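Summary

From YOLOv1 to YOLOv3, the series first achieved a leap in speed, then adopted the anchor mechanism from RPN to raise accuracy, and then borrowed the multi-scale mechanism of FPN to improve small object detection, at which point it decisively overtook the R-CNN family. Thanks to its strong performance, YOLO is widely applicable in industry. The author's interest now seems to have shifted to GANs, so the YOLO series will probably not see a major update in the near future.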