Convolutional Neural Network
This is a note on CNN model architectures.
- For an introduction to what a Convolutional Neural Network (CNN) is, I recommend reading this first.
Image Classification
- There are pre-trained models trained on ImageNet in keras.applications.
- Pre-trained models are useful for transfer learning.
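A minimal transfer-learning sketch with a `keras.applications` backbone (the 10-class head and input size here are placeholder assumptions):

```python
import tensorflow as tf

# Load an ImageNet-pre-trained backbone without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the backbone; train only the new head

# Attach a new head for a hypothetical 10-class task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```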
Inception (GoogLeNet)
- Inception module: convolutions of various kernel sizes applied in parallel, with the resulting feature maps concatenated.
- Useful in localization
- Main Idea:
Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.
- Use 1x1 convolution to reduce dimension.
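A sketch of one Inception module in Keras; the filter counts are free hyperparameters (GoogLeNet's 3a block uses 64/96/128/16/32/32), and the 1x1 convs before the 3x3/5x5 branches do the dimension reduction:

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3r, f3, f5r, f5, fp):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and pooling, concatenated.
    f3r, f5r, fp are the 1x1 dimension-reduction filter counts."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3r, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5r, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])  # various sizes in parallel
```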
InceptionV2
(Figures: Factorized Convolution | Asymmetric Convolution)
- General Design Principles:
- Avoid representational bottlenecks: extreme compression of the representation causes information loss.
- Increase the activations per tile: higher-dimensional representations are easier to process locally and train faster.
- Spatial aggregation can be done over lower-dimensional embeddings with little loss, since adjacent units are strongly correlated.
- Balance the width and the depth of the network.
- Factorization of convolutions:
- Replace 5x5 conv by two 3x3 convs. (Exploit translation invariance again)
- Replace 3x3 conv by 3x1 and 1x3 convs. (nxn \(\rightarrow\) nx1 and 1xn)
- Works well on \(m \times m\) feature maps (\(m = 12\sim 20\)); see the sketch at the end of this section.
- Efficiently reduce grid size:
- Convolution and pooling are done in parallel (both with stride 2) and concatenated back.
- Label-smoothing regularization (LSR): prevents the model from assigning full confidence to a single label
- x / y = training example / label
- k = label index over K classes, u(k) = a fixed distribution (e.g. uniform: u(k) = 1/K)
- Original label distribution: \(q(k|x) = \delta_{k,y}\)
- Smoothed label distribution: \(q'(k|x) = (1 - \epsilon)\delta_{k,y} + \epsilon u(k)\)
- Cross-entropy loss = \(H(q', p) = (1 - \epsilon)H(q, p) + \epsilon H(u, p)\)
- \(H(u, p)\) can be captured by KL divergence.
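A worked sketch of the smoothed distribution with a uniform \(u(k)\); the numbers follow directly from the formula above:

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    # q'(k|x) = (1 - eps) * delta_{k,y} + eps * u(k), with u(k) = 1/K uniform
    K = y_onehot.shape[-1]
    return (1.0 - epsilon) * y_onehot + epsilon / K

y = np.array([[0., 0., 1., 0.]])  # true class y = 2, K = 4
print(smooth_labels(y))           # [[0.025 0.025 0.925 0.025]]
```

Keras exposes the same idea via the `label_smoothing` argument of `tf.keras.losses.CategoricalCrossentropy`.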
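The factorized and asymmetric convolutions from the list above might look like this sketch (filter counts and activations are assumptions):

```python
from tensorflow.keras import layers

def factorized_5x5(x, filters):
    # Two stacked 3x3 convs cover a 5x5 receptive field with fewer weights.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def asymmetric_3x3(x, filters):
    # nxn factorized into 1xn followed by nx1 (here n = 3).
    x = layers.Conv2D(filters, (1, 3), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (3, 1), padding="same", activation="relu")(x)
```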
ResNet
- Residual operation for reusing feature maps.
- Motivation: deeper plain networks produce higher training error \(\rightarrow\) the degradation problem.
- Core building block: \(y = F(x, \{W_i\}) + W_s x\), where \(W_s\) is a linear projection to match dimensions.
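A minimal sketch of the building block in Keras, with two 3x3 convs for \(F\) and a 1x1 projection playing the role of \(W_s\) when shapes differ (layer choices here are assumptions):

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    # F(x, {W_i}): two 3x3 conv layers with batch normalization
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    shortcut = x
    if stride != 1 or x.shape[-1] != filters:
        # W_s: linear projection so the shapes of F(x) and x match
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    return layers.ReLU()(layers.Add()([y, shortcut]))  # y = F(x) + W_s x
```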
DenseNet
- Github
- Densely-connected convolutional layers:
- Receive concatenation of feature maps from all preceding layers.
- For \(l^{th}\) layer, the output \(x_l=H_l([x_0, x_1,..., x_{l-1}])\)
- Pooling cannot be performed inside a dense block, since concatenation requires feature maps of matching size.
- Advantages:
- Universal access to feature maps within a dense block.
- Transition layers (1x1 conv) between dense blocks can reduce the number of feature maps.
- Deep supervision, feature reuse, network compression.
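A sketch of a dense block and a transition layer; the growth rate and compression factor are hyperparameters, and the BN-ReLU-Conv ordering follows the paper:

```python
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate):
    # Each H_l receives the concatenation [x_0, x_1, ..., x_{l-1}]
    for _ in range(num_layers):
        h = layers.BatchNormalization()(x)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)
        x = layers.Concatenate()([x, h])  # universal access to feature maps
    return x

def transition(x, compression=0.5):
    # 1x1 conv between dense blocks reduces feature maps; pooling is legal here
    channels = int(int(x.shape[-1]) * compression)
    x = layers.Conv2D(channels, 1)(x)
    return layers.AveragePooling2D(2)(x)
```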
MobileNet
- Essence:
The MobileNet model is based on depthwise separable convolutions which is a form of factorized convolutions which factorize a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution.
- Separates a standard convolutional layer into 2 layers (each followed by BatchNormalization and ReLU)
- Let:
- \(D_{F/G}\) = spatial dimension of feature map \(F/G\)
- \(D_K\) = spatial dimension of kernel \(K\)
- \(M\) = number of input channels, \(N\) = number of output channels
- Standard Convolution
- Assuming stride 1 and padding: \(G_{k,l,n} = \sum_{i,j,m}{K_{i,j,m,n}F_{k+i-1,\,l+j-1,\,m}}\), where \(i\) = row, \(j\) = column, \(m\) = input channel, \(n\) = output channel
- Parameter size of \(K = D_K\times D_K\times M\times N\)
- Computation cost = \(D_K\cdot D_K\cdot M\cdot N\cdot D_F\cdot D_F\)
- Depthwise separable convolution = depthwise + pointwise convolution (see the sketch after this list)
- Depthwise convolutions: 1 filter per input channel
\(\hat{G}_{k,l,m} = \sum_{i,j}{\hat{K}_{i,j,m}F_{ {k+i-1},{l+j-1},m} }\)
- Parameter size of \(\hat{K} = D_K\times D_K\times M\)
- Computation cost = \(D_K\cdot D_K\cdot M\cdot D_F\cdot D_F\)
- Pointwise convolutions = 1x1 convolution
- Computation cost = \(M\cdot N\cdot D_F\cdot D_F\)
- Further reduction of cost:
- Width multiplier: \(0 < \alpha \leq 1\) (thins the number of channels)
- Resolution multiplier: \(0 < \rho \leq 1\) (scales down the input resolution)
- Computation cost = \(D_K\cdot D_K\cdot \alpha M\cdot \rho D_F\cdot \rho D_F + \alpha M\cdot \alpha N\cdot \rho D_F\cdot \rho D_F\)
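A sketch of the two-layer replacement described above. Dividing the depthwise-plus-pointwise cost by the standard cost gives a reduction factor of \(\frac{1}{N} + \frac{1}{D_K^2}\), so a 3x3 depthwise separable layer costs roughly 8~9x less than a standard convolution:

```python
from tensorflow.keras import layers

def depthwise_separable(x, filters, stride=1):
    # Depthwise: one D_K x D_K filter per input channel
    # (cost: D_K * D_K * M * D_F * D_F)
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise: 1x1 conv that mixes channels (cost: M * N * D_F * D_F)
    x = layers.Conv2D(filters, 1)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```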
MobileNet_v2
(Figures: Overall Architecture | Inverted Residual)
- Main contribution: the inverted residual with linear bottleneck (sketched at the end of this section)
- Linear bottleneck:
- Assume the “manifold of interest” lies in a low-dimensional subspace of the input space.
- Bottleneck Residual Block (\(F(x) = [A\circ N\circ B]x\)): Composition of 3 operators
- Linear transformation: \(A: R^{s\times s\times k} \rightarrow R^{s\times s\times n}\)
- Non-linear per channel: \(N: R^{s\times s\times n} \rightarrow R^{ {s}'\times {s}'\times n}\)
- In this paper: \(N = ReLU6\circ dwise\circ ReLU6\)
- Linear transformation: \(B: R^{ {s}'\times {s}'\times n} \rightarrow R^{ {s}'\times {s}'\times {k}'}\)
- Represent the inner tensor \(I\) as a concatenation of \(t\) tensors of size \(n/t\): \(F(x) = \sum^t_{i=1}(A_i\circ N\circ B_i)(x)\)
- Gets an improvement because:
- The inner transformation is per-channel.
- Consecutive non-per-channel operators have a significant \(\frac{input\space size}{output\space size}\) ratio.
- Recommended \(t = 2\sim 5\) to avoid cache misses in matrix multiplication.
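A sketch of one bottleneck residual block \(A \circ N \circ B\); the expansion factor 6 is the paper's default, while other layer details here are assumptions:

```python
from tensorflow.keras import layers

def inverted_residual(x, filters, expansion=6, stride=1):
    k = int(x.shape[-1])
    h = layers.Conv2D(k * expansion, 1)(x)    # A: linear expansion, k -> n
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)         # ReLU6
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same")(h)  # dwise
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)         # N = ReLU6 o dwise o ReLU6
    h = layers.Conv2D(filters, 1)(h)          # B: linear bottleneck (no ReLU)
    h = layers.BatchNormalization()(h)
    if stride == 1 and k == filters:
        h = layers.Add()([x, h])              # shortcut between bottlenecks
    return h
```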
Neural Architecture Search (NASNet)
- Search architecture for building blocks on smaller dataset and transfer to larger dataset.
- Uses Reinforcement Learning to search for building blocks.
- Two main blocks (Cells):
- Normal Cell: a convolutional cell that returns a feature map of the same dimension.
- Reduction Cell: height and width are reduced by a factor of 2.
- Operations: common operations like convolution, max-pooling, and depthwise-separable convolution.
- Combination of 2 hidden states:
- Element-wise addition
- Concatenation
- Controller RNN: one-layer LSTM with softmax predictions for each decision.
Object Detection & Segmentation Using CNN
- For object detection & segmentation tasks, models need to output more fine-grained details.
- Thus, many architectures and training schemes were proposed for better efficiency.
What is mAP?
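Briefly: AP for one class is the area under the precision-recall curve obtained by ranking detections by confidence and counting a detection as correct when its IoU with an unmatched ground-truth box exceeds a threshold (e.g. 0.5); mAP averages AP over classes. IoU, which the sections below keep referring to, can be computed as in this sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```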
Region-based CNN (R-CNN)
- Flow: region proposals \(\rightarrow\) CNN feature extraction \(\rightarrow\) Image classification \(\rightarrow\) BBOX regression
- Region proposals: category-independent methods
- Examples: objectness, selective search, etc.
- CNN feature extraction: the paper uses OxfordNet (VGG16) and TorontoNet (AlexNet)
- Pre-trained on ILSVRC2012 (image classification labels)
- Fine-tuning on warped region proposals
- Treat all region proposals with >= 0.5 IoU as positive
- Add 1 additional background class
- Image classification: Use class-specific linear SVMs
- Grid search the overlap threshold (\(IoU = 0.1\sim 0.5\))
- Bounding Box regression:
- Linear regression with ridge regularization.
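A sketch of the class-specific bbox regressor as closed-form ridge regression on pooled CNN features; the feature matrix here is an assumption, while the targets in the paper are transforms like \(t_x = (G_x - P_x)/P_w\) and \(t_w = \log(G_w/P_w)\), with \(\lambda = 1000\):

```python
import numpy as np

def ridge_fit(X, T, lam=1000.0):
    """Closed-form ridge regression: W = (X^T X + lam*I)^-1 X^T T.
    X: (num_proposals, feat_dim) CNN features; T: (num_proposals, 4) targets."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)
```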
Spatial pyramid pooling networks (SPPnets):
- Removes the fixed-size input constraint: the convolutional layers accept arbitrary input sizes, and SPP pools their output into a fixed-length vector for the FC layers.
- Spatial Pyramid Pooling (SPP):
- Partition the feature map into bins at several scales and aggregate (pool) each bin.
- SPPnet:
- Replace the last pooling layer with SPP layer.
- SPP layer outputs: \(kM\)-dimensional vectors
- \(M\) = the number of spatial bins
- \(k\) = the number of filters in the last convolutional layer
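A NumPy sketch of the SPP layer with an assumed {1x1, 2x2, 4x4} pyramid (\(M = 21\) bins), so the output has length \(21k\) regardless of input size:

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool a (H, W, k) feature map into fixed bin grids and concatenate."""
    H, W, k = feature_map.shape
    pooled = []
    for n in levels:                      # one n x n grid per pyramid level
        for i in range(n):
            for j in range(n):
                r0, r1 = i * H // n, max((i + 1) * H // n, i * H // n + 1)
                c0, c1 = j * W // n, max((j + 1) * W // n, j * W // n + 1)
                pooled.append(feature_map[r0:r1, c0:c1].max(axis=(0, 1)))
    return np.concatenate(pooled)         # shape: (k * sum(n*n),) = (21k,)
```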
Fast R-CNN
- RoI pooling layer: only one pyramid level of SPPnet's SPP layer
- Initialization: a pre-trained ImageNet network with 3 modifications
- Replaced last max-pooling by RoI pooling layer.
- Add an FC layer with a softmax over K + 1 classes, plus bbox regressors.
- Modify the input into: a list of images and RoIs for each image.
- Comparison with R-CNN and SPPnet:
- Fast R-CNN: for each batch, use RoIs from a small set of images.
- End-to-End training for fine-tuning, classification, bbox regression.
- Multi-task loss:
- \([u\geq 1]\) = 1 if \(u\geq 1\) else 0 (background class: u = 0)
- \(L_{cls}(p,u) = -\log p_u\) is the log loss for the true class \(u\)
- \(L_{loc}(t^u,v) = \sum_{i\in \{x,y,w,h\}} smooth_{L_1}(t^u_i-v_i)\)
- \(smooth_{L_1}\) loss: less sensitive to outliers and prevents exploding gradients (see the sketch at the end of this section).
- Speed up detection using truncated SVD
- Motivation: detection is slow because many RoI vectors must forward-pass through the fully-connected layers (\(W\)).
- Factorize \(W\) as \(W \approx U\Sigma_t V^T\) (\(U: u\times t\), \(\Sigma_t: t\times t\), \(V: v\times t\))
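A NumPy sketch of the factorization: one \(u\times v\) layer becomes a \(t\times v\) layer followed by a \(u\times t\) layer, cutting parameters from \(uv\) to \(t(u+v)\):

```python
import numpy as np

def truncate_fc(W, t):
    """W ~= U Sigma_t V^T, keeping the top-t singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = np.diag(s[:t]) @ Vt[:t]   # (t, v): Sigma_t V^T, no nonlinearity
    second = U[:, :t]                 # (u, t)
    return second, first              # y ~= second @ (first @ x)
```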
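And the \(smooth_{L_1}\) regression loss from the multi-task loss above, as a sketch:

```python
import numpy as np

def smooth_l1(x):
    # 0.5 x^2 when |x| < 1 (smooth near zero), |x| - 0.5 otherwise
    # (linear tails keep gradients bounded for outliers)
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def loc_loss(t_u, v):
    # L_loc(t^u, v): sum of smooth L1 over the (x, y, w, h) coordinates
    return smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
```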
Faster R-CNN (Region Proposal Network(RPN))
- Motivation: region proposals cause a computational bottleneck
- Faster R-CNN = RPN + Fast R-CNN
- RPN: 2 additional conv layers (a kind of FCN)
- Encodes each position in the feature map into a fixed-length vector (nxn conv.)
- Outputs, for each of the \(k\) region proposals of various scales at that position, an objectness score and a regressed bbox (1x1 convs.)
- Translation-Invariant Anchors: a set of pre-defined bboxes
- Each anchor can have different scales and aspect ratios
- Here they use 3 scales and 3 aspect ratios, resulting in 9 anchors at each position (see the sketch at the end of this section)
- Loss function for learning region proposals
- Labels for each anchor (object or background):
- Positive:
- Highest IoU with ground-truth bboxes.
- IoU with ground truth \(\geq\) threshold (e.g. 0.7)
- Negative: IoU \(\leq\) threshold (e.g. 0.3) with all ground-truth bboxes.
- Multi-task loss: \(L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i,p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^*L_{reg}(t_i,t_i^*)\)
- \(p_i\) = predicted probability of anchor \(i\) being an object. (\(p_i^*\) = true label)
- \(t_i\) = a vector of bbox coordinates. (\(t_i^*\) = true coordinates)
- \(L_{cls}(p_i,p_i^*)\) = log loss over 2 classes
- \(L_{reg}(t_i,t_i^*)\) = smoothed L1-loss in Fast-RCNN
- Share conv features between RPN and Fast R-CNN: 4-step alternating training
- Step 1: train RPN (initialized from an ImageNet-pre-trained model, fine-tuned end-to-end).
- Step 2: train Fast R-CNN using the proposals from step 1.
- Step 3: re-initialize RPN from the detector and fine-tune it with the shared conv layers fixed.
- Step 4: fine-tune Fast R-CNN with the shared conv layers fixed.
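The anchor set from the RPN section above, as a sketch; base size 16 and scales {8, 16, 32} follow the paper's defaults, giving 128², 256², 512² boxes:

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """9 anchors (3 scales x 3 ratios) as (x1, y1, x2, y2) offsets around one
    position; the same set is reused everywhere (translation invariance)."""
    anchors = []
    for s in scales:
        for r in ratios:                 # r = height / width
            area = float(base * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)             # shape (9, 4)
```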
Mask R-CNN
- Adds a mask branch (an FCN) for segmentation to Faster R-CNN.
- Multi-task loss for each RoI: \(L = L_{cls} + L_{box} + L_{mask}\)
- \(L_{cls},L_{box}\) are the same as Faster R-CNN
- \(L_{mask}\) is the average binary cross-entropy loss of pixel-wise masks.
- \(K\) binary masks of resolution \(m \times m\), one for each of the \(K\) classes.
- Decouples classification and segmentation via a per-pixel sigmoid and binary loss (see the sketch at the end of this section).
- To fix misalignment, a quantization-free layer: RoIAlign
- Uses bilinear interpolation to compute exact feature values at regularly sampled locations.
- Avoiding any quantization is the crucial point.
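A sketch of \(L_{mask}\) for one RoI, with assumed shapes: \(K\times m\times m\) logits, an \(m\times m\) binary ground-truth mask, and the true class index:

```python
import numpy as np

def mask_loss(pred_logits, gt_mask, cls):
    """Per-pixel sigmoid + binary cross-entropy on the true class's mask only;
    masks of other classes do not contribute (no inter-class competition)."""
    p = 1.0 / (1.0 + np.exp(-pred_logits[cls]))   # sigmoid, not softmax over K
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1.0 - gt_mask) * np.log(1.0 - p + eps))
    return bce.mean()                             # average over m x m pixels
```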
Feature Pyramid Networks (FPN)
- Main Goal:
The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.
- To construct multi-scale feature maps for downstream tasks.
- Utilize each level of conv-layers to produce multi-scale feature maps.
- Each level of feature maps = Upsampling(higher level) + Conv1x1(current level), as sketched at the end of this section.
- Applications:
- For Region-proposal Network (RPN):
- Attach the RPN head (3x3 conv and 2 1x1 convs) to each level of feature pyramid.
- Anchors have a single scale per level, with multiple aspect ratios.
- For Fast R-CNN:
- Assign an RoI of width \(w\) and height \(h\) to level \(P_k\) by: \(k = \lfloor{k_0 + \log_2(\sqrt{wh}/224)}\rfloor\)
- \(k_0\) = the target level onto which an RoI of the canonical size is mapped (the paper uses \(k_0 = 4\))
- 224 is the canonical ImageNet pre-training size
- Attach predictor heads to all RoIs of all levels.
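A sketch of one top-down merge step and the RoI-to-level assignment; 256 lateral channels and \(k_0 = 4\) follow the paper, and spatial sizes are assumed to differ by exactly 2x so the addition aligns:

```python
import math
from tensorflow.keras import layers

def merge_level(higher, current, channels=256):
    # P_k = Upsample(P_{k+1}) + Conv1x1(C_k)
    lateral = layers.Conv2D(channels, 1)(current)
    top_down = layers.UpSampling2D(2)(higher)
    return layers.Add()([lateral, top_down])

def roi_level(w, h, k0=4):
    # k = floor(k0 + log2(sqrt(w*h) / 224)); e.g. a 112x112 RoI maps to k0 - 1
    return math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
```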