This is a note on CNN model architectures.

  • For an introduction to Convolutional Neural Networks (CNNs), I recommend reading this first.

Image Classification

  • There are pre-trained models trained on ImageNet in keras.applications.
  • Pre-trained models are useful for transfer learning.
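
For instance, a minimal transfer-learning sketch with tf.keras (the backbone choice, input shape, and the 10-class head here are illustrative assumptions, not a prescribed recipe):

```python
import tensorflow as tf

# Load an ImageNet-pre-trained backbone without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze backbone weights for transfer learning

# Attach a new head for a hypothetical 10-class task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```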

Inception (GoogLeNet)

Inception

  • Inception module: applies convolutions of various filter sizes in parallel and concatenates the resulting feature maps.
    • Useful for localization
    • Main Idea:

      Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

  • Use 1x1 convolutions to reduce dimensionality before the expensive convolutions (see the sketch below).
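
A minimal tf.keras sketch of a naive Inception-style module with 1x1 reductions (the filter counts loosely follow the inception(3a) block and are illustrative, not an exact reproduction):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3r=96, f3=128, f5r=16, f5=32, fp=32):
    """Parallel branches; 1x1 convs (f3r, f5r) reduce depth before 3x3/5x5."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3r, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5r, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])  # stack along channels

inp = tf.keras.Input((28, 28, 192))
out = inception_module(inp)  # -> (28, 28, 64 + 128 + 32 + 32)
```
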
InceptionV2

Factorized Convolution / Asymmetric Convolution

inceptionV2_block inceptionV2_block2
  • General Design Principles:
    1. Avoid representational bottlenecks: extreme compression causes information loss.
    2. Higher-dimensional representations (more activations per tile) are easier to process locally and speed up training.
    3. Spatial aggregation can be done over lower-dimensional embeddings, since adjacent units are likely strongly correlated.
    4. Balance the width and the depth of the network.
  • Factorization of convolutions:
    • Replace a 5x5 conv with two 3x3 convs (exploits translation invariance again).
    • Replace a 3x3 conv with 3x1 and 1x3 convs (nxn \(\rightarrow\) nx1 and 1xn).
    • The asymmetric factorization works well on m x m feature maps with \(m = 12\sim 20\).
  • Efficiently reduce grid size:
    • Convolution and pooling are done separately, then concatenated back. reduce_grid
  • Label-smoothing regularization (LSR): prevents the model from assigning labels too confidently (see the sketch after this list)
    • \(x\) / \(y\) = training example / label
    • \(K\) = number of labels, \(u(k)\) = a fixed distribution (e.g. \(u(k) = 1/K\))
    • Original label distribution: \(q(k\|x) = \delta_{k,y}\)
    • Smoothed label distribution: \(q'(k\|x) = (1 - \epsilon)\delta_{k,y} + \epsilon u(k)\)
      • Cross-entropy loss: \(H(q', p) = (1 - \epsilon)H(q, p) + \epsilon H(u, p)\)
      • The \(H(u, p)\) term can equivalently be measured by the KL divergence \(D_{KL}(u\|p)\), up to a constant.
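
As referenced above, a minimal NumPy sketch of label smoothing with a uniform \(u(k) = 1/K\):

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """q'(k|x) = (1 - eps) * delta_{k,y} + eps * u(k), with u(k) = 1/K."""
    K = y_onehot.shape[-1]
    return (1.0 - epsilon) * y_onehot + epsilon / K

y = np.eye(3)[[0, 2]]    # two one-hot labels over K = 3 classes
print(smooth_labels(y))  # each row ~ [0.9333, 0.0333, 0.0333] permuted
```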

ResNet

Residual Network

  • Residual connections allow feature maps to be reused.
  • Motivation: deeper plain networks produce higher training error \(\rightarrow\) the degradation problem.
  • Core building block (\(W_s\) is a linear projection to match dimensions, sketched below): \(y = F(x, \{W_i\}) + W_s x\)
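
A minimal tf.keras sketch of this block (two 3x3 convs for \(F\); the 1x1 projection \(W_s\) is applied only when shapes differ, otherwise the shortcut is the identity):

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """y = F(x, {W_i}) + W_s x; W_s is the identity when shapes match."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or x.shape[-1] != filters:
        # W_s: 1x1 linear projection to match spatial size and channels
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```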

DenseNet

img

  • Github
  • Densely-connected convolutional layers (see the sketch after this list):
    • Receive concatenation of feature maps from all preceding layers.
    • For \(l^{th}\) layer, the output \(x_l=H_l([x_0, x_1,..., x_{l-1}])\)
    • Pooling cannot be performed inside a dense block, since concatenation requires feature maps of the same spatial size.
  • Advantages:
    • Every layer has direct access to the feature maps of all preceding layers within a dense block.
    • The number of feature maps can be reduced by a 1x1 conv in the transition layers between dense blocks.
    • Deep supervision, feature reuse, and network compression.
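
A minimal tf.keras sketch of a dense block and a transition layer (the growth rate, block depth, and reduction factor here are illustrative):

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """Each H_l sees the concatenation [x_0, x_1, ..., x_{l-1}]."""
    for _ in range(num_layers):
        h = layers.BatchNormalization()(x)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)  # H_l output
        x = layers.Concatenate()([x, h])  # grow the shared feature pool
    return x

def transition(x, reduction=0.5):
    """Between dense blocks: 1x1 conv compresses channels, then pooling."""
    x = layers.Conv2D(int(x.shape[-1] * reduction), 1)(x)
    return layers.AveragePooling2D(2)(x)
```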

MobileNet

img

  • Essence:

    The MobileNet model is based on depthwise separable convolutions which is a form of factorized convolutions which factorize a standard convolution into a depthwise convolution and a 1×1 convolution called a pointwise convolution.

  • Separate a standard convolutional layer into 2 layers (each with BatchNormalization and ReLU on its outputs)

  • Let:
    • \(D_{F/G}\) = spatial dimension of feature map \(F/G\)
    • \(D_K\) = spatial dimension of kernel \(K\)
    • \(M\) = number of input channels, \(N\) = number of output channels
  • Standard Convolution
    • Assuming stride = 1 and padding: \(G_{k,l,n} = \sum_{i,j,m}{K_{i,j,m,n}F_{ {k+i-1},{l+j-1},m} }\), where \(i\) = row, \(j\) = column, \(m\) = input channel, \(n\) = output channel

    • Parameter size of \(K = D_K\times D_K\times M\times N\)
    • Computation cost = \(D_K\cdot D_K\cdot M\cdot N\cdot D_F\cdot D_F\)
  • Depthwise separable convolution = depthwise + pointwise convolution (see the sketch after this list)
    • Depthwise convolutions: 1 filter per input channel \(\hat{G}_{k,l,m} = \sum_{i,j}{\hat{K}_{i,j,m}F_{ {k+i-1},{l+j-1},m} }\)
      • Parameter size of \(\hat{K} = D_K\times D_K\times M\)
      • Computation cost = \(D_K\cdot D_K\cdot M\cdot D_F\cdot D_F\)
    • Pointwise convolution = 1x1 convolution
      • Computation cost = \(M\cdot N\cdot D_F\cdot D_F\)
    • Further cost reduction:
      • Width multiplier: \(0 < \alpha \leq 1\)
      • Resolution multiplier: \(0 < \rho \leq 1\)
      • Computation cost = \(D_K\cdot D_K\cdot \alpha M\cdot \rho D_F\cdot \rho D_F + \alpha M\cdot \alpha N\cdot \rho D_F\cdot \rho D_F\)
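
As referenced above, a minimal tf.keras sketch of one depthwise separable layer (a 3x3 depthwise kernel is assumed, as in the paper):

```python
from tensorflow.keras import layers

def depthwise_separable(x, out_channels, stride=1):
    """Depthwise 3x3 (one filter per input channel) then pointwise 1x1,
    each followed by BatchNorm + ReLU. Cost ratio vs. a standard conv
    is roughly 1/N + 1/(D_K^2)."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(out_channels, 1)(x)  # pointwise: mixes channels
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```
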
MobileNetV2

Overall Architecture / Inverted Residual

MobileNetV2 invertedResidual
  • Main contribution: inverted residual with linear bottleneck (sketched after this list)
  • Linear bottleneck:
    • Assume the “manifold of interest” lies in a low-dimensional subspace of the input space.
  • Bottleneck Residual Block (\(F(x) = [A\circ N\circ B]x\)): Composition of 3 operators
    • Linear transformation: \(A: R^{s\times s\times k} \rightarrow R^{s\times s\times n}\)
    • Non-linear per channel: \(N: R^{s\times s\times n} \rightarrow R^{ {s}'\times {s}'\times n}\)
      • In this paper: \(N = ReLU6\circ dwise\circ ReLU6\)
    • Linear transformation: \(B: R^{ {s}'\times {s}'\times n} \rightarrow R^{ {s}'\times {s}'\times {k}'}\)
    • Represent the inner tensor \(I\) as a concatenation of \(t\) tensors of size \(n/t\): \(F(x) = \sum^t_{i=1}(A_i\circ N\circ B_i)(x)\)
    • The improvement comes because:
      1. The inner transformation is per-channel.
      2. Consecutive non-per-channel operators have a significant \(\frac{\text{input size}}{\text{output size}}\) ratio.
      3. \(t = 2\sim 5\) is recommended to avoid cache misses in matrix multiplication.
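
A minimal tf.keras sketch of the inverted residual block (the expansion factor of 6 follows the paper; the exact BatchNorm placement here is an assumption):

```python
from tensorflow.keras import layers

def inverted_residual(x, out_channels, expansion=6, stride=1):
    """Expand (1x1 + ReLU6), depthwise (3x3 + ReLU6), project (1x1, linear)."""
    in_channels = x.shape[-1]
    h = layers.Conv2D(expansion * in_channels, 1)(x)  # A: expand
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same")(h)  # N
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    h = layers.Conv2D(out_channels, 1)(h)  # B: linear bottleneck (no ReLU)
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])  # shortcut connects the bottlenecks
    return h
```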

Neural Architecture Search (NASNet)

NASNet

  • Search for architectural building blocks on a smaller dataset, then transfer them to a larger dataset.
  • Use reinforcement learning to search for the building blocks.

  • Two main building blocks (cells):
    1. Normal Cell: a convolutional cell that returns a feature map of the same dimensions.
    2. Reduction Cell: height and width are reduced by a factor of 2.
  • Operations: common operations such as convolution, max-pooling, and depthwise-separable convolution.
  • Combination of 2 hidden states:
    1. Element-wise addition
    2. Concatenation
  • Controller RNN: a one-layer LSTM with softmax predictions for each decision. Controller

Object Detection & Segmentation Using CNN

  • For object detection & segmentation tasks, models need to output more fine-grained details.
  • Thus, many other architectures and training schemes have been proposed for better efficiency.

What is mAP?

Region-based CNN (R-CNN)

img

  • Flow: region proposals \(\rightarrow\) CNN feature extraction \(\rightarrow\) Image classification \(\rightarrow\) BBOX regression
  • Region proposals: category-independent methods
    • Examples: objectness, selective search, etc.
  • CNN feature extraction: the paper uses OxfordNet and TorontoNet
    • Pre-trained on ILSVRC2012 (image classification labels)
    • Fine-tuning on warped region proposals
      • Treat all region proposals with IoU \(\geq 0.5\) against a ground-truth box as positive (IoU is sketched after this list)
      • Add 1 additional background class
  • Image classification: Use class-specific linear SVMs
    • Grid search the overlap threshold (\(IoU = 0.1\sim 0.5\))
  • Bounding Box regression:
    • Linear regression with ridge regularization.
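
As referenced above, a minimal sketch of the IoU computation used for the positive/negative labeling (corner-coordinate box format assumed):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal with IoU >= 0.5 against a ground-truth box counts as positive.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```
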
Spatial Pyramid Pooling Networks (SPPnet)

SPPnet

  • Removes the fixed-size input constraint from the convolutional part of the network.
  • Spatial Pyramid Pooling (SPP):
    • Partition the feature map into bins at various scales and pool each bin (see the sketch after this list).
  • SPPnet:
    • Replace the last pooling layer with SPP layer.
    • SPP layer outputs: \(kM\)-dimensional vectors
      • \(M\) = the number of spatial bins
      • \(k\) = the number of filters in the last convolutional layer
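
A minimal NumPy sketch of SPP over a single feature map (max-pooling and a 3-level {1x1, 2x2, 4x4} pyramid are assumed here, and the bin boundaries are simplified relative to the paper's ceil/floor windows):

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Pool a (H, W, k) map into a fixed kM vector, M = total bin count."""
    H, W, k = feature_map.shape
    pooled = []
    for n in levels:  # n x n bins at this pyramid level
        for i in range(n):
            for j in range(n):
                patch = feature_map[i * H // n:(i + 1) * H // n,
                                    j * W // n:(j + 1) * W // n]
                pooled.append(patch.max(axis=(0, 1)))  # max-pool each bin
    return np.concatenate(pooled)  # length k * (1 + 4 + 16), any H, W

print(spp(np.random.rand(13, 13, 256)).shape)  # (5376,) = 256 * 21
```
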
Fast R-CNN

FastR-CNN

  • RoI pooling layer: essentially SPP with only one pyramid level
  • Initialization: a pre-trained ImageNet network with 3 modifications
    • Replaced last max-pooling by RoI pooling layer.
    • Add a FC and softmax over K + 1 classes and bbox regressors.
    • Modify the input to: a list of images and RoIs for each image.
  • Comparison with R-CNN and SPPnet:
    • Fast R-CNN: for each mini-batch, RoIs are sampled from a small set of images, so computation is shared.
    • End-to-end training for fine-tuning, classification, and bbox regression.
  • Multi-task loss:
\[L(p,u,t^u,v) = L_{cls}(p,u) + \lambda [u\geq 1]L_{loc}(t^u,v)\]
  • \([u\geq 1]\) = 1 if \(u\geq 1\), else 0 (background class: \(u = 0\))
  • \(L_{cls}(p,u) = -\log p_u\) is the log loss for the true class \(u\)
  • \[L_{loc}(t^u,v) = \sum_{i\in \{x,y,w,h\} } \text{smooth}_{L_1}(t^u_i-v_i)\]
    • Smooth \(L_1\) loss: less sensitive to outliers and prevents exploding gradients.
\[\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \newline |x| - 0.5 & \text{otherwise} \end{cases}\]
  • Speed up detection using truncated SVD (sketched after this list)
    • Motivation: forward passes through the fully-connected layers (\(W\)) are slow when there are many RoI vectors.
    • Factorize \(W \approx U\Sigma_t V^T\) (\(U: u\times t\), \(\Sigma_t: t\times t\), \(V: v\times t\))
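
A minimal NumPy sketch of the truncated-SVD factorization (the layer sizes \(u\), \(v\) and rank \(t\) are illustrative):

```python
import numpy as np

u, v, t = 4096, 9216, 256  # FC layer shape and truncation rank (assumed)
W = np.random.randn(u, v)
b = np.zeros(u)

# W ~ U_t Sigma_t V_t^T: replace one FC layer with two smaller ones.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W1 = Vt[:t]                 # first layer weights:  t x v (no bias)
W2 = U[:, :t] * s[:t]       # second layer weights: u x t (Sigma folded in)

x = np.random.randn(v)
y = W2 @ (W1 @ x) + b       # two small matmuls approximate W @ x + b
print(u * v, "params ->", t * (u + v))  # 37748736 -> 3407872
```
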
Faster R-CNN (Region Proposal Network, RPN)

FasterR-CNN

  • Motivation: Region proposals cause computation bottleneck
    • Faster R-CNN = RPN + Fast R-CNN
  • RPN: 2 additional conv layers (a kind of FCN)
    1. Encode each position in the feature map into a fixed-length vector (nxn conv).
    2. For each vector, output an objectness score and regressed bboxes for \(k\) region proposals of various scales (1x1 conv).
  • Translation-invariant anchors: a set of pre-defined bboxes
    • Each anchor can have a different scale and aspect ratio
      • Here they use 3 scales and 3 aspect ratios, resulting in 9 anchors at each position (see the sketch after this list)
  • Loss function for learning region proposals
    • Labels for each anchor (object or background):
      • Positive:
        1. Highest IoU with ground-truth bboxes.
        2. IoU with ground truth \(\geq\) threshold (e.g. 0.7)
      • Negative: IoU \(\leq\) threshold (e.g. 0.3) with all ground-truth bboxes.
    • Multi-task loss: \(L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i,p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^*L_{reg}(t_i,t_i^*)\)
      • \(p_i\) = predicted probability of anchor \(i\) being an object. (\(p_i^*\) = true label)
      • \(t_i\) = a vector of bbox coordinates. (\(t_i^*\) = true coordinates)
      • \(L_{cls}(p_i,p_i^*)\) = log loss over 2 classes
      • \(L_{reg}(t_i,t_i^*)\) = smoothed L1-loss in Fast-RCNN
  • Share conv features between RPN and Fast R-CNN: 4-step training
    1. Train RPN (with ImageNet-pre-trained model and fine-tune end-to-end)
    2. Train Fast R-CNN using proposals from 1.
    3. Initialize RPN from the detector network and fine-tune it with the shared conv layers fixed.
    4. Fine-tune Fast R-CNN with the shared conv layers fixed.
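
As referenced above, a minimal NumPy sketch of anchor generation (the base size, scales, and ratio convention here are illustrative, roughly matching the paper's 3 x 3 = 9 anchors per position):

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """9 anchors (3 scales x 3 aspect ratios) centered at one position,
    as (x1, y1, x2, y2) offsets; shift by the feature stride per cell."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = h / w, keeping the anchor area fixed
            w = base * s * np.sqrt(1.0 / r)
            h = base * s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4), replicated at every position
```
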
Mask R-CNN

Mask R-CNN

  • Adds a mask branch (an FCN) for instance segmentation to Faster R-CNN.

  • Multi-task loss for each RoI: \(L = L_{cls} + L_{box} + L_{mask}\)
    • \(L_{cls},L_{box}\) are the same as Faster R-CNN
    • \(L_{mask}\) is the average binary cross-entropy loss of pixel-wise masks.
    • \(K\) binary masks of resolution \(m\times m\) for \(K\) classes.
    • Decouple classification and segmentation by per-pixel sigmoid and binary loss.
  • To fix misalignment, a quantization-free layer is introduced: RoIAlign
    • Use bilinear interpolation to compute exact values of the input features.
    • Avoiding quantization is the crucial point.
  • Network:
    • Backbone: for feature extraction, e.g. ResNet, FPN
    • Head: for regression, classification and mask prediction, e.g. FCN

Feature Pyramid Networks (FPN)

img

  • Main Goal:

    The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales.

  • To construct multi-scale feature maps for downstream tasks.
    • Utilize each level of conv-layers to produce multi-scale feature maps.
    • Each level of feature maps = Upsampling(higher level) + Conv1x1(current level); see the sketch after this list.
  • Applications:
    • For Region-proposal Network (RPN):
      • Attach the RPN head (3x3 conv and 2 1x1 convs) to each level of feature pyramid.
      • Each anchor has a single scale per level, with multiple aspect ratios.
    • For Fast R-CNN:
      • Assign an RoI of width \(w\) and height \(h\) to level \(P_k\) by: \(k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor\)
        • \(k_0\) = the target level that an RoI of size \(224\times 224\) is mapped to (the paper uses \(k_0 = 4\))
        • 224 is the canonical ImageNet pre-training size
      • Attach predictor heads to all RoIs of all levels.
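
A minimal tf.keras sketch of the top-down pathway (256-d feature maps follow the paper; the backbone shapes in the usage lines are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def fpn_top_down(c_levels, d=256):
    """c_levels = backbone feature maps ordered fine -> coarse (e.g. C2..C5).
    Each pyramid level = upsampled coarser level + 1x1 conv of this level."""
    p = layers.Conv2D(d, 1)(c_levels[-1])  # coarsest lateral connection
    outs = [layers.Conv2D(d, 3, padding="same")(p)]
    for c in reversed(c_levels[:-1]):
        lateral = layers.Conv2D(d, 1)(c)   # 1x1 conv on the current level
        p = layers.Add()([lateral, layers.UpSampling2D(2)(p)])
        outs.insert(0, layers.Conv2D(d, 3, padding="same")(p))  # anti-alias
    return outs                            # [P2, ..., P5], finest first

c2 = tf.keras.Input((64, 64, 256))
c3 = tf.keras.Input((32, 32, 512))
c4 = tf.keras.Input((16, 16, 1024))
c5 = tf.keras.Input((8, 8, 2048))
p2, p3, p4, p5 = fpn_top_down([c2, c3, c4, c5])
```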

head

Fully Convolutional Networks (FCN)

FCN


Others: Maybe next time

Multi-function CNN for Medical image classification