This note is based on a survey of texture representation methods.


Texture Representation

  • In contrast to object classification, global spatial information is less important for texture.

Bag-of-Word based methods (BoW)

BoW pipeline (a minimal code sketch follows the list)

  1. Local Patch Extraction
  2. Local Patch Representation (Feature Descriptors)
    • Ideal: distinctive, robust to variations
  3. Codebook Generation
    • Find a set of prototype features from training data.
    • Analogous to words and phrases in a language. (Similar in spirit to dimensionality reduction)
  4. Feature Encoding
    • Assign each local representation to one or more prototype features (codewords).
  5. Feature Pooling
  6. Feature Classification
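
A minimal sketch of the full pipeline, assuming raw pixel patches as the local descriptor, k-means for codebook generation, hard assignment with histogram pooling, and a linear SVM classifier. All data, sizes, and parameters below are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def local_descriptors(image, patch_size=(8, 8), n_patches=200, seed=0):
    # Steps 1-2: extract local patches and represent each one as a flat vector.
    patches = extract_patches_2d(image, patch_size, max_patches=n_patches,
                                 random_state=seed)
    return patches.reshape(len(patches), -1).astype(np.float64)

def encode(image, codebook):
    # Steps 4-5: hard-assign each descriptor to its nearest codeword,
    # then pool the assignments into a normalized histogram.
    words = codebook.predict(local_descriptors(image))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / hist.sum()

# Toy training data: random "texture images" with two class labels.
rng = np.random.default_rng(0)
train_images = [rng.random((64, 64)) for _ in range(20)]
train_labels = np.array([0] * 10 + [1] * 10)

# Step 3: build the codebook from all training descriptors with k-means.
all_desc = np.vstack([local_descriptors(im) for im in train_images])
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(all_desc)

# Step 6: classify the pooled histograms with a linear SVM.
X_train = np.array([encode(im, codebook) for im in train_images])
clf = LinearSVC().fit(X_train, train_labels)
```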

Convolutional Neural Network based methods

Pre-trained Generic CNN models
  • CNN = convolution + non-linear activation + pooling
    • Related to LBP, Random Projection (RP), etc.
    • CNN-extracted features can be encoded by BoW-based methods (see the sketch below).
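
A minimal sketch of using a pretrained, object-centric CNN as a generic texture feature extractor; the network (VGG16) and layer choice are illustrative. Each spatial position of the conv feature map becomes one local descriptor that a BoW-style encoder can consume:

```python
import torch
import torchvision.models as models

# Pretrained conv layers only (the weights string assumes torchvision >= 0.13).
backbone = models.vgg16(weights="IMAGENET1K_V1").features.eval()

image = torch.rand(1, 3, 224, 224)            # placeholder input image
with torch.no_grad():
    fmap = backbone(image)                    # (1, 512, 7, 7) conv activations

# Flatten spatial positions: 49 local descriptors of dimension 512,
# playing the role of the hand-crafted patch descriptors in BoW.
descriptors = fmap.squeeze(0).flatten(1).T    # (49, 512)
```
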
Fine-tuned CNN models

Fine-tuned CNN

  • Texture CNN (T-CNN)
    • Energy layer: average the activation output of each feature map of the last conv layer (sketched in code below).
      • 1 value per feature map
      • Similar to energy response of a filter bank.
      • Example: 256x27x27 (channel x height x width) \(\rightarrow\) 256x1
    • Concat: GlobalAveragePooling(intermediate conv.) and last conv.
    • Insight:
      • Fine-tuning a network pretrained on a texture-centric dataset performs better than fine-tuning one pretrained on an object-centric dataset.
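
A minimal sketch of the energy layer using the shapes from the example above; it is equivalent to global average pooling over each feature map:

```python
import torch
import torch.nn.functional as F

fmap = torch.rand(1, 256, 27, 27)                    # last conv-layer output
energy = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (1, 256): one value per feature map
```
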
  • Bilinear CNN (BCNN)
    • Replace the FC layers with an orderless bilinear pooling layer (matrix outer product + average pooling); a sketch follows the equation below.
      • \(f(l, I)\): Feature function, l = locations, I = image
      • \[f: L \times I \rightarrow R^{K\times D}\]
    • Cost: high-dimensional features \(\rightarrow\) need lots of training data
\[bilinear(l, I, f_A, f_B) = f_A(l, I)^{T}f_B(l, I)\]
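
A minimal sketch of the bilinear pooling step; random tensors stand in for the conv features of the two streams (in BCNN both streams can be the same network):

```python
import torch

fA = torch.rand(1, 512, 28, 28)     # stream A conv features
fB = torch.rand(1, 512, 28, 28)     # stream B conv features

A = fA.flatten(2)                   # (1, 512, 784): one column per location
B = fB.flatten(2)

# Outer product at each location, averaged over all locations -> orderless.
bilinear = torch.bmm(A, B.transpose(1, 2)) / A.shape[-1]   # (1, 512, 512)
feature = bilinear.flatten(1)       # 262,144-dim descriptor (hence the cost above)
```

The original BCNN additionally normalizes this descriptor (signed square root followed by L2 normalization) before the classifier.
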
  • Compact BCNN
    • To reduce the BCNN feature dimension, approximate the inner product of two bilinear features.
      • Random Maclaurin Projection or Tensor Sketch Projection
    • To reduce computation time, represent the bilinear features as a matrix and apply a low-rank bilinear classifier.
  • FASON (First And Second Order information fusion Network)
    • T-CNN + Compact BCNN
  • NetVLAD
    • VLAD = Vector of Locally Aggregated Descriptors (a hard-assignment sketch follows the equation below)
\[V(j, k) = \sum_{i=1}^{N}a_k(x_i)(x_i(j) - c_k(j))\]
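
A minimal sketch of the aggregation in the equation above, using hard assignment; NetVLAD replaces \(a_k(x_i)\) with a differentiable soft assignment and learns the codewords \(c_k\) end to end:

```python
import numpy as np

def vlad(descriptors, codewords):
    """descriptors: (N, D), codewords: (K, D) -> flattened (K*D,) VLAD vector."""
    dists = ((descriptors[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)                 # nearest codeword per descriptor
    V = np.zeros_like(codewords)
    for k in range(len(codewords)):
        members = descriptors[assign == k]
        if len(members):
            V[k] = (members - codewords[k]).sum(axis=0)   # sum of residuals
    return V.ravel()                              # typically L2-normalized afterwards

encoding = vlad(np.random.rand(100, 64), np.random.rand(8, 64))
```
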
  • Deep Texture Encoding Network (DeepTEN)

  • Residual encoding layer:
    • Each input \(x_i\) is assigned a soft weight with respect to each codeword.
    • In contrast to standard soft-weight assignment, the smoothing factor for each cluster is learnable.
  • Residual encoding (output = \(e_k\) for each codeword \(c_k\)); a sketch follows the equations below:
\[r_{ik} = x_i - c_k\] \[e_k = \sum_{i=1}^{N}e_{ik} = \sum_{i=1}^{N}a_{ik}r_{ik}\]
  • Soft-weight assignment (\(a_{ik}\))
    • \(\beta\) is the smoothing factor
    • \(\beta \rightarrow s_k\) where \(s_k\) is learnable
\[a_{ik} = \frac{\exp(-\beta \|r_{ik}\|^2)}{\sum_{j=1}^{K}\exp(-\beta \|r_{ij}\|^2)}\]
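
A minimal sketch of the residual encoding layer described above; in DeepTEN the codewords \(c_k\) and smoothing factors \(s_k\) are learned end to end with the rest of the network:

```python
import torch

def residual_encode(X, C, s):
    """X: (N, D) descriptors, C: (K, D) codewords, s: (K,) smoothing factors."""
    r = X[:, None, :] - C[None, :, :]                 # residuals r_ik, shape (N, K, D)
    a = torch.softmax(-s * (r ** 2).sum(-1), dim=1)   # soft weights a_ik, shape (N, K)
    return (a[..., None] * r).sum(dim=0)              # e_k per codeword, shape (K, D)

E = residual_encode(torch.rand(100, 64), torch.rand(8, 64), torch.rand(8))
```
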
Texture-specific CNN models

Texture-specific CNN

  • ScatNet
    • Pre-determined convolution filters (e.g., Haar or Gabor wavelets); a first-order sketch follows this list.
      • Translation-invariant, also extends to rotation and scale invariance.
      • No learning needed, but feature extraction is expensive.
    • Explores the theoretical aspects of CNNs.
  • PCANet
    • Uses PCA filters learned from training image patches.
    • Variations: RandNet, LDANet
    • Faster feature extraction than ScatNet. Weaker invariance and performance.
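
A first-order sketch of fixed, non-learned convolutions in the ScatNet spirit: convolve with a small Gabor wavelet bank, take the modulus, and average each response into an energy value. Frequencies and orientations are illustrative; a full scattering transform cascades this operation:

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def filter_bank_energies(image, frequencies=(0.1, 0.25), n_orient=4):
    energies = []
    for freq in frequencies:
        for theta in np.arange(n_orient) * np.pi / n_orient:
            kernel = gabor_kernel(freq, theta=theta)        # fixed complex wavelet
            response = fftconvolve(image, kernel, mode="same")
            energies.append(np.abs(response).mean())        # modulus + averaging
    return np.array(energies)            # one energy per (frequency, orientation)

features = filter_bank_energies(np.random.rand(64, 64))
```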

Attribute based methods

  • Allows a more detailed description of an image:
    • e.g., spotted, striated, striped, etc.
  • Issues:
    • A unified vocabulary is needed for describing texture attributes.
    • Benchmark datasets annotated with semantic attributes are needed.
  • Studies that characterized texture attributes:
    1. Tamura et al.: coarseness, contrast, directionality, line-likeness, regularity and roughness
    2. Amadasun and King: coarseness, contrast, busyness, complexity, and strength (a refinement of 1)
    3. Matthews et al.: 11 commonly used attributes, each described by a single adjective (used for relative comparisons)

Texture Datasets

Summary