Paper Study: notes from the paper-reading series by Li Mu and Zhu (Amazon)

NLP

(2019 Google AI) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT 论文逐段精读【论文精读】 - YouTube

Model: Transformer (Encoder)
Training: SSL, bidirectional: masked language modeling (masked tokens) + next-sentence prediction (consecutive sentences)
Fine-tuning: full fine-tuning
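
A minimal sketch of BERT's masking recipe (function name mine; special tokens are not excluded here for brevity): 15% of tokens become prediction targets, of which 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged.

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Pick 15% of positions as MLM targets; ignore the rest in the loss.
    labels = input_ids.clone()
    target = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~target] = -100  # -100 = ignored by PyTorch cross-entropy

    # 80% of targets -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & target
    input_ids[to_mask] = mask_token_id

    # 10% of targets -> random token (half of the remaining 20%)
    to_rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & target & ~to_mask
    input_ids[to_rand] = torch.randint(vocab_size, input_ids.shape)[to_rand]

    # the last 10% of targets keep their original token
    return input_ids, labels
```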


(2022 OpenAI) Whisper for multilingual speech recognition

OpenAI Whisper 精读【论文精读】 - YouTube

Model: Transformer (audio encoder + text decoder)
Training: Weakly Supervised Learning


Summary
  • "High-end ingredients need only the plainest cooking": a simple method applied to great data
  • Large-scale weakly supervised learning (~680k hours of labeled audio)
  • The model is just a vanilla Transformer: input is a Mel spectrogram, output is the transcript (minimal usage sketch below)
  • Pros: ROBUST, no fine-tuning needed, SOTA on many tasks
  • Con: heavy computation.
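
Since no fine-tuning is needed, using a released checkpoint takes a few lines via the openai-whisper package; "base" and the audio path below are placeholders:

```python
import whisper  # pip install openai-whisper (also needs ffmpeg installed)

model = whisper.load_model("base")      # tiny/base/small/medium/large checkpoints
result = model.transcribe("audio.mp3")  # any local audio file
print(result["text"])
```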

(2022 Microsoft) Neural Corpus Indexer (NCI)

Neural Corpus Indexer 文档检索【论文精读】 - YouTube

Model: Transformer (encoder+decoder)
Training: Supervised Learning


Summary
  • Semantic search with a neural network, end-to-end (document IDs come from hierarchical clustering; sketch below)
  • Reranking was already neural-network based anyway
  • Query generation is very important (it supplies extra query -> doc-ID training pairs)
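
A rough sketch (not the paper's code) of how NCI-style semantic document IDs can be built with hierarchical k-means over document embeddings; `k` and `max_leaf` are illustrative values. The transformer decoder is then trained to generate an ID digit by digit from a query.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(embs, doc_idx=None, k=10, max_leaf=100, prefix=()):
    """Recursively cluster document embeddings; the path of cluster
    indices down the tree becomes each document's semantic ID."""
    if doc_idx is None:
        doc_idx = np.arange(len(embs))
    if len(doc_idx) <= max_leaf:             # leaf: number docs within the cluster
        return {int(d): prefix + (i,) for i, d in enumerate(doc_idx)}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embs[doc_idx])
    ids = {}
    for c in range(k):                        # recurse into each sub-cluster
        ids.update(semantic_ids(embs, doc_idx[labels == c], k, max_leaf, prefix + (c,)))
    return ids
```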

Computer Vision

(2021 Google) ViT: An Image is Worth 16x16 Words

ViT論文逐段精讀【論文精讀】 - YouTube

Model: Transformer (Encoder)
Training: supervised learning, with self-supervised learning only touched on in the last section
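
A sketch of the patch-tokenization step behind the title (function name mine): each image becomes a sequence of flattened 16x16 patches, the "words" fed to the transformer encoder.

```python
import torch

def patchify(imgs, p=16):
    # Split (B, C, H, W) images into N = (H/p)*(W/p) flattened patch tokens.
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return x  # a learned linear projection then maps each patch to the model dim
```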


(2021 Microsoft MSRA) SWIN - Shifted WINdow Transformer for CV

Swin Transformer論文精讀【論文精讀】 - YouTube

Model: Transformer with hierarchy (Encoder)
Training: Supervised Learning


Summary and contribution

Solves two problems of ViT

  • Multi-scale problem: hierarchy (feature maps at multiple resolutions)
  • Computation problem: shifted windows (local attention)

Key to success, borrowed from CNNs

  • Window attention + shifted-window attention replace global attention! (sketch below)

  • Patch merging replaces pooling!
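
A minimal sketch of the two operations above (shapes follow the paper's (B, H, W, C) layout; w=7 is Swin's default window size):

```python
import torch

def window_partition(x, w=7):
    # Split a (B, H, W, C) feature map into non-overlapping w*w windows,
    # so self-attention is computed locally instead of globally.
    B, H, W, C = x.shape
    x = x.reshape(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

def cyclic_shift(x, w=7):
    # Roll features by w//2 before partitioning, so the next block's
    # windows straddle the previous block's window boundaries.
    return torch.roll(x, shifts=(-(w // 2), -(w // 2)), dims=(1, 2))
```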

(2022 Meta FAIR) MAE: Masked Autoencoders Are Scalable Vision Learners

MAE 论文逐段精读【论文精读】 - YouTube

Model: Vision Transformer (ViT encoder + transformer decoder + linear projection)
Training: SSL (masked auto-encoding)
Fine-tuning: full fine-tuning, or linear probing (train only the last layer)


Summary and contribution
  • Lineage: Transformer -> BERT -> ViT (CV) -> MAE (the BERT of CV)

  • Parallel milestones in NLP and computer vision:

    | Milestone | NLP | Computer Vision |
    | --- | --- | --- |
    | First to use the transformer (but with supervised learning) | Transformer | ViT |
    | First to use self-supervised learning and achieve good results | BERT | MAE |
    | First to use self-supervised learning on a large model (>1B) | GPT-2 (1.5B?), auto-regressive | MAE on ViT-Huge (?), auto-encoding |


  • The "AE" here only stresses auto + encoder; it has nothing to do with the classic MLP auto-encoder or convolutional auto-encoder! The encoder is still a transformer architecture!
  • MAE is built on ViT, which is built on the Transformer
  • Is the decoder an MLP? NO! It is still transformer blocks, but the encoder and decoder are asymmetric: the encoder processes only the unmasked patches (just 25% of the image), while the decoder processes all patches (input includes the masked positions, output is the reconstructed patches), yet the decoder is only ~10% the size of the encoder (masking sketch after this list)!!

  • MAE with ViT-Huge on ImageNet-1K reaches 87.8%, far better than the ViT paper's self-supervised result with ViT-B (~80%)!

  • Does ViT-22B use supervised or self-supervised training?
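
A sketch of MAE's random masking (names mine; mirrors the logic of the official implementation): only the kept 25% of patch tokens ever enter the encoder.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    # Keep a random 25% of patch tokens for the encoder; the other 75%
    # are represented by mask tokens only at the (small) decoder.
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N).argsort(dim=1)  # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_shuffle  # shuffle ids are needed to unshuffle for the decoder
```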

(2023 Amazon) AIM: Adapting Image Models for Efficient Video Action Recognition

大模型时代下做科研的四个思路【论文精读·52】 - YouTube

Model: Vision Transformer (ViT)
Training: Pre-trained ViT (frozen)
Fine-tuning: Adapter (spatial, temporal, joint)


Summary
  • Spatial adaptation + temporal adaptation + joint adaptation, each via a small adapter module (sketch below)
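
A minimal sketch of the bottleneck adapter used for each adaptation (dimensions illustrative): the pre-trained ViT stays frozen and only these small modules are trained.

```python
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project -> GELU -> up-project, with a
    # residual connection so the frozen pre-trained path is preserved.
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):  # x: (B, N, dim) token features
        return x + self.up(self.act(self.down(x)))
```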