Transformer for Speech Recognition

Google is currently the leader in speech recognition, relying mainly on the RNN-T (Recurrent Neural Network Transducer) model. RNN-T is a type of sequence-to-sequence model that does not use an attention mechanism. Moreover, whereas a typical sequence-to-sequence model needs the complete input sequence before producing any output (e.g., in translation), RNN-T can process input samples continuously and stream output symbols. This is essential for real-time speech recognition.

The input of RNN-T is a stream of acoustic frames, and the output is characters of the alphabet.
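
Here is a minimal PyTorch sketch of the RNN-T structure (all sizes and names are illustrative assumptions, not Google's production model). The encoder consumes acoustic frames as they arrive and the prediction network consumes previously emitted labels; because the joint network only combines these two streams, output symbols can be emitted without waiting for the full input sequence.

```python
import torch
import torch.nn as nn

class RNNT(nn.Module):
    """Minimal RNN-T sketch; hypothetical dimensions."""
    def __init__(self, feat_dim=80, vocab=29, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # acoustic frames
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)  # past labels
        self.joint = nn.Linear(2 * hidden, vocab + 1)               # +1 for blank

    def forward(self, frames, labels):
        f, _ = self.encoder(frames)                 # (B, T, H)
        g, _ = self.predictor(self.embed(labels))   # (B, U, H)
        # Logits for every (frame, label) pair: (B, T, U, vocab + 1).
        # Streaming decoding walks this lattice frame by frame.
        j = torch.cat([f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
                       g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1)
        return self.joint(j)
```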

[Figure: RNN-T architecture]

A modified version with an attention mechanism (Google, 2017), shown below, achieves a better WER (word error rate).

[Figure: RNN-T with attention]

A further step is transformer-based ASR: the Conformer (Google, 2020). The Conformer is a speech recognition model proposed by Google in 2020, built on the Transformer. The key idea is that the Transformer is more effective at capturing long-range dependencies, while convolution excels at extracting local features, so convolution is inserted into the Transformer encoder layers to improve the model on both long-range and local patterns. The method proved effective, achieving the best results on the LibriSpeech test sets at the time. It abandons the recurrent model entirely.
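
Below is a minimal sketch of one Conformer block, under assumed dimensions (not the paper's exact hyperparameters): self-attention provides the long-range dependencies, a depthwise-convolution module provides the local features, and the pair is sandwiched between two half-step feed-forward layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Local-feature branch: pointwise conv + GLU, depthwise conv, pointwise conv."""
    def __init__(self, dim=256, kernel=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                        # x: (B, T, C)
        y = self.norm(x).transpose(1, 2)         # (B, C, T) for the conv layers
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))
        return x + y.transpose(1, 2)             # residual connection

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention (global) -> conv (local) -> half-step FFN."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, T, C)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]
        x = self.conv(x)                          # includes its own residual
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)
```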

[Figure: Conformer encoder architecture]

Attention Mechanism

Spectrum of Approaches

|  | Pure Transformer | Transformer + CNN | CNN + Attention | Pure CNN |
| --- | --- | --- | --- | --- |
| Approach | Attention on image patches | Attention on image/feature patches | CNN layers + attention | CNN layers |
| Notes | Three major issues: 1. image-patch receptive field; 2. position embedding; 3. reducing attention computation | Features from later CNN layers are more useful |  |  |
| Pro |  |  |  | No need for position embedding |
| Con |  |  |  | Only local attention, no global attention |
| Examples | ViT, Swin | DeiT-S (ViT-S) |  | ResNet-50 |

[Figure: three-stage abstraction of the attention computation]

Through the three-stage computation above, the attention value for a given Query is obtained. Most concrete attention mechanisms in use today follow this three-stage abstract computation process.

To summarize attention: \(\operatorname{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\operatorname{softmax}\left(\frac{\boldsymbol{Q} \boldsymbol{K}^{\top}}{\sqrt{d_{k}}}\right) \boldsymbol{V}\), where $\boldsymbol{Q} \in \mathbb{R}^{n \times d_{k}}, \boldsymbol{K} \in \mathbb{R}^{m \times d_{k}}, \boldsymbol{V} \in \mathbb{R}^{m \times d_{v}}$. If we ignore the softmax activation function, this is just three matrix multiplications whose final result is an $n \times d_v$ matrix. We can therefore view the attention layer as encoding an $n \times d_k$ sequence Q into a new $n \times d_v$ sequence.
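
The formula transcribes directly into code. In this PyTorch sketch the three lines of the function correspond to the three stages above: similarity scores, softmax normalization, and the weighted sum.

```python
import torch

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # stage 1: (n, m) similarities
    weights = torch.softmax(scores, dim=-1)        # stage 2: normalize over keys
    return weights @ V                             # stage 3: weighted sum, (n, d_v)

Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 32)
print(attention(Q, K, V).shape)  # torch.Size([5, 32]): n x d_v
```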


Vision Transformer

Can the transformer serve as a vision backbone network, playing a role similar to that of the CNN? It could then be used for low-level vision tasks such as detection, segmentation, and even quality enhancement, as well as mid- and high-level vision tasks such as classification, image captioning, and so on.

The advantages of CNNs are computational efficiency, low-level feature extraction (filter-like building blocks), and translation equivariance for low-level vision. The bottom-up structure increases the receptive field, while pooling provides scale invariance and translation invariance.
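
Translation equivariance is easy to verify numerically. The sketch below uses circular padding so that a cyclic shift commutes exactly with the convolution (with ordinary zero padding the identity holds everywhere except near the borders):

```python
import torch
import torch.nn as nn

# Shifting then convolving equals convolving then shifting.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)
x = torch.randn(1, 1, 32, 32)
a = conv(torch.roll(x, shifts=2, dims=-1))  # shift input, then convolve
b = torch.roll(conv(x), shifts=2, dims=-1)  # convolve, then shift output
print(torch.allclose(a, b, atol=1e-6))      # True: conv commutes with shifts
```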

|  | CNN in Vision | Transformer in Vision | Transformer in NLP |
| --- | --- | --- | --- |
| Low-level vision: translation equivariance | Yes | Need to solve | No need |
| Low-level vision: scale equivariance | CNN pyramid |  | No need |
| High level: translation invariance | Feature pyramid + FC layer |  | Yes |
| High level: scale invariance | Feature pyramid + FC layer |  | No |
| High level: permutation sensitivity |  | Position embedding | Position encoding |
| Vision characteristics | Best fit for vision spatial locality |  | Not constrained by vision locality |

This article mainly discusses the transformer for vision. There are two types: (I) the pure transformer, i.e., Vision Transformer or ViT; (II) combined CNN + attention (not discussed in this article).
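
For a concrete picture of type (I), here is a minimal ViT-style sketch (all sizes are illustrative assumptions): the image is cut into non-overlapping patches, each patch is linearly embedded as a token, a class token and learned position embeddings are added, and a plain Transformer encoder followed by a linear head performs classification.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier sketch; hypothetical sizes."""
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2                        # number of patch tokens
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                              # x: (B, 3, H, W)
        t = self.patchify(x).flatten(2).transpose(1, 2)       # (B, N, dim) tokens
        t = torch.cat([self.cls.expand(t.size(0), -1, -1), t], dim=1) + self.pos
        return self.head(self.encoder(t)[:, 0])       # classify on the class token

print(TinyViT()(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 1000])
```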

Why introduce the transformer or attention to vision? (1) Better performance for classification, detection, segmentation, or quality enhancement; (2) combined vision + NLP tasks such as image captioning, etc.

CNN Pro and Con

|  | CNN | Transformer/Attention |
| --- | --- | --- |
| Receptive field | Small; needs deep layers | Wide; uses FC layer |
| Feature | Local features | Long-range features |
| Scope | Bottom-up | Top-down |

combine feature extraction and …?

Pure transformer

Transformer + CNN

Self Attention Module (Type II, uses CNN feature maps; not discussed in this article)

Straightforward:

  1. Image or feature patches used for self-attention

[Figure: self-attention on image patches]

  2. Feature patches used for self-attention (a sketch follows below)
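
A minimal sketch of the feature-patch case (sizes are illustrative): each spatial position of a CNN feature map becomes a token, so self-attention sees the entire map at once rather than a local convolutional window.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 256, 14, 14)            # (B, C, H, W) CNN feature map
tokens = feat.flatten(2).transpose(1, 2)      # (B, H*W, C): 196 patch tokens
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out, weights = attn(tokens, tokens, tokens)   # global attention over all patches
print(out.shape, weights.shape)               # (1, 196, 256), (1, 196, 196)
```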

Three stages (a pyramid sketch follows this list):

  1. Image
  2. Image pyramid (spatial pyramid) for scale invariance
  3. Image + feature pyramid (CNN) for scale invariance and receptive field
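
A minimal sketch of the image (spatial) pyramid in stage 2 (shapes are illustrative): repeatedly downsampling makes a fixed-size patch cover a progressively larger region, which approximates scale invariance and enlarges the receptive field.

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 256, 256)
# Three pyramid levels at full, half, and quarter resolution.
pyramid = [F.interpolate(img, scale_factor=0.5 ** k, mode='bilinear',
                         align_corners=False) for k in range(3)]
for level in pyramid:
    print(tuple(level.shape))  # (1,3,256,256) -> (1,3,128,128) -> (1,3,64,64)
```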

Transformer Pro and Con

Separate feature extraction and …?

References