Work

Neuro Synapse Analogy

  1. Chemical synapse: weight
  2. Electrical synapse: shortcut
  3. How about an adjacent-modulation synapse? Can it be modeled mathematically? 20241014120744

Takeaway

  • Train Mamba: https://www.youtube.com/watch?v=qUfZruIKwtc&ab_channel=Oxen

  • https://www.youtube.com/watch?v=qUfZruIKwtc&ab_channel=Oxen -> for RAG

https://www.youtube.com/@oxen-ai/videos -> Oxen is an excellent programmer

Reference

Hepta. “How to Judge RWKV (arXiv 2305.13048)?,” September 15, 2023. https://www.zhihu.com/question/602564718/answer/3211669817.

[Efficiently Modeling Long Sequences with Structured State Spaces - Albert Gu Stanford MLSys #46 (youtube.com)](https://www.youtube.com/watch?v=EvQ3ncuriCM)

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained) (youtube.com)

https://www.youtube.com/watch?v=8Q_tqwpTpVU&ab_channel=UmarJamil

https://www.youtube.com/watch?v=iskuX3Ak9Uk&ab_channel=TrelisResearch. Good!

Can the bone piezoelectric effect be used for AI?

Exploit the random-select-and-update property of Nyströmformer (Performer): periodically purge unused weights or KV-cache entries while reinforcing frequently used weights, or change the quantization level of unused ones, e.g., int8 -> int4 -> int2.
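A minimal sketch of this idea, assuming a per-entry hit counter and a fake-quantization helper; all names here (CacheEntry, demote_cold_entries) are hypothetical, not any library's API:

```python
# Sketch: track how often each KV-cache entry is attended to, and demote
# rarely used entries to coarser quantization (int8 -> int4 -> int2).
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric fake-quantization to 2**bits levels (simulation only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

class CacheEntry:
    def __init__(self, kv: torch.Tensor):
        self.kv = kv      # cached key/value tensor for one token
        self.hits = 0     # how often this entry received attention mass
        self.bits = 8     # current quantization level

def demote_cold_entries(cache: list[CacheEntry], hit_threshold: int = 4):
    """Periodically demote entries that were rarely attended to."""
    for entry in cache:
        if entry.hits < hit_threshold and entry.bits > 2:
            entry.bits //= 2                  # int8 -> int4 -> int2
            entry.kv = fake_quantize(entry.kv, entry.bits)
        entry.hits = 0                        # reset counters for the next window

# toy usage
cache = [CacheEntry(torch.randn(4)) for _ in range(3)]
cache[0].hits = 10                            # "hot" entry keeps 8 bits
demote_cold_entries(cache)
print([e.bits for e in cache])                # [8, 4, 4]
```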

Write a book similar to the Nobel prize winners' work, but applied to companies. Input x: EO level: inclusive vs. ..; input y: working level: inclusive vs. ..; z: capital and gross margin.

PyTorch Transformer Usage

Input: x

PyTorch's transformer modules default to sequence-first (batch_first=False); remember to set batch_first=True. The full 3D input-embedding tensor is [batch_size, block_size, embedding].

Inside the model, x may become a 4D tensor: for multi-head attention the embedding is usually split into head_num x head_size, giving [batch_size, block_size, head_num, head_size].
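A minimal sketch of these shapes, assuming a toy configuration (batch_size=2, block_size=16, embedding=64, head_num=4):

```python
import torch
import torch.nn as nn

batch_size, block_size, embedding = 2, 16, 64
head_num = 4
head_size = embedding // head_num

# nn.TransformerEncoderLayer defaults to batch_first=False (sequence-first),
# so set batch_first=True to use [batch_size, block_size, embedding] inputs.
layer = nn.TransformerEncoderLayer(d_model=embedding, nhead=head_num, batch_first=True)

x = torch.randn(batch_size, block_size, embedding)   # 3D input embedding
y = layer(x)                                          # same shape out
print(y.shape)                                        # torch.Size([2, 16, 64])

# Inside multi-head attention the embedding is split into heads:
x4d = x.view(batch_size, block_size, head_num, head_size)
print(x4d.shape)                                      # torch.Size([2, 16, 4, 16])
```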

How do you set the block_size window? How do you turn on the KV cache? How do you turn bias on and off?

Huggingface Transformer Usage

Input: x

HF transformers tensors default to batch_first; the full 3D input-embedding tensor is [batch_size, block_size, embedding].

Inside the model, x may become a 4D tensor: for multi-head attention the embedding is usually split into head_num x head_size, giving [batch_size, block_size, head_num, head_size].
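A small sketch of the same shapes with HF transformers, using "gpt2" only as a convenient stand-in model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 has no pad token by default
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer(["hello world", "a longer test sentence"],
                   return_tensors="pt", padding=True)
print(inputs["input_ids"].shape)                # [batch_size, block_size]

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# HF models are batch_first: hidden states are [batch_size, block_size, embedding]
print(out.last_hidden_state.shape)
# per-layer attention weights are [batch_size, head_num, block_size, block_size]
print(out.attentions[0].shape)
```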

Training and Inference

attention_mask: first define the valid range via $M_{att} Q$, then compute $QK^T$.

The attention_mask is applied to the input; there are two uses:

  • Input token length < context length (= block_size).
  • Especially when batch > 1, padding is needed to align sequences to the longest one in the batch; the attention_mask can then be used to ignore the padding.
  • The shape of $M_{att}$:

attention_mask = [batch_size, block_size, head_size] = [batch_size, token_length, -1].

batch1 2D: [1 1 1 1 0 0 0] … [1 1 1 1 0 0 0] (only needs to cover the batch's maximum token_length)
batch2 2D: [1 1 1 1 1 1 1] … [1 1 1 1 1 1 1]
batch3 2D: [1 1 1 0 0] … [1 1 1 0 0]
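An illustrative sketch of such an attention_mask built from per-sequence token lengths (toy numbers, not tied to any particular model):

```python
import torch

block_size = 7
token_lengths = torch.tensor([4, 7, 3])

# [batch_size, block_size]: 1 = real token, 0 = padding
attention_mask = (torch.arange(block_size)[None, :] < token_lengths[:, None]).long()
print(attention_mask)
# tensor([[1, 1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 0, 0, 0, 0]])

# When applied to the attention scores QK^T, padded key positions are pushed to -inf
scores = torch.randn(3, block_size, block_size)                  # [batch, query, key]
masked = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
```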

causal_mask: $M_{att} Q K^T + M_{causal}$, used in transformer decoders (GPT-like); encoders such as BERT do not need it.

The causal_mask itself does not depend on the batch (or on block_size), but in use causal_mask = [batch_size, block_size, head_size] = [batch_size, token_length, -1].

batch1: [0 -inf -inf … -inf] [0 0 -inf …] [0 0 0 -inf …] (only needs to cover each sequence's own token_length)
batch2: same as batch1
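A minimal sketch of building and applying this causal mask with torch.triu (toy shapes, assumed for illustration): an upper-triangular matrix of -inf is added to $QK^T$ so each token only attends to itself and the past.

```python
import torch

block_size = 5
causal_mask = torch.triu(torch.full((block_size, block_size), float("-inf")), diagonal=1)
print(causal_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         ...

scores = torch.randn(2, block_size, block_size)        # [batch, query, key]
masked_scores = scores + causal_mask                   # broadcasts over the batch
probs = masked_scores.softmax(dim=-1)                  # future positions get 0 weight
```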

Both attention_mask and causal_mask serve the same purpose, namely masking out (leaving out) tokens which shouldn’t participate in the attention computations.

  • The attention_mask is mainly used to mask out padding tokens, or other special tokens which one doesn’t want to include in the attention computations (padding tokens for instance are only used to ensure all sequences are of the same length so that sentences can be batched together for training). In the HF Transformers library, the tokenizer automatically creates the attention_mask for you and it’s another input to the model besides input_ids.
  • The causal mask serves the same purpose, but is only used by decoder-only (and also for the decoder part of encoder-decoder) models to ensure that the future is masked (the attention computation of a given token should not depend on tokens that come after it). This is to ensure models are trained to predict the next token, and no information gets “leaked”. In the HF Transformers library, this is all taken care of by the model itself, users don’t need to do anything regarding ensuring a causal mask is used.

The following explanation of left vs. right padding overturned my assumptions, but it actually makes a lot of sense. Moreover, according to ChatGPT, padding and truncation are both done on the left, for the reasons below. Also, whether padding goes on the left or the right is embedded in the tokenizer (at least in transformers); it has nothing to do with training vs. inference.

Padding and truncation side decisions (left vs right) depend on the requirements of the model architecture and the specific use case. While right-padding and right-truncation are common defaults, there are valid reasons to use left-padding and left-truncation in specific scenarios.

Reasons for Left Padding and Truncation

  1. Causal Language Models (e.g., GPT):

    • Models like GPT are causal language models that process input tokens in a left-to-right manner.
    • When padding or truncation occurs on the left, the most recent tokens remain at the end of the input sequence. This ensures that:
      • The model focuses on the most relevant context (e.g., for predictions in autoregressive generation).
      • The position embeddings align naturally with the most recent tokens.
  2. Efficient Handling of Variable-Length Sequences:

    • In left-padding, the padding tokens (<pad>) occupy the beginning of the sequence. This reduces the computational burden during masked attention in autoregressive models because the padding tokens appear first and are easily ignored.
    • Truncating from the left ensures that the most recent or important context is preserved, which is critical in tasks where the end of the sequence carries the most significant information.
  3. Token Position Alignment:

    • Models with absolute position embeddings (like GPT) rely on token positions being consistent. Padding on the left ensures the actual tokens align with their expected positions in the sequence.
  4. Conversational Models:

    • In conversation-based tasks, truncating from the left keeps the most recent conversation turns, which are typically more relevant for generating responses.
  5. Model-Specific Design:

    • Some models, particularly those trained with specific datasets or tasks, might expect input to be left-padded/truncated because that’s how they were trained.

Comparison of Left vs. Right Padding/Truncation

| Aspect | Left Padding/Truncation | Right Padding/Truncation |
| --- | --- | --- |
| Common use case | Autoregressive models (GPT) | Bidirectional models (BERT) |
| Focus on | Recent tokens | Earlier tokens |
| Attention computation | Masked attention optimized for causal models | Generally applies to all tokens |
| Alignment with outputs | Tokens align with expected positions | Positions shift due to padding/truncation |

Key in the Code

The code snippet:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    training_args.model_name,
    padding_side="left",
    truncation_side="left",
)
tokenizer.pad_token = tokenizer.eos_token
```

indicates that the model is likely:

  • An autoregressive model, where preserving recent context is more critical than earlier tokens.
  • Handling conversational or sequential data, which benefits from left-side padding/truncation.

By setting padding_side="left" and truncation_side="left", the code ensures alignment with the model’s expected input format and optimizes attention computations.
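A hedged sketch of left-padded batched generation with a decoder-only model; "gpt2" is only a stand-in, swap in the intended model name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The capital of France is", "Hello"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# With left padding, both prompts end at the last position, so generation
# continues directly from the real tokens instead of from pad tokens.
outputs = model.generate(**inputs, max_new_tokens=10,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```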

LLM Inference Only

model.generate() is the generation function, i.e., LLM inference.

search algorithm:

  • greedy_search: pick the maximum-probability token
  • non-greedy search: sample from the softmax (multinomial) distribution with temperature T

Acceleration algorithms: PLD (Prompt Lookup Decoding) and SPD (Speculative Decoding); see the sketch below.
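A sketch of these search and acceleration options through generate(); "gpt2"/"distilgpt2" are placeholder models, and the PLD/SPD kwargs (prompt_lookup_num_tokens, assistant_model) assume a recent transformers version:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The meaning of life is", return_tensors="pt")

# Greedy search: always pick the highest-probability token
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# Sampling: draw from the softmax (multinomial) distribution with temperature T
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=20)

# Prompt lookup decoding (PLD): reuse n-grams from the prompt as draft tokens
pld = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=20)

# Speculative decoding (SPD): a small draft model proposes tokens, the big model verifies
# (assistant_model must share the tokenizer; "distilgpt2" is just an illustrative choice)
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")
spd = model.generate(**inputs, assistant_model=draft, max_new_tokens=20)
```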

HF transformer file: transformers/src/transformers/generation/tf_utils.py

```
class TFGenerationMixin
    def generate
    def _prepare_attention_mask_for_generation
    greedy_search !!
```

HF transformer file: transformers/src/transformers/generation/utils.py

```
PromptLookupCandidateGenerator

Check!!! _assisted_decoding!!!!
```

How do you turn on the KV cache?
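One way to control the KV cache in HF transformers is the use_cache flag (on by default for generation); a small sketch with "gpt2" as a placeholder model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("KV cache test", return_tensors="pt")

with_cache = model.generate(**inputs, max_new_tokens=20, use_cache=True)
no_cache = model.generate(**inputs, max_new_tokens=20, use_cache=False)  # recomputes attention every step

# In a raw forward pass the cache comes back as past_key_values:
out = model(**inputs, use_cache=True)
print(type(out.past_key_values))
```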

Use a sentence VAE for AI agents: function calls, etc.

[Autoregressive decoding of sentence vectors as opposed to tokens (youtube.com)](https://www.youtube.com/watch?v=TwUF1AngZm8&ab_channel=Tunadorable)

RISC-V Day

key message:

  • HPC, AI, Auto
  • From the board: AI (accelerator), Auto, Security (accelerator)
  • what’s the ISA advantage? customizable
  • built-in ISA or external accelerator
  • industrial adopters: NV and Meta
  • SHD number
  • Is CHI important for AI multi-core and multi-cluster?

MTIA v1: NX27V as the vector controller, with Andes Custom Extension (ACE). MTIA v2 (Hot Chips). SRAM CIM AI accelerators: Sapeon, Lightelligence, Rain AI.

Edge AI: legal, medical, enterprise manufacturing (predictive maintenance), networking, automotive.

Extension

  • Matrix multiplication (standard)
    • IME (integrated matrix extension)
    • AME (attached matrix extension)
  • Nonlinear - custom instructions (use ACE)
  • Remaining computation: vector, DSP

Andes: IP thinking

AP: AX66: > 10 SPECint2006. Cuzco: > 15-20. CHI interface for multi-cluster.

AI SoC

  • scalable
  • vector -> Add iME
  • Use coherence fabric

SW: Compiler: TVM -> LLVM. Runtime: TFLite -> XNNPack.

V/M extensions, ACE, open SW, RVA23.

Rivos NX45

Tenstorrent: AP + Tensix cores. Scalability: HW/system thinking.

Primarily focus on data center

Personalization - more computation, nothing particular in Tenstorrent

Distributed computation for best efficiency, using humans as an example.

  • scalable: Tensix / chiplet? How? Distributed computing + CHI?
  • extensible: extend ISA semantics
  • efficient: less baggage from past ISAs
  • stable: multiple suppliers

Question

  • Use Ethernet for both front-end and back-end?
  • Multi-threaded? Multi-issue?
  • Power efficiency?

Mesh NOC from 1 Tensix

Rivos: data center; workload/SW-defined HW thinking.

Llama, models (Mamba/Jamba). Tools: Inductor, Triton. RAG (data and LLM stack):

  • vLLM/Llama + HF embeddings + FAISS vector search
  • Existing model migration and optimization

Recompile not redesign

Linux SW

ISA -> compiler -> Lib -> kernel -> runtime -> framework

Memory size/BW, communication BW, reliability and error containment, cost, power, cooling.

XNNPack, Triton, Kubernetes, PyTorch eager mode.

RISC-V all the way

Question

  • Rivos (64-bit) + Andes (64-bit AP) + LowRISC Ibex (32-bit controller)
  • what’s the difference between Rivos core vs. Andes core? Why?

RISE

Microchip

  1. DSA (domain specific architecture)
  2. Open

Ventana: Data Center HPC

Automotive, data center, edge AI, gen AI.

Scaling Laws

20240828233238

Training: the optimal ratio of model parameters to training tokens is about 1:20 (Llama 3 405B at 15.6T tokens is about 1:40).
Inference: the optimal ratio of model parameters to inference tokens is about 1:150.
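A quick arithmetic check of these ratios, using only the numbers quoted above:

```python
# Rough sanity check of the parameter-to-token ratios (numbers from the note).
params = 405e9           # Llama 3 405B
train_tokens = 15.6e12   # reported training tokens
print(train_tokens / params)                        # ~38.5, i.e. roughly 1:40 vs. the 1:20 "optimal"

# Tokens implied by the 1:150 inference-optimal ratio for the same model size
print(params * 150 / 1e12, "T inference tokens")    # ~60.8T
```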

20240829000156

Hybrid AI

  1. Use the American car as an example of how energy consumption caused the problem; Japanese cars then rose thanks to better power efficiency and just-enough performance.

  2. LLM issues:

    1. Hallucination: RAG
    2. Weak logic/reasoning: CoT + search (use Go as an example); GPT o1 trades off training vs. inference compute.
    3. Sensitivity to prompts: agent frameworks
    4. Cost: smaller models (< 100B)
    5. Energy is the key constraint, and supply grows slowly