Introduction
This post covers two things:
- Datasets: Alpaca and WikiText-2
- How to avoid OOM (Out-Of-Memory) during training, using Liger or Unsloth
Alpaca Dataset
Purpose, origin, and background of the Alpaca dataset
Purpose of the Alpaca dataset
- Fine-tuning the instruction-following ability of large language models (LLMs):
  - The main purpose of the dataset is to improve a language model's ability to follow human instructions.
  - It trains models to generate outputs that fit the context, stay on task, and respond well to instructions.
- Low-cost dataset creation:
  - The Alpaca dataset was created as an affordable alternative to proprietary instruction datasets.
  - Its creators showed how a high-quality fine-tuning dataset can be built at low cost with the help of a generative model such as OpenAI's GPT.
- Education and research:
  - The dataset is commonly used in research, experiments, and open-source projects.
  - The AI community often uses it to fine-tune models such as LLaMA (Meta AI's large language model).
Origin of the Alpaca dataset
- Developed at Stanford University:
  - The Alpaca dataset was originally released by researchers at Stanford University.
  - It was created to fine-tune LLaMA, the foundation model developed by Meta AI.
- Creation method:
  - The researchers generated a synthetic dataset using OpenAI's text-davinci-003 (GPT-3.5) model.
  - Seed instructions: 175 manually written instructions were used as the starting point.
  - Augmentation: GPT-3.5 expanded these seeds into 52,000 unique instruction-response pairs by rephrasing, adding diversity, and ensuring coverage.
- Release and accessibility:
  - The dataset was released publicly as part of the research to encourage training of instruction-following models.
  - It is publicly available, but because it relies on outputs from an OpenAI model, the relevant licensing terms apply.
Dataset structure
The Alpaca dataset contains:
- Instruction: describes the task to be completed (e.g., "Summarize the following article").
- Input: optional input providing extra context (e.g., the article to be summarized).
- Output: the expected response to the instruction.
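A minimal sketch of how to inspect these fields, assuming the `datasets` library and access to the `tatsu-lab/alpaca` dataset on the Hugging Face Hub:

```python
# Minimal sketch: inspect the instruction / input / output fields of Alpaca.
# Assumes the `datasets` library and network access to the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
print(ds)                      # num_rows and column names
example = ds[0]
print(example["instruction"])  # the task description
print(example["input"])        # optional context (may be empty)
print(example["output"])       # the expected response
```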
Related datasets and projects
- Comparison of Stanford Alpaca with other datasets:
  - Unlike proprietary instruction datasets such as OpenAI's, Alpaca is synthetically generated and publicly available.
  - Other datasets such as FLAN (Fine-tuned LAnguage Net) and T0 (Trained on Tasks) also target instruction tuning, but their methods and sources differ.
- Derived variants:
  - Yahma's Alpaca-Cleaned: a cleaner version of the original Alpaca dataset, focused on removing noise and inconsistencies.
  - AlpaGasus: a filtered variant that keeps only a high-quality subset of Alpaca examples (scored with GPT-3.5-Turbo).
Significance in AI research
- Open-source innovation:
  - The dataset showed how fine-tuning for instruction tasks can be done efficiently and economically.
  - It inspired many open-source fine-tuned models, such as Alpaca-LLaMA.
- Ethical concerns:
  - Using GPT-generated data raises copyright and intellectual-property questions.
  - Researchers must ensure they comply with the licensing terms when using these datasets.
- Educational use:
  - The Alpaca dataset is frequently used in teaching to demonstrate fine-tuning and instruction-following tasks.
Further reading
- The official Stanford Alpaca blog post: describes the dataset's creation method and goals.
- Hugging Face: hosts many Alpaca-based models and derived datasets.
- Meta AI's LLaMA model: background on the base model being fine-tuned.
How to write a good paper: https://www.youtube.com/watch?v=1hEI_yIkIl0&ab_channel=Tunadorable
Note: always set max_seq_length in the tokenizer function and nowhere else. Otherwise the model's default max_length (the size of the position encoder) is used, and both the GPU memory usage and the loss will be wrong!
Can labels be skipped when using Hugging Face's Trainer class? No. But if the loss is computed directly, are labels needed? No. The Trainer class requires labels.
Tokenizer: does it matter whether padding and truncation are set to left or right? No difference.
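A minimal sketch of this tokenization step; the model name, dataset, and max_seq_length value are assumptions for illustration:

```python
# Minimal sketch of the tokenization step: set max_length (and truncation) here,
# not in the model or Trainer, so GPU memory and loss behave as expected.
# Model name, dataset, and max_seq_length are assumptions for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

max_seq_length = 512
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        max_length=max_seq_length,   # cap the sequence length here
        truncation=True,
        padding="max_length",
    )

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw.column_names)
```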
Conclusions after adding Liger's memory reduction
- Liger gives only a small gain (<1GB) in absolute memory, yet it lets Llama3.2-3B go from batch 4 (without Liger) to batch 8 (with Liger) at length 512. Not sure whether I misconfigured something.
- Liger uses Triton and automatically defaults to the GPU (is this a Triton characteristic or the Liger library?). Debugging on CPU raises errors!
- Avoid batch = 1 or 2: the initial loss is too large and convergence is slow.
- batch = 8 is more efficient than batch = 4. This seems to be a characteristic of Nvidia GPUs.
- WikiText-2 is the more stable dataset. It is about 10x larger than Shakespeare but smaller than Alpaca. Alpaca does not converge stably, probably a configuration issue?
- Alpaca's loss is 1.3 (self-entropy).
With a small model (1B) and a large batch (8 or more), the gain looks quite good, up to a 2x difference!
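For reference, a minimal sketch of enabling Liger kernels before training, based on the liger-kernel package's patching API; treat the exact function names as assumptions to check against your installed version:

```python
# Minimal sketch: patch a Llama model with Liger's Triton kernels before loading it.
# Assumes `pip install liger-kernel`; function names may differ across versions.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # monkey-patches the HF Llama modules (RMSNorm, SwiGLU, fused CE, ...)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",           # Liger's Triton kernels need a GPU; CPU debugging will fail
)
```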
Dataset observations
- WikiText-2: simpler, with lower self-entropy, so the loss can reach 0.3.
- tatsu-lab/alpaca: more complex, and possibly multilingual, so the self-entropy is higher; the loss sits around 1.3-1.4.
- yahma/alpaca-cleaned: a cleaned version of Alpaca, so the self-entropy is lower and it is easier to fit; the loss is around 1.2.
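Since these losses are per-token cross-entropies, they map directly to perplexity via exp(loss); a quick check:

```python
# Convert the reported per-token cross-entropy losses to perplexities (ppl = exp(loss)).
import math

for name, loss in [("WikiText-2", 0.3), ("tatsu-lab/alpaca", 1.3), ("yahma/alpaca-cleaned", 1.2)]:
    print(f"{name}: loss={loss} -> ppl={math.exp(loss):.2f}")
# WikiText-2: ppl ~ 1.35, tatsu-lab/alpaca: ppl ~ 3.67, yahma/alpaca-cleaned: ppl ~ 3.32
```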
Differences Between Various Alpaca Datasets
| Feature | Stanford Alpaca | Tatsu-Lab Alpaca | Yahma Alpaca-cleaned |
|---|---|---|---|
| Source | Stanford team, generated by text-davinci-003 | Derived from Stanford | Derived from Tatsu-Lab |
| Size | ~52k | ~52k | ~51.8k |
| Cleaning | Minimal (few checks) | Minimal (some fixes) | Comprehensive |
| Format Issues | May have duplicates or errors | Contains minor issues | Fixed |
| Fields | instruction, input, output | Same as Stanford | Same as Stanford |
| Enhancements | Basic instruction tuning | Slight improvements | Thorough cleanup |
| Usage | General fine-tuning | Fine-tuning with minimal cleaning | Robust fine-tuning |
Comparison with Other Alpaca Datasets
| Dataset | num_rows | features | Quality Focus | Notable Features |
|---|---|---|---|---|
| Stanford Alpaca | 52,002 | output, input, instruction | General diversity | Generated with text-davinci-003; varied quality |
| Tatsu Lab Alpaca (closest to Stanford) | 52,002 | output, input, instruction, text | Instruction-focused | Structured for easy integration with Hugging Face; customizable training parameters |
| Yahma Alpaca Cleaned | 51,760 | output, input, instruction | Improved quality | Need to create the text field yourself. Longer prompts; reduced noise; better performance |
| AlpaGasus | ~9,000 | – | High-quality selection | Filtered using GPT-3.5-Turbo for scoring; no hosted dataset |
| WikiText-2 | 36,718 / 3,760 / 4,358 (train / validation / test) | text | – | – |
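Because yahma/alpaca-cleaned ships without the pre-built `text` column that tatsu-lab/alpaca has, the prompt text must be assembled from the other fields. A minimal sketch; the prompt template below is an assumption modeled on the common Alpaca prompt format:

```python
# Minimal sketch: build a `text` field for yahma/alpaca-cleaned, which lacks the
# pre-built prompt column that tatsu-lab/alpaca provides. The template below is
# an assumption based on the common Alpaca prompt format.
from datasets import load_dataset

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def add_text(example):
    template = PROMPT_WITH_INPUT if example["input"] else PROMPT_NO_INPUT
    return {"text": template.format(**example)}

ds = load_dataset("yahma/alpaca-cleaned", split="train")
ds = ds.map(add_text)
print(ds[0]["text"])
```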
| Model | Size, Precision | Length | Dataset | Batch | Liger | GPU | GPU Mem | Loss |
|---|---|---|---|---|---|---|---|---|
| Llama3.2-3B | 6.5GB, FP16 | 512 | Tatsu Alpaca | 1 | Y | A100 40GB | 31GB | 2.xx |
| | | | | 1 | N | | 31.7GB | 2.xx |
| | | | | 4 | Y | | 31.9/31.2GB | 1.3 |
| | | | | 4 | N | | 34.4GB | 2.xx |
| (use SFTTrainer) | | | | 8 | Y | | 35/38GB | 1.3 |
| | | | | 8 | N | | OOM >40GB | 2.xx |
| | | | | 16 | Y | | 40GB | 2.xx |
| | | | | 16 | N | | OOM >40GB | 2.xx |
| | | | WikiText-2 | 1 | Y | | 30.8GB | ~0.5 |
| | | | | 1 | N | | 31.5GB | ~0.5 |
| | | | | 2 | Y | | 31.1GB | 0.3 |
| | | | | 2 | N | | 33.4GB | 0.3 |
| | | | | 4 | Y | | 33.2GB | 0.31 |
| | | | | 4 | N | | 33.9GB | 0.31 |
| | | | | 8 | Y | | 32.7GB | 0.3 |
| | | | | 8 | N | | OOM >40GB | |
| | | | | 16 | Y | | OOM >40GB | |
| | | | | 16 | N | | OOM >40GB | |
| Llama3.2-1B | 2.5GB, BF16 | 512 | WikiText-2 | 1 | Y | A100 40GB | 12.1GB | ~0.4 |
| | | | | 1 | N | | 12.6GB | ~0.4 |
| | | | | 8 | Y | | 14.2GB | 0.4 |
| | | | | 8 | N | | 24.7GB | 0.4 |
| | | | | 16 | Y | | 19GB | 0.35 |
| | | | | 16 | N | | 38.7GB | 0.35 |
| | | | | 32 | Y | | 29.7GB | 0.34 |
| | | | | 32 | N | | OOM >40GB | 0.34 |
| (use SFTTrainer) | | 512 | Alpaca | 8 | Y | L4 23GB | 17GB | 1.4 |
| | | | | 8 | N | | OOM >23GB | |
| (use SFTTrainer) | | 512 | Tatsu Alpaca | 8 | Y | 3060 12GB | 12GB | 1.4 |
| (use Trainer) | | | WikiText-2 | 8 | Y | L4 | 14.2GB | 0.35 |
| | | | | | | 3060 | 12GB | 0.31 |
| gpt2 (120M) | 0.5GB, FP32 | 1024 | WikiText-2 | 8 | N | 3060 12GB | 12GB | 0.23 |
| (pad left) | | 1024 | | | | | 12GB | 0.23 |
| | | 512 | | | | | 9.3GB | 0.5 |
| | | 300 | | | | | 6.7GB | 0.8 |
Source: ChatGPT
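For context on how the Trainer runs in the table above can be set up, here is a minimal sketch of causal-LM fine-tuning on WikiText-2. The model name, batch size, and sequence length follow the table; all other settings are assumptions, not the exact settings used for those runs:

```python
# Minimal sketch of a Trainer-based causal-LM run on WikiText-2, roughly matching
# the Llama3.2-1B / length 512 / batch 8 rows above. Hyperparameters other than
# those are assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_fn(batch):
    return tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=["text"])

# The collator copies input_ids into labels (mlm=False => causal LM), so the
# Trainer gets the labels it requires.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"], data_collator=collator)
trainer.train()
```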
For computing perplexity, WikiText-2 is commonly used because it is small and of good quality.
Comparison Table
| Feature | WikiText-2 | WikiText-103 | enwik8 |
|---|---|---|---|
| Size | ~2M tokens | ~103M tokens | ~100M characters |
| Vocabulary Size | ~33,000 tokens | ~267,000 tokens | N/A (raw character-level) |
| Preprocessing | Minimal | Minimal | None (includes raw text) |
| Task Focus | Word-level modeling | Word-level modeling | Character-level modeling |
| Use Cases | Small-scale experiments | Large-scale pretraining | Byte/character-level tasks |
| Computational Cost | Low | High | Moderate |
The WikiText-2, WikiText-103, and enwik8 datasets are commonly used for training and evaluating language models. Here’s a detailed comparison of their differences:
1. WikiText-2
- Description:
  - A smaller version of the WikiText-103 dataset.
  - Contains high-quality, clean English text extracted from Wikipedia articles.
- Characteristics:
  - Size:
    - Training: ~2 million tokens
    - Validation: ~217,000 tokens
    - Test: ~245,000 tokens
  - Vocabulary: ~33,000 unique tokens.
  - Designed for quick experimentation and smaller-scale model development.
- Use Case:
  - Useful for testing new language model architectures or techniques without requiring significant computational resources.
- Trade-offs:
  - Smaller corpus limits its usefulness for pretraining large language models.
  - Overfitting is a concern for larger models.
2. WikiText-103
- Description:
  - A larger, more comprehensive version of the WikiText-2 dataset.
  - Extracted from English Wikipedia, with minimal preprocessing to maintain the natural structure of sentences and paragraphs.
- Characteristics:
  - Size:
    - Training: ~103 million tokens
    - Validation: ~217,000 tokens
    - Test: ~245,000 tokens
  - Vocabulary: ~267,000 unique tokens.
  - Retains long-term dependencies by preserving full article structure.
  - Includes rare and less frequent words due to its larger size.
- Use Case:
  - Suitable for training larger language models.
  - Useful for evaluating long-range dependency handling in language models.
- Trade-offs:
  - Requires more computational resources compared to WikiText-2.
  - Slower for rapid prototyping.
3. enwik8
- Description:
  - A dataset derived from the first 100 million characters of an English Wikipedia XML dump.
  - Focuses on character-level language modeling rather than token-based processing.
- Characteristics:
  - Size:
    - Training: ~90 million characters
    - Validation: ~5 million characters
    - Test: ~5 million characters
  - Processed as raw text, meaning punctuation, HTML tags, and special characters are included.
  - Designed for character-level tasks, unlike WikiText, which is word-level.
- Use Case:
  - Character-level language model research and compression algorithms.
  - Ideal for exploring models with byte-level representations or subword tokenization.
- Trade-offs:
  - Requires more steps to tokenize and preprocess compared to the WikiText datasets.
  - May not be as suitable for word-level language modeling tasks.
Summary
- WikiText-2: Best for quick experiments and smaller models.
- WikiText-103: Preferred for pretraining or evaluating word-level models on long-range dependencies.
- enwik8: Ideal for character-level tasks or byte-level processing research.
The format and style of samples from WikiText-2, WikiText-103, and Enwik8 are summarized below, based on the general characteristics of the datasets:
WikiText-2
- Format: Plain text, tokenized with spaces.
- Style: Contains diverse topics, but focuses on structured sentences with proper grammar.
WikiText-103
- Format: Plain text, similar to WikiText-2 but much larger in size (over 100 million tokens).
- Style: Same format and structure as WikiText-2 but covering a much broader range of topics and depth.
Enwik8
- Format: Raw ASCII text (character-level).
- Style: Includes Wikipedia content but focuses on unprocessed text (e.g., no tokenization, retains all special characters, and formatted as characters).
Key Observations
- Tokenization:
  - WikiText-2 and WikiText-103: tokenized at the word level, stored as plain text.
  - Enwik8: character-level, raw ASCII format.
- Structure:
  - The WikiText datasets are relatively clean and focus on proper Wikipedia articles.
  - Enwik8 is raw and retains formatting and metadata such as `<text>` tags.
- Purpose:
  - WikiText-2: suitable for quick experiments and small-scale language modeling.
  - WikiText-103: large-scale language modeling with diverse and deep content.
  - Enwik8: focuses on character-level language modeling.
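A short Python sketch for viewing samples from these datasets; the Hugging Face dataset and config names below are the commonly used ones and should be verified against the Hub, while enwik8 is usually downloaded directly as a raw file:

```python
# Minimal sketch: peek at raw samples from WikiText-2 and WikiText-103 via the
# Hugging Face Hub. Dataset/config names are the commonly used ones; verify them.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

print(wikitext2[10]["text"])    # word-level, cleaned Wikipedia text
print(wikitext103[10]["text"])  # same format, much larger corpus

# enwik8 is the first 100M bytes of an English Wikipedia XML dump and is commonly
# downloaded from http://mattmahoney.net/dc/enwik8.zip, then read as raw bytes:
#   with open("enwik8", "rb") as f:
#       print(f.read(500))
```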
Appendix
Hugging Face Perplexity Example
Compute perplexity using GPT-2 ($n_{ctx} = 1024$).
Here we use Hugging Face's GPT-2 and the WikiText dataset. Install the following packages first.
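A minimal setup sketch, following the standard Hugging Face perplexity example:

```python
# Setup sketch, following the standard Hugging Face perplexity example.
# Install the required packages first:
#   pip install torch transformers datasets
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
```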
We evaluate PPL on the WikiText-2 dataset.
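A sketch of the sliding-window evaluation, continuing from the setup above; the stride of 512 follows the Hugging Face guide:

```python
# Sliding-window perplexity evaluation on the WikiText-2 test split
# (continues from the model/tokenizer setup above).
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = model.config.n_positions  # 1024 for GPT-2
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc          # only score tokens new to this window
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100           # mask the overlapping context

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print(ppl)  # should come out close to the value reported below
```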
The final result is ppl = 16.45.