Introduction

This post covers two things:

  • Datasets: Alpaca and WikiText-2
  • Training: how to avoid OOM (Out-Of-Memory) by using Liger or Unsloth

Alpaca Dataset

Purpose, origin, and related information about the Alpaca dataset


Purpose of the Alpaca dataset

  1. Fine-tuning the instruction-following ability of large language models (LLMs)

    • The dataset's main purpose is to improve a language model's ability to follow human instructions.
    • It trains the model to produce outputs that fit the context, stay relevant to the task, and respond well to the instruction.
  2. Low-cost dataset creation

    • The Alpaca dataset was created as an affordable alternative to proprietary instruction datasets.
    • Its creators used a generative model (OpenAI's GPT) to show how a high-quality fine-tuning dataset can be built at low cost.
  3. Education and research

    • The dataset is widely used in research, experiments, and open-source projects.
    • The AI community often uses it to fine-tune models such as LLaMA (Meta AI's large language model).

Origin of the Alpaca dataset

  1. Developed at Stanford University
    • The Alpaca dataset was first released by researchers at Stanford University.
    • It was created to fine-tune LLaMA, the foundation model developed by Meta AI.
  2. Creation method
    • The researchers generated a synthetic dataset with OpenAI's text-davinci-003 (GPT-3.5) model.
    • Seed instructions: 175 manually written instructions served as the starting point.
    • Augmentation: GPT-3.5 expanded these seeds into 52,000 unique instruction-response pairs by rephrasing, adding diversity, and ensuring coverage.
  3. Release and accessibility
    • The dataset was released publicly as part of the research to encourage training of instruction-following models.
    • It is openly available, but because it relies on outputs from an OpenAI model, the relevant licensing terms apply.

Dataset structure

The Alpaca dataset contains the following fields (see the example record after this list):

  • Instruction: describes the task to perform (e.g., "Summarize the following article").
  • Input: an optional input that provides extra context (e.g., the article to summarize).
  • Output: the expected response to the instruction.
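For illustration, a single record looks roughly like this (a made-up example that just follows the field definitions above):

example = {
    "instruction": "Summarize the following article.",          # task description
    "input": "The full text of the article to summarize ...",   # optional context
    "output": "A short summary of the article ...",             # expected response
}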

Related datasets and projects

  1. Stanford Alpaca compared with other datasets

    • Unlike OpenAI's proprietary instruction datasets, Alpaca is synthetically generated and publicly available.
    • Other datasets such as FLAN (Fine-tuned LAnguage Net) and T0 (Trained on Tasks) also target instruction tuning, but their methods and sources differ.
  2. Derived variants

    • Yahma's Alpaca-Cleaned: a cleaned-up version of the original Alpaca dataset, focused on removing noise and inconsistencies.
    • AlpaGasus: a filtered variant that uses a GPT model to score the examples and keep only the high-quality ones (see the comparison table below).

Significance in AI research

  1. Open-source innovation

    • The dataset showed how instruction tuning can be done efficiently and cheaply.
    • It inspired many open-source fine-tuned models, such as Alpaca-LLaMA.
  2. Ethical concerns

    • Using GPT-generated data raises copyright and intellectual-property questions.
    • Researchers must make sure they comply with the licensing terms when using these datasets.
  3. Educational use

    • The Alpaca dataset is frequently used in teaching to demonstrate fine-tuning and instruction-following tasks.

Further reading

  • The official Stanford Alpaca blog post: describes the dataset's creation method and goals.
  • Hugging Face: hosts many Alpaca-based models and derived datasets.
  • Meta AI's LLaMA model: background on the base model being fine-tuned.

How to write a good paper: https://www.youtube.com/watch?v=1hEI_yIkIl0&ab_channel=Tunadorable

Note: be sure to set max_seq_length in the tokenizer function, and nowhere else. Otherwise the model's default max_length, which equals the position-encoding size, is used, and both the GPU memory usage and the loss come out wrong!

Can you skip the labels column when using the Hugging Face Trainer class? No: the Trainer needs labels to compute the loss. Do you need a labels column if you compute the loss directly from the logits yourself? No, in that case you can derive the targets from input_ids.

tokenizer: does it matter whether padding and truncation are set to the left or the right? It made no difference in these runs.

def tokenize_function(examples):
    # max_seq_length must be set here (see the note above); otherwise the
    # model default (the position-encoding size) is used.
    tokenized_examples = tokenizer(examples['text'], truncation=True,
                                   padding='max_length', max_length=max_seq_length,
                                   return_tensors='pt')
    # Causal LM: labels are a copy of input_ids (the model shifts them internally).
    tokenized_examples['labels'] = tokenized_examples['input_ids'].clone()
    return tokenized_examples

tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True
)
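One caveat: with labels = input_ids.clone(), the padding tokens also count toward the loss. If you prefer to exclude them, a minimal sketch of a variant (not what the runs below used) sets padded label positions to -100, which the cross-entropy loss inside Hugging Face causal LM models ignores:

def tokenize_function_ignore_pad(examples):
    tokenized = tokenizer(examples['text'], truncation=True,
                          padding='max_length', max_length=max_seq_length)
    # Positions where attention_mask == 0 are padding; label -100 is skipped by the loss.
    tokenized['labels'] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized['input_ids'], tokenized['attention_mask'])
    ]
    return tokenized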

Conclusions on Liger's memory reduction

  • Liger gives only a small gain (<1 GB) at a fixed batch size, yet it lets Llama3.2-3B handle a batch size (at length 512) of 8 with Liger instead of 4 without. I'm not sure whether something in my setup is off.
  • Liger uses Triton and defaults to the GPU automatically (is that a Triton property or a Liger library choice?). Debugging on the CPU raises errors!
  • Avoid batch = 1 or 2: the initial loss is too large and convergence is slow.
  • batch = 8 is more efficient than batch = 4; this seems to be an NVIDIA GPU characteristic.
  • The WikiText-2 dataset is more stable. It is about 10x larger than Shakespeare but smaller than Alpaca. Alpaca would not converge stably; probably a configuration issue?
  • Alpaca's loss is about 1.3 (its self-entropy).

With a small model (1B) and a large batch (8 or more), the gain looks quite good: about a 2x difference! A minimal example of enabling Liger follows.
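A minimal sketch of enabling Liger when loading the model, assuming the liger-kernel package's AutoLigerKernelForCausalLM wrapper and the Llama-3.2-3B checkpoint used above (check the Liger-Kernel README for the exact API of your version):

import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Drop-in replacement for AutoModelForCausalLM that patches the model with
# Liger's Triton kernels before loading the weights.
model = AutoLigerKernelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",   # model id assumed for illustration
    torch_dtype=torch.float16,
    device_map="cuda",           # Triton kernels need a GPU; CPU debugging fails
)

Depending on your transformers/TRL version, a use_liger_kernel training argument may also be available as an alternative way to apply the same patching.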

Dataset observations

  • WikiText-2: simpler text with lower self-entropy, so the loss can reach 0.3.
  • Tatsu-lab/alpaca: more complex, and possibly multilingual, so its self-entropy is higher; the loss sits around 1.3-1.4.
  • Yahma/alpaca-cleaned: a cleaned-up version of Alpaca, so its self-entropy is lower and it is easier to fit; the loss is around 1.2. (These losses map to perplexities as sketched below.)
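Since these losses are mean cross-entropies in nats, they convert to perplexity exactly as in the appendix:

$\mathrm{PPL} = e^{\mathcal{L}}: \quad e^{0.3} \approx 1.35, \qquad e^{1.2} \approx 3.32, \qquad e^{1.3} \approx 3.67$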

Differences Between Various Alpaca Datasets

| Feature       | Stanford Alpaca                              | Tatsu-Lab Alpaca                  | Yahma Alpaca-Cleaned   |
|---------------|----------------------------------------------|-----------------------------------|------------------------|
| Source        | Stanford team, generated by text-davinci-003 | Derived from Stanford             | Derived from Tatsu-Lab |
| Size          | ~52k                                         | ~52k                              | ~51.8k                 |
| Cleaning      | Minimal (few checks)                         | Minimal (some fixes)              | Comprehensive          |
| Format Issues | May have duplicates or errors                | Contains minor issues             | Fixed                  |
| Fields        | instruction, input, output                   | Same as Stanford                  | Same as Stanford       |
| Enhancements  | Basic instruction tuning                     | Slight improvements               | Thorough cleanup       |
| Usage         | General fine-tuning                          | Fine-tuning with minimal cleaning | Robust fine-tuning     |

Comparison with Other Alpaca Datasets

| Dataset                                | num_rows                                            | features                         | Quality Focus          | Notable Features                                                                     |
|----------------------------------------|-----------------------------------------------------|----------------------------------|------------------------|--------------------------------------------------------------------------------------|
| Stanford Alpaca                        | 52,002                                              | output, input, instruction       | General diversity      | Generated with text-davinci-003; varied quality                                      |
| Tatsu Lab Alpaca (closest to Stanford) | 52,002                                              | output, input, instruction, text | Instruction-focused    | Structured for easy integration with Hugging Face; customizable training parameters  |
| Yahma Alpaca Cleaned                   | 51,760                                              | output, input, instruction       | Improved quality       | Need to create the text field; longer prompts; reduced noise; better performance     |
| AlpaGasus                              | ~9,000                                              |                                  | High-quality selection | Filtered using GPT-3.5-Turbo for scoring; no hosted dataset                          |
| WikiText-2                             | 36,718 (train), 3,760 (validation), 4,358 (test)    |                                  |                        |                                                                                      |
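The Tatsu-Lab variant already ships a ready-made text column; for yahma/alpaca-cleaned you have to assemble it yourself before handing the data to SFTTrainer. A minimal sketch, assuming the standard Alpaca prompt template (adapt the wording to your own setup):

from datasets import load_dataset

def build_text(example):
    # Standard Alpaca-style prompt; records without an input use the shorter form.
    if example["input"]:
        text = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}

alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train").map(build_text)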

| Model       | Size, Precision | Length | Dataset      | Batch | Liger | GPU       | DRAM        | Loss |
|-------------|-----------------|--------|--------------|-------|-------|-----------|-------------|------|
| Llama3.2-3B | 6.5GB, FP16     | 512    | Tatsu Alpaca | 1     | Y     | A100 40GB | 31GB        | 2.xx |
|             |                 |        |              | 1     | N     |           | 31.7GB      | 2.xx |
|             |                 |        |              | 4     | Y     |           | 31.9/31.2GB | 1.3  |
|             |                 |        |              | 4     | N     |           | 34.4GB      | 2.xx |
|             | use SFTTrainer  |        |              | 8     | Y     |           | 35/38GB     | 1.3  |
|             |                 |        |              | 8     | N     |           | OOM >40G    | 2.xx |
|             |                 |        |              | 16    | Y     |           | 40GB        | 2.xx |
|             |                 |        |              | 16    | N     |           | OOM >40G    | 2.xx |
|             |                 |        | WikiText-2   | 1     | Y     |           | 30.8GB      | ~0.5 |
|             |                 |        |              | 1     | N     |           | 31.5GB      | ~0.5 |
|             |                 |        |              | 2     | Y     |           | 31.1GB      | 0.3  |
|             |                 |        |              | 2     | N     |           | 33.4GB      | 0.3  |
|             |                 |        |              | 4     | Y     |           | 33.2GB      | 0.31 |
|             |                 |        |              | 4     | N     |           | 33.9GB      | 0.31 |
|             |                 |        |              | 8     | Y     |           | 32.7GB      | 0.3  |
|             |                 |        |              | 8     | N     |           | OOM >40G    |      |
|             |                 |        |              | 16    | Y     |           | OOM >40G    |      |
|             |                 |        |              | 16    | N     |           | OOM >40G    |      |
| Llama3.2-1B | 2.5GB, BF16     | 512    | WikiText-2   | 1     | Y     | A100 40GB | 12.1GB      | ~0.4 |
|             |                 |        |              | 1     | N     |           | 12.6GB      | ~0.4 |
|             |                 |        |              | 8     | Y     |           | 14.2GB      | 0.4  |
|             |                 |        |              | 8     | N     |           | 24.7GB      | 0.4  |
|             |                 |        |              | 16    | Y     |           | 19GB        | 0.35 |
|             |                 |        |              | 16    | N     |           | 38.7GB      | 0.35 |
|             |                 |        |              | 32    | Y     |           | 29.7GB      | 0.34 |
|             |                 |        |              | 32    | N     |           | OOM >40G    | 0.34 |
|             | use SFTTrainer  | 512    | Alpaca       | 8     | Y     | L4 23GB   | 17GB        | 1.4  |
|             |                 |        |              | 8     | N     |           | OOM >23G    |      |
|             | use SFTTrainer  | 512    | Tatsu Alpaca | 8     | Y     | 3060 12GB | 12GB        | 1.4  |
|             | use Trainer     |        | WikiText-2   | 8     | Y     | L4        | 14.2GB      | 0.35 |
|             |                 |        |              | -     | -     | 3060 12GB | 12GB        | 0.31 |
| gpt2 (120M) | 0.5GB, FP32     | 1024   | WikiText-2   | 8     | N     | 3060 12GB | 12GB        | 0.23 |
|             | pad left!       | 1024   | -            | -     | -     |           | 12GB        | 0.23 |
|             |                 | 512    | -            | -     | -     |           | 9.3GB       | 0.5  |
|             |                 | 300    | -            | -     | -     |           | 6.7GB       | 0.8  |


Source ChatGPT

To compute perplexity, WikiText-2 is commonly used because it is small and of good quality.

Comparison Table

| Feature            | WikiText-2              | WikiText-103            | enwik8                     |
|--------------------|-------------------------|-------------------------|----------------------------|
| Size               | ~2M tokens              | ~103M tokens            | ~100M characters           |
| Vocabulary Size    | ~33,000 tokens          | ~267,000 tokens         | N/A (raw character-level)  |
| Preprocessing      | Minimal                 | Minimal                 | None (includes raw text)   |
| Task Focus         | Word-level modeling     | Word-level modeling     | Character-level modeling   |
| Use Cases          | Small-scale experiments | Large-scale pretraining | Byte/character-level tasks |
| Computational Cost | Low                     | High                    | Moderate                   |

The WikiText-2, WikiText-103, and enwik8 datasets are commonly used for training and evaluating language models. Here’s a detailed comparison of their differences:


1. WikiText-2

  • Description:

    • A smaller version of the WikiText-103 dataset.
    • Contains high-quality, clean English text extracted from Wikipedia articles.
  • Characteristics:

    • Size:
      • Training: ~2 million tokens
      • Validation: ~217,000 tokens
      • Test: ~245,000 tokens
    • Vocabulary: ~33,000 unique tokens.
    • Designed for quick experimentation and smaller-scale model development.
  • Use Case:

    • Useful for testing new language model architectures or techniques without requiring significant computational resources.
  • Trade-offs:

    • Smaller corpus limits its usefulness for pretraining large language models.
    • Overfitting is a concern for larger models.

2. WikiText-103

  • Description:

    • A larger, more comprehensive version of the WikiText-2 dataset.
    • Extracted from English Wikipedia, with minimal preprocessing to maintain the natural structure of sentences and paragraphs.
  • Characteristics:

    • Size:
      • Training: ~103 million tokens
      • Validation: ~217,000 tokens
      • Test: ~245,000 tokens
    • Vocabulary: ~267,000 unique tokens.
    • Retains long-term dependencies by preserving full article structure.
    • Includes rare and less frequent words due to its larger size.
  • Use Case:

    • Suitable for training larger language models.
    • Useful for evaluating long-range dependency handling in language models.
  • Trade-offs:

    • Requires more computational resources compared to WikiText-2.
    • Slower for rapid prototyping.

3. enwik8

  • Description:

    • A dataset derived from the first 100 million characters of an English Wikipedia XML dump.
    • Focuses on character-level language modeling rather than token-based processing.
  • Characteristics:

    • Size:
      • Training: ~90 million characters
      • Validation: ~5 million characters
      • Test: ~5 million characters
    • Processed as raw text, meaning punctuation, HTML tags, and special characters are included.
    • Designed for character-level tasks, unlike WikiText which is word-level.
  • Use Case:

    • Character-level language model research and compression algorithms.
    • Ideal for exploring models with byte-level representations or subword tokenization.
  • Trade-offs:

    • Requires more steps to tokenize and preprocess compared to WikiText datasets.
    • May not be as suitable for word-level language modeling tasks.


Summary

  • WikiText-2: Best for quick experiments and smaller models.
  • WikiText-103: Preferred for pretraining or evaluating word-level models on long-range dependencies.
  • enwik8: Ideal for character-level tasks or byte-level processing research.

Here are snippets from the WikiText-2, WikiText-103, and Enwik8 datasets, illustrating the general characteristics of each:


WikiText-2

  • Format: Plain text, tokenized with spaces.
  • Style: Contains diverse topics, but focuses on structured sentences with proper grammar.
 = Valkyria Chronicles III = 

 Valkyria Chronicles III: Unrecorded Chronicles is a tactical role-playing video game developed by Sega and released for the PlayStation Portable in Japan. 

WikiText-103

  • Format: Plain text, similar to WikiText-2 but much larger in size (over 100 million tokens).
  • Style: Same format and structure as WikiText-2 but covering a much broader range of topics and depth.
 = Natural Language Processing = 

 Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Enwik8

  • Format: Raw ASCII text (character-level).
  • Style: Includes Wikipedia content but focuses on unprocessed text (e.g., no tokenization, retains all special characters, and formatted as characters).
<text xml:space="preserve" bytes="473000">British Isles

The British Isles are a group of islands off the north-western coast of continental Europe that include the islands of Great Britain, Ireland and over six thousand smaller isles. There are two sovereign states located on the islands: the United Kingdom of Great Britain and Northern Ireland, and Ireland. The British Isles also include the Crown Dependencies of the Isle of Man and the Channel Islands (which are considered part of the British Isles by the UK government).

</text>

Key Observations

  1. Tokenization:

    • WikiText-2 and WikiText-103: Tokenized at the word level, stored as plain text.
    • Enwik8: Character-level, raw ASCII format.
  2. Structure:

    • WikiText datasets are relatively clean and focus on proper Wikipedia articles.
    • Enwik8 is raw, retains formatting and metadata like <text> tags.
  3. Purpose:

    • WikiText-2: Suitable for quick experiments and small-scale language modeling.
    • WikiText-103: Large-scale language modeling with diverse and deep content.
    • Enwik8: Focuses on character-level language modeling.

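To peek at the WikiText datasets directly in Python, here is a minimal sketch using the Hugging Face datasets library (the row index is arbitrary; enwik8 is not loaded here because it is usually downloaded separately as a raw character dump, e.g. from the Hutter Prize page):

from datasets import load_dataset

# Word-level WikiText corpora hosted on the Hugging Face Hub
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

print(wikitext2[10]["text"])    # one raw line from WikiText-2
print(wikitext103[10]["text"])  # one raw line from WikiText-103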

Appendix

Hugging Face Perplexity Example

Compute perplexity with GPT-2 ($n_{ctx} = 1024$).

Here we use Hugging Face's GPT-2 and the WikiText dataset. Install the following packages first.

pip install transformers
pip install datasets
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda"
model_id = "gpt2-large"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

We evaluate PPL on the WikiText-2 dataset.

from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

The final ppl = 16.45.