Introduction

This post covers two things:

  • Datasets: Alpaca and WikiText-2
  • Training: how to avoid OOM (Out-Of-Memory) by using Liger or Unsloth

Alpaca Dataset

Purpose, origin, and related information about the Alpaca dataset


Purpose of the Alpaca dataset

  1. Fine-tuning the instruction-following ability of large language models (LLMs)

    • The dataset's main purpose is to improve a language model's ability to follow human instructions.
    • It trains the model to produce outputs that fit the context, stay relevant to the task, and respond well to the instruction.
  2. Low-cost dataset creation

    • The Alpaca dataset was created as an affordable alternative to proprietary instruction datasets.
    • Its creators used a generative model (OpenAI's GPT) to show how a high-quality fine-tuning dataset can be built at low cost.
  3. Education and research

    • The dataset is widely used in research, experiments, and open-source projects.
    • The AI community often uses it to fine-tune models such as LLaMA (Meta AI's large language model).

Origin of the Alpaca dataset

  1. Developed at Stanford University
    • The Alpaca dataset was first released by researchers at Stanford University.
    • It was created to fine-tune LLaMA, the foundation model developed by Meta AI.
  2. Creation method
    • The researchers generated a synthetic dataset with OpenAI's text-davinci-003 (GPT-3.5) model.
    • Seed instructions: 175 manually written instructions served as the starting point.
    • Augmentation: GPT-3.5 expanded these seeds into 52,000 unique instruction-response pairs by rephrasing, adding diversity, and ensuring coverage.
  3. Release and accessibility
    • The dataset was released publicly as part of the research to encourage training of instruction-following models.
    • It is openly available, but because it relies on outputs from an OpenAI model, the relevant licensing terms apply.

Dataset structure

The Alpaca dataset contains the following fields (see the example record after this list):

  • Instruction: describes the task to perform (e.g., "Summarize the following article").
  • Input: an optional input that provides extra context (e.g., the article to summarize).
  • Output: the expected response to the instruction.
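For illustration, a single record looks roughly like this (a made-up example that just follows the field definitions above):

example = {
    "instruction": "Summarize the following article.",          # task description
    "input": "The full text of the article to summarize ...",   # optional context
    "output": "A short summary of the article ...",             # expected response
}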

Related datasets and projects

  1. Stanford Alpaca compared with other datasets

    • Unlike OpenAI's proprietary instruction datasets, Alpaca is synthetically generated and publicly available.
    • Other datasets such as FLAN (Fine-tuned LAnguage Net) and T0 (Trained on Tasks) also target instruction tuning, but their methods and sources differ.
  2. Derived variants

    • Yahma's Alpaca-Cleaned: a cleaned-up version of the original Alpaca dataset, focused on removing noise and inconsistencies.
    • AlpaGasus: a filtered variant that uses a GPT model to score the examples and keep only the high-quality ones (see the comparison table below).

Significance in AI research

  1. Open-source innovation

    • The dataset showed how instruction tuning can be done efficiently and cheaply.
    • It inspired many open-source fine-tuned models, such as Alpaca-LLaMA.
  2. Ethical concerns

    • Using GPT-generated data raises copyright and intellectual-property questions.
    • Researchers must make sure they comply with the licensing terms when using these datasets.
  3. Educational use

    • The Alpaca dataset is frequently used in teaching to demonstrate fine-tuning and instruction-following tasks.

Further reading

  • The official Stanford Alpaca blog post: describes the dataset's creation method and goals.
  • Hugging Face: hosts many Alpaca-based models and derived datasets.
  • Meta AI's LLaMA model: background on the base model being fine-tuned.

How to write a good paper: https://www.youtube.com/watch?v=1hEI_yIkIl0&ab_channel=Tunadorable

Note: be sure to set max_seq_length in the tokenizer function, and nowhere else. Otherwise the model's default max_length, which equals the position-encoding size, is used, and both the GPU memory usage and the loss come out wrong!

Can you skip the labels column when using the Hugging Face Trainer class? No: the Trainer needs labels to compute the loss. Do you need a labels column if you compute the loss directly from the logits yourself? No, in that case you can derive the targets from input_ids.

tokenizer: does it matter whether padding and truncation are set to the left or the right? It made no difference in these runs.

def tokenize_function(examples):
    # max_seq_length must be set here (see the note above); otherwise the
    # model default (the position-encoding size) is used.
    tokenized_examples = tokenizer(examples['text'], truncation=True,
                                   padding='max_length', max_length=max_seq_length,
                                   return_tensors='pt')
    # Causal LM: labels are a copy of input_ids (the model shifts them internally).
    tokenized_examples['labels'] = tokenized_examples['input_ids'].clone()
    return tokenized_examples

tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True
)
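One caveat: with labels = input_ids.clone(), the padding tokens also count toward the loss. If you prefer to exclude them, a minimal sketch of a variant (not what the runs below used) sets padded label positions to -100, which the cross-entropy loss inside Hugging Face causal LM models ignores:

def tokenize_function_ignore_pad(examples):
    tokenized = tokenizer(examples['text'], truncation=True,
                          padding='max_length', max_length=max_seq_length)
    # Positions where attention_mask == 0 are padding; label -100 is skipped by the loss.
    tokenized['labels'] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized['input_ids'], tokenized['attention_mask'])
    ]
    return tokenized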

Conclusions on Liger's memory reduction

  • Liger gives only a small gain (<1 GB) at a fixed batch size, yet it lets Llama3.2-3B handle a batch size (at length 512) of 8 with Liger instead of 4 without. I'm not sure whether something in my setup is off.
  • Liger uses Triton and defaults to the GPU automatically (is that a Triton property or a Liger library choice?). Debugging on the CPU raises errors!
  • Avoid batch = 1 or 2: the initial loss is too large and convergence is slow.
  • batch = 8 is more efficient than batch = 4; this seems to be an NVIDIA GPU characteristic.
  • The WikiText-2 dataset is more stable. It is about 10x larger than Shakespeare but smaller than Alpaca. Alpaca would not converge stably; probably a configuration issue?
  • Alpaca's loss is about 1.3 (its self-entropy).

With a small model (1B) and a large batch (8 or more), the gain looks quite good: about a 2x difference! A minimal example of enabling Liger follows.
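A minimal sketch of enabling Liger when loading the model, assuming the liger-kernel package's AutoLigerKernelForCausalLM wrapper and the Llama-3.2-3B checkpoint used above (check the Liger-Kernel README for the exact API of your version):

import torch
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Drop-in replacement for AutoModelForCausalLM that patches the model with
# Liger's Triton kernels before loading the weights.
model = AutoLigerKernelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",   # model id assumed for illustration
    torch_dtype=torch.float16,
    device_map="cuda",           # Triton kernels need a GPU; CPU debugging fails
)

Depending on your transformers/TRL version, a use_liger_kernel training argument may also be available as an alternative way to apply the same patching.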

Dataset observations

  • WikiText-2: simpler text with lower self-entropy, so the loss can reach 0.3.
  • Tatsu-lab/alpaca: more complex, and possibly multilingual, so its self-entropy is higher; the loss sits around 1.3-1.4.
  • Yahma/alpaca-cleaned: a cleaned-up version of Alpaca, so its self-entropy is lower and it is easier to fit; the loss is around 1.2. (These losses map to perplexities as sketched below.)
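Since these losses are mean cross-entropies in nats, they convert to perplexity exactly as in the appendix:

$\mathrm{PPL} = e^{\mathcal{L}}: \quad e^{0.3} \approx 1.35, \qquad e^{1.2} \approx 3.32, \qquad e^{1.3} \approx 3.67$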

Differences Between Various Alpaca Datasets

| Feature       | Stanford Alpaca                              | Tatsu-Lab Alpaca                  | Yahma Alpaca-Cleaned   |
|---------------|----------------------------------------------|-----------------------------------|------------------------|
| Source        | Stanford team, generated by text-davinci-003 | Derived from Stanford             | Derived from Tatsu-Lab |
| Size          | ~52k                                         | ~52k                              | ~51.8k                 |
| Cleaning      | Minimal (few checks)                         | Minimal (some fixes)              | Comprehensive          |
| Format Issues | May have duplicates or errors                | Contains minor issues             | Fixed                  |
| Fields        | instruction, input, output                   | Same as Stanford                  | Same as Stanford       |
| Enhancements  | Basic instruction tuning                     | Slight improvements               | Thorough cleanup       |
| Usage         | General fine-tuning                          | Fine-tuning with minimal cleaning | Robust fine-tuning     |

Comparison with Other Alpaca Datasets

| Dataset                                | num_rows                                            | features                         | Quality Focus          | Notable Features                                                                     |
|----------------------------------------|-----------------------------------------------------|----------------------------------|------------------------|--------------------------------------------------------------------------------------|
| Stanford Alpaca                        | 52,002                                              | output, input, instruction       | General diversity      | Generated with text-davinci-003; varied quality                                      |
| Tatsu Lab Alpaca (closest to Stanford) | 52,002                                              | output, input, instruction, text | Instruction-focused    | Structured for easy integration with Hugging Face; customizable training parameters  |
| Yahma Alpaca Cleaned                   | 51,760                                              | output, input, instruction       | Improved quality       | Need to create the text field; longer prompts; reduced noise; better performance     |
| AlpaGasus                              | ~9,000                                              |                                  | High-quality selection | Filtered using GPT-3.5-Turbo for scoring; no hosted dataset                          |
| WikiText-2                             | 36,718 (train), 3,760 (validation), 4,358 (test)    |                                  |                        |                                                                                      |
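The Tatsu-Lab variant already ships a ready-made text column; for yahma/alpaca-cleaned you have to assemble it yourself before handing the data to SFTTrainer. A minimal sketch, assuming the standard Alpaca prompt template (adapt the wording to your own setup):

from datasets import load_dataset

def build_text(example):
    # Standard Alpaca-style prompt; records without an input use the shorter form.
    if example["input"]:
        text = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}

alpaca_cleaned = load_dataset("yahma/alpaca-cleaned", split="train").map(build_text)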

| Model       | Size, Precision | Length | Dataset      | Batch | Liger | GPU       | DRAM        | Loss |
|-------------|-----------------|--------|--------------|-------|-------|-----------|-------------|------|
| Llama3.2-3B | 6.5GB, FP16     | 512    | Tatsu Alpaca | 1     | Y     | A100 40GB | 31GB        | 2.xx |
|             |                 |        |              | 1     | N     |           | 31.7GB      | 2.xx |
|             |                 |        |              | 4     | Y     |           | 31.9/31.2GB | 1.3  |
|             |                 |        |              | 4     | N     |           | 34.4GB      | 2.xx |
|             | use SFTTrainer  |        |              | 8     | Y     |           | 35/38GB     | 1.3  |
|             |                 |        |              | 8     | N     |           | OOM >40G    | 2.xx |
|             |                 |        |              | 16    | Y     |           | 40GB        | 2.xx |
|             |                 |        |              | 16    | N     |           | OOM >40G    | 2.xx |
|             |                 |        | WikiText-2   | 1     | Y     |           | 30.8GB      | ~0.5 |
|             |                 |        |              | 1     | N     |           | 31.5GB      | ~0.5 |
|             |                 |        |              | 2     | Y     |           | 31.1GB      | 0.3  |
|             |                 |        |              | 2     | N     |           | 33.4GB      | 0.3  |
|             |                 |        |              | 4     | Y     |           | 33.2GB      | 0.31 |
|             |                 |        |              | 4     | N     |           | 33.9GB      | 0.31 |
|             |                 |        |              | 8     | Y     |           | 32.7GB      | 0.3  |
|             |                 |        |              | 8     | N     |           | OOM >40G    |      |
|             |                 |        |              | 16    | Y     |           | OOM >40G    |      |
|             |                 |        |              | 16    | N     |           | OOM >40G    |      |
| Llama3.2-1B | 2.5GB, BF16     | 512    | WikiText-2   | 1     | Y     | A100 40GB | 12.1GB      | ~0.4 |
|             |                 |        |              | 1     | N     |           | 12.6GB      | ~0.4 |
|             |                 |        |              | 8     | Y     |           | 14.2GB      | 0.4  |
|             |                 |        |              | 8     | N     |           | 24.7GB      | 0.4  |
|             |                 |        |              | 16    | Y     |           | 19GB        | 0.35 |
|             |                 |        |              | 16    | N     |           | 38.7GB      | 0.35 |
|             |                 |        |              | 32    | Y     |           | 29.7GB      | 0.34 |
|             |                 |        |              | 32    | N     |           | OOM >40G    | 0.34 |
|             | use SFTTrainer  | 512    | Alpaca       | 8     | Y     | L4 23GB   | 17GB        | 1.4  |
|             |                 |        |              | 8     | N     |           | OOM >23G    |      |
|             | use SFTTrainer  | 512    | Tatsu Alpaca | 8     | Y     | 3060 12GB | 12GB        | 1.4  |
|             | use Trainer     |        | WikiText-2   | 8     | Y     | L4        | 14.2GB      | 0.35 |
|             |                 |        |              | -     | -     | 3060 12GB | 12GB        | 0.31 |
| gpt2 (120M) | 0.5GB, FP32     | 1024   | WikiText-2   | 8     | N     | 3060 12GB | 12GB        | 0.23 |
|             | pad left!       | 1024   | -            | -     | -     |           | 12GB        | 0.23 |
|             |                 | 512    | -            | -     | -     |           | 9.3GB       | 0.5  |
|             |                 | 300    | -            | -     | -     |           | 6.7GB       | 0.8  |


Source ChatGPT

To compute perplexity, WikiText-2 is commonly used because it is small and of good quality.

Comparison Table

| Feature            | WikiText-2              | WikiText-103            | enwik8                     |
|--------------------|-------------------------|-------------------------|----------------------------|
| Size               | ~2M tokens              | ~103M tokens            | ~100M characters           |
| Vocabulary Size    | ~33,000 tokens          | ~267,000 tokens         | N/A (raw character-level)  |
| Preprocessing      | Minimal                 | Minimal                 | None (includes raw text)   |
| Task Focus         | Word-level modeling     | Word-level modeling     | Character-level modeling   |
| Use Cases          | Small-scale experiments | Large-scale pretraining | Byte/character-level tasks |
| Computational Cost | Low                     | High                    | Moderate                   |

The WikiText-2, WikiText-103, and enwik8 datasets are commonly used for training and evaluating language models. Here’s a detailed comparison of their differences:


1. WikiText-2

  • Description:

    • A smaller version of the WikiText-103 dataset.
    • Contains high-quality, clean English text extracted from Wikipedia articles.
  • Characteristics:

    • Size:
      • Training: ~2 million tokens
      • Validation: ~217,000 tokens
      • Test: ~245,000 tokens
    • Vocabulary: ~33,000 unique tokens.
    • Designed for quick experimentation and smaller-scale model development.
  • Use Case:

    • Useful for testing new language model architectures or techniques without requiring significant computational resources.
  • Trade-offs:

    • Smaller corpus limits its usefulness for pretraining large language models.
    • Overfitting is a concern for larger models.

2. WikiText-103

  • Description:

    • A larger, more comprehensive version of the WikiText-2 dataset.
    • Extracted from English Wikipedia, with minimal preprocessing to maintain the natural structure of sentences and paragraphs.
  • Characteristics:

    • Size:
      • Training: ~103 million tokens
      • Validation: ~217,000 tokens
      • Test: ~245,000 tokens
    • Vocabulary: ~267,000 unique tokens.
    • Retains long-term dependencies by preserving full article structure.
    • Includes rare and less frequent words due to its larger size.
  • Use Case:

    • Suitable for training larger language models.
    • Useful for evaluating long-range dependency handling in language models.
  • Trade-offs:

    • Requires more computational resources compared to WikiText-2.
    • Slower for rapid prototyping.

3. enwik8

  • Description:

    • A dataset derived from the first 100 million characters of an English Wikipedia XML dump.
    • Focuses on character-level language modeling rather than token-based processing.
  • Characteristics:

    • Size:
      • Training: ~90 million characters
      • Validation: ~5 million characters
      • Test: ~5 million characters
    • Processed as raw text, meaning punctuation, HTML tags, and special characters are included.
    • Designed for character-level tasks, unlike WikiText which is word-level.
  • Use Case:

    • Character-level language model research and compression algorithms.
    • Ideal for exploring models with byte-level representations or subword tokenization.
  • Trade-offs:

    • Requires more steps to tokenize and preprocess compared to WikiText datasets.
    • May not be as suitable for word-level language modeling tasks.


Summary

  • WikiText-2: Best for quick experiments and smaller models.
  • WikiText-103: Preferred for pretraining or evaluating word-level models on long-range dependencies.
  • enwik8: Ideal for character-level tasks or byte-level processing research.

Here are snippets from the WikiText-2, WikiText-103, and Enwik8 datasets, illustrating the general characteristics of each:


WikiText-2

  • Format: Plain text, tokenized with spaces.
  • Style: Contains diverse topics, but focuses on structured sentences with proper grammar.
 = Valkyria Chronicles III = 

 Valkyria Chronicles III: Unrecorded Chronicles is a tactical role-playing video game developed by Sega and released for the PlayStation Portable in Japan. 

WikiText-103

  • Format: Plain text, similar to WikiText-2 but much larger in size (over 100 million tokens).
  • Style: Same format and structure as WikiText-2 but covering a much broader range of topics and depth.
 = Natural Language Processing = 

 Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Enwik8

  • Format: Raw ASCII text (character-level).
  • Style: Includes Wikipedia content but focuses on unprocessed text (e.g., no tokenization, retains all special characters, and formatted as characters).
<text xml:space="preserve" bytes="473000">British Isles

The British Isles are a group of islands off the north-western coast of continental Europe that include the islands of Great Britain, Ireland and over six thousand smaller isles. There are two sovereign states located on the islands: the United Kingdom of Great Britain and Northern Ireland, and Ireland. The British Isles also include the Crown Dependencies of the Isle of Man and the Channel Islands (which are considered part of the British Isles by the UK government).

</text>

Key Observations

  1. Tokenization:

    • WikiText-2 and WikiText-103: Tokenized at the word level, stored as plain text.
    • Enwik8: Character-level, raw ASCII format.
  2. Structure:

    • WikiText datasets are relatively clean and focus on proper Wikipedia articles.
    • Enwik8 is raw, retains formatting and metadata like <text> tags.
  3. Purpose:

    • WikiText-2: Suitable for quick experiments and small-scale language modeling.
    • WikiText-103: Large-scale language modeling with diverse and deep content.
    • Enwik8: Focuses on character-level language modeling.

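To peek at the WikiText datasets directly in Python, here is a minimal sketch using the Hugging Face datasets library (the row index is arbitrary; enwik8 is not loaded here because it is usually downloaded separately as a raw character dump, e.g. from the Hutter Prize page):

from datasets import load_dataset

# Word-level WikiText corpora hosted on the Hugging Face Hub
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

print(wikitext2[10]["text"])    # one raw line from WikiText-2
print(wikitext103[10]["text"])  # one raw line from WikiText-103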

Appendix

Hugging Face Perplexity Example

Compute perplexity with GPT-2 ($n_{ctx} = 1024$).

Here we use Hugging Face's GPT-2 and the WikiText dataset. Install the following packages first.

pip install transformers
pip install datasets
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda"
model_id = "gpt2-large"
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

We evaluate PPL on the WikiText-2 dataset.

from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

The final ppl = 16.45.