This article runs the prompt lookup decode method on a Colab L4 GPU with 22GB DRAM, switching the model from Mistral-7B to the Microsoft Phi3-mini model.
It continues [[2024-08-19-Prompt_Lookup_Decode]] and focuses on the coding. There are three different levels of coding, summarized in the table below:
| | Orig PLD | HF Candidate | HF Xformers |
|---|---|---|---|
| Level | Lowest layer | Middle layer | Top layer |
| Call | find_candidate_pred_tokens | greedy_search_pld | model.generate |
| When to use | To change the retrieval method | To modify the speculative loop without changing retrieval | To use directly without modification |
| Purpose | Speed up | Robustness | Easiness |
Some observations and issues:
- The Transformers prompt_lookup_num_tokens option shows the best speed-up across the different GPUs.
- There is negligible difference between the original PLD code and the revised version that uses the CandidateGenerator class. Something may be off here and deserves a closer check.
- The Colab T4 GPU behaves strangely with both the original and revised versions, but behaves normally with the Transformers prompt_lookup_num_tokens version. The reason is unclear.
| GPU | PLD | Orig PLD, t/s | Speed up | Candidate Generator, t/s | Speed up | Xformers PLD, t/s | Speed up |
|---|---|---|---|---|---|---|---|
| Colab L4 | On | 33.9 | 2.2 | 33.6 | 2.2 | 33.6 | 2.3 |
| Colab L4 | Off | 15.2 | | 15.2 | | 14.3 | |
| Colab T4 | On | 11.1 | 0.7 | 14.3 | 1.0 | 33.2 | 2.2 |
| Colab T4 | Off | 15.6 | | 14.6 | | 14.9 | |
| RTX3060 | On | 24.7 | 2.0 | 26.4 | 2.0 | 31.8 | 2.5 |
| RTX3060 | Off | 12.4 | | 13.2 | | 12.8 | |
Prompt Lookup Decode Code Review
Both conceptually and in implementation, this is the simplest yet still effective approach. Because it is so simple, many model hubs have already integrated prompt lookup decode into their models.
Find_Candidate_Pred_Tokens
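The original code block did not survive formatting; below is a reconstruction of `find_candidate_pred_tokens` based on the step-by-step description that follows. Treat it as a sketch: the actual code in the PLD demo may differ in small details.

```python
import torch

@torch.no_grad()
def find_candidate_pred_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Search the prompt for its trailing n-gram and return the tokens that follow it."""
    input_length = input_ids.size(1)

    if max_ngram_size <= 0 or num_pred_tokens <= 0 or max_ngram_size > input_length:
        raise ValueError("Invalid max_ngram_size or num_pred_tokens")

    # Try the largest n-gram first, then back off to smaller ones.
    for ngram_size in range(max_ngram_size, 0, -1):
        # The n-gram to look up: the last `ngram_size` tokens of the prompt.
        ngram_tensor = input_ids[0, -ngram_size:]

        # All overlapping windows of length `ngram_size` over the prompt.
        windows = input_ids.unfold(dimension=1, size=ngram_size, step=1)

        # Boolean mask of windows that exactly match the n-gram.
        matches = (windows == ngram_tensor).all(dim=2)

        # Positions along the sequence where a match starts.
        match_indices = matches.nonzero(as_tuple=True)[1]

        for idx in match_indices:
            start_idx = idx + ngram_size
            end_idx = start_idx + num_pred_tokens
            # Keep the continuation inside the prompt and avoid matching the
            # trailing n-gram against itself.
            if end_idx <= input_length and start_idx < input_length - ngram_size:
                return input_ids[0, start_idx:end_idx]

    # No match at any n-gram size: return an empty candidate list.
    return torch.tensor([], dtype=torch.long, device=input_ids.device)
```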
This code defines a function called `find_candidate_pred_tokens` that searches a given sequence of token IDs (`input_ids`) for a specific sequence (an n-gram) and returns the tokens that follow the found sequence. The function is decorated with `@torch.no_grad()`, so PyTorch does not track operations for gradient computation, which saves memory and compute during inference.
Here is a breakdown of the code:
Function Purpose:
The purpose of `find_candidate_pred_tokens` is to identify a sequence of tokens (an n-gram) within `input_ids` and return the tokens that immediately follow it. The function tries to find the largest possible n-gram (up to `max_ngram_size`) and returns a set of predicted tokens (`num_pred_tokens`) that follow the identified n-gram.
Parameters:
- `input_ids`: A 2D tensor containing sequences of token IDs. The function assumes the first dimension is the batch (though it only works with a batch size of 1) and the second dimension is the sequence length.
- `max_ngram_size`: The maximum size of the n-gram to search for within `input_ids`. The function looks for sequences of this length first, then reduces the size until it finds a match.
- `num_pred_tokens`: The number of tokens to return after the n-gram is found.
Code Explanation:
1. Input validation:
   - The function first validates that `max_ngram_size` and `num_pred_tokens` are positive and that `max_ngram_size` is not larger than the length of `input_ids`. If any of these conditions is violated, the function raises a `ValueError`.
2. Main loop:
   - The function then enters a loop, iterating from `max_ngram_size` down to 1. This loop looks for the largest n-gram that matches a sequence in `input_ids`.
   - n-gram extraction: For each `ngram_size`, it takes the last `ngram_size` tokens of `input_ids` and converts them into a tensor (`ngram_tensor`).
   - Sliding windows: The function creates "sliding windows" of tokens of size `ngram_size` across `input_ids` using `input_ids.unfold(dimension=1, size=ngram_size, step=1)`. This generates overlapping windows (subsequences) of `input_ids`.
   - Match finding: It then checks which of these windows match the extracted `ngram_tensor`. The result is stored in `matches`, a tensor of boolean values.
   - Match indices: If matches are found, their positions are stored in `match_indices`.
3. Returning predicted tokens:
   - The function iterates over the found match indices to identify a valid continuation of the n-gram. It computes the starting and ending indices (`start_idx` and `end_idx`) of the tokens following the matched n-gram.
   - If the continuation falls within the bounds of `input_ids` and does not overlap with the n-gram itself, it returns the predicted tokens (from `start_idx` to `end_idx`).
4. No match case:
   - If no match is found after trying all n-gram sizes, the function returns an empty tensor.
Colors and Underline:
At the end of the code, some constants are defined for color codes and underlining (e.g., `COLORS` and `UNDERLINE`). These are not used in the function and appear to be for formatting terminal text; they are not relevant to `find_candidate_pred_tokens` itself.
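A quick illustrative call with made-up token IDs (not real vocabulary IDs), using the function sketched above, just to show the input and output shapes:

```python
import torch

# A toy prompt where the trailing bigram (5, 6) also appears earlier,
# followed by 7, 8, 9 -- those become the candidate tokens.
prompt = torch.tensor([[1, 2, 5, 6, 7, 8, 9, 3, 4, 5, 6]])

candidates = find_candidate_pred_tokens(prompt, max_ngram_size=2, num_pred_tokens=3)
print(candidates)  # tensor([7, 8, 9])
```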
Greedy_Search_Pld Code
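The original method body did not survive formatting. The sketch below reproduces only the core draft-verify loop described underneath, assuming `find_candidate_pred_tokens` from the previous section is available; it drops the stopping-criteria, streamer, color-printing, and dict-output plumbing of the real `greedy_search_pld`, and it recomputes the full sequence each step instead of using a KV cache.

```python
import torch

@torch.no_grad()
def greedy_search_pld_sketch(model, input_ids, max_new_tokens=512,
                             draft_matching_window_size=3,
                             draft_num_candidate_tokens=10,
                             eos_token_id=None):
    """Simplified prompt-lookup greedy decoding loop (core idea only)."""
    generated = 0
    while generated < max_new_tokens:
        cur_len = input_ids.shape[1]

        # 1. Draft candidate tokens by looking the trailing n-gram up in the prompt.
        candidates = find_candidate_pred_tokens(
            input_ids, draft_matching_window_size, draft_num_candidate_tokens)
        if candidates.numel() == 0:
            # No lookup hit: fall back to a single dummy draft token (ID 100).
            candidates = torch.tensor([100], dtype=torch.long, device=input_ids.device)

        # 2. Run the model once on prompt + draft tokens.
        extended = torch.cat([input_ids, candidates.unsqueeze(0)], dim=1)
        logits = model(extended).logits

        # 3. Greedy pick at every position that can verify a draft token.
        selected = logits[:, cur_len - 1:, :].argmax(dim=-1)  # (1, len(candidates) + 1)

        # 4. Count how many draft tokens agree with the model's own greedy choices.
        mismatch = (selected[0, :-1] != candidates).nonzero(as_tuple=True)[0]
        n_matches = mismatch[0].item() if mismatch.numel() > 0 else candidates.numel()

        # 5. Accept the matched draft tokens plus one freshly generated token.
        valid_tokens = selected[:, : n_matches + 1]
        input_ids = torch.cat([input_ids, valid_tokens], dim=1)
        generated += valid_tokens.shape[1]

        # 6. Stop on end-of-sequence.
        if eos_token_id is not None and (valid_tokens == eos_token_id).any():
            break
    return input_ids
```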
This code defines a method called `greedy_search_pld` for generating text with a customized greedy search strategy. The method incorporates a draft prediction mechanism: it tries to match candidate tokens against the model's own predictions and prints the output in a color-coded format for visualization.
Parameters:
- `input_ids`: The sequence of token IDs representing the input prompt.
- `logits_processor`: (Optional) A list of processors that modify the logits (e.g., filtering certain tokens).
- `stopping_criteria`: (Optional) A list of criteria that determine when generation should stop.
- `max_length`: (Optional) The maximum length of the generated sequence.
- `pad_token_id`: (Optional) The ID of the padding token.
- `eos_token_id`: (Optional) The ID of the end-of-sequence token; it can be a single ID or a list of IDs.
- `output_attentions`: (Optional) Whether to return attention scores.
- `output_hidden_states`: (Optional) Whether to return hidden states.
- `output_scores`: (Optional) Whether to return the scores of the predictions.
- `return_dict_in_generate`: (Optional) Whether to return a dictionary with detailed generation information.
- `synced_gpus`: Whether to synchronize GPUs during generation (for multi-GPU setups).
- `streamer`: (Optional) A streamer for streaming token outputs during generation.
- `draft_matching_window_size`: The window size used for matching candidate tokens (n-grams).
- `draft_num_candidate_tokens`: The number of tokens to consider when predicting the next tokens.
- `print_output`: A boolean indicating whether to print the generated output with color coding.
- `**model_kwargs`: Additional arguments for the model.
Main Steps:
1. Initialization:
   - The method first initializes `stopping_criteria`, `pad_token_id`, and `eos_token_id` from either the provided values or the model's configuration.
   - If provided, `eos_token_id` is converted to a tensor.
2. Token generation loop:
   - The method enters a loop in which it generates tokens one at a time, or in small groups, until the stopping criteria are met.
   - The `find_candidate_pred_tokens` function searches the input (`input_ids`) for token sequences and predicts the next possible tokens (`candidate_pred_tokens`).
   - If no match is found, it falls back to predicting the token with ID `100`.
3. Prepare candidate tokens:
   - The candidate tokens are appended to `input_ids`, and the model is run on this extended sequence to produce the next-token logits (`new_logits`).
   - The most likely tokens (`selected_tokens`) are then selected from the logits.
4. Matching predicted tokens:
   - The method compares `selected_tokens` with the candidate tokens and determines how many of them match (`n_matches`).
   - Only the valid matching tokens (`valid_tokens`) are appended to `input_ids`.
5. Print the output:
   - If `print_output` is enabled, the method prints the newly generated text in color-coded segments; each newly generated segment gets a different color for visualization.
6. Stopping conditions:
   - The loop checks whether the generated tokens include the `eos_token_id` or whether the stopping criteria are satisfied. If either condition is true, the loop breaks and generation stops.
7. Return the result:
   - If `return_dict_in_generate` is `True`, the method returns a dictionary (`GreedySearchDecoderOnlyOutput`) containing the generated sequence and any requested additional data (e.g., scores).
   - Otherwise, it returns just the generated sequence (`input_ids`).
Key Features:
- Greedy search: The method always selects the most likely next token at each step.
- Draft prediction: It predicts and verifies draft tokens before committing them.
- Color-coded output: The generated text is printed in different colors so the newly generated segments are easy to see.
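How the method is attached to the model is not shown here. As an illustration only, one common pattern is to bind it onto the loaded model and call it directly; the binding style and argument values below are assumptions (assuming `model`, `tokenizer`, and `input_ids` are already defined), not the original notebook's code:

```python
import types

# Bind the custom decoding loop onto the loaded model (illustrative pattern only).
model.greedy_search_pld = types.MethodType(greedy_search_pld, model)

output_ids = model.greedy_search_pld(
    input_ids,
    max_length=512,
    draft_matching_window_size=3,
    draft_num_candidate_tokens=10,
    print_output=True,  # color-coded visualization of accepted draft segments
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```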
CandidateGenerator Class
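The revised CandidateGenerator version is not reproduced here. As a rough sketch only: recent transformers versions expose a `CandidateGenerator` base class (and a built-in `PromptLookupCandidateGenerator`, which is what the `prompt_lookup_num_tokens` option uses) in `transformers.generation.candidate_generator`, so a prompt-lookup drafter can be expressed as a small subclass around `find_candidate_pred_tokens`. The class below is my illustration, not the actual revised code, and the base-class API may differ across transformers versions.

```python
import torch
from transformers.generation.candidate_generator import CandidateGenerator

class PLDCandidateGenerator(CandidateGenerator):
    """Sketch: wrap find_candidate_pred_tokens as a transformers candidate generator."""

    def __init__(self, max_ngram_size=3, num_pred_tokens=10):
        self.max_ngram_size = max_ngram_size
        self.num_pred_tokens = num_pred_tokens

    def get_candidates(self, input_ids):
        # Draft tokens are looked up in the prompt itself; no draft logits are produced.
        draft = find_candidate_pred_tokens(input_ids, self.max_ngram_size, self.num_pred_tokens)
        candidate_ids = torch.cat([input_ids, draft.unsqueeze(0).to(input_ids.device)], dim=1)
        return candidate_ids, None

    def update_candidate_strategy(self, input_ids, scores, num_matches):
        # Fixed lookup strategy: nothing to adapt between steps.
        pass
```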
HuggingFace Transformer
HuggingFace has already integrated prompt lookup decode as an option in transformers: `prompt_lookup_num_tokens`. It is off by default.
```python
generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512,
                                   streamer=streamer, prompt_lookup_num_tokens=10)
```
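For completeness, here is a minimal end-to-end sketch of this top-level usage. The checkpoint name `microsoft/Phi-3-mini-4k-instruct`, the dtype, and the prompt are my assumptions for illustration; only `prompt_lookup_num_tokens` is specific to prompt lookup decoding.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Phi3-mini checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda")

# Prompt lookup decoding pays off when the output reuses spans of the prompt,
# e.g. summarization or QA over a long context.
article = "..."  # put the long context to summarize here
prompt = "Summarize the following article:\n" + article

input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)

generation_output = model.generate(
    **input_ids,
    do_sample=False,
    max_new_tokens=512,
    streamer=streamer,
    prompt_lookup_num_tokens=10,  # enable prompt lookup decoding; off when not set
)
```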
Reference
- X: Prompt Lookup Decode demo: https://twitter.com/joao_gante/status/1747322413006643259
- HuggingFace: Generation strategies: https://huggingface.co/docs/transformers/generation_strategies