The biggest speed bottlenecks for an LLM are its memory footprint and the memory-bandwidth wall it hits in generation (decode) mode. There are many ways to speed this up, but they all boil down to the same idea: turn the autoregressive, sequential decode into a parallel decode (verification). A minimal sketch of this shared propose-then-verify pattern follows the list below. The common approaches today:

- Speculative decode: use a small (draft) model together with the large (native) model to accelerate decoding [[2023-12-04-Speculative_Decode blog]].
- Medusa decode: use the predictions from multiple extra decoding heads as the draft to accelerate decoding [[2023-12-10-Medusa_Memory blog]].
- Lookahead decode: use iterative methods for solving systems of equations (Jacobi or GS-Jacobi) to accelerate decoding [[2023-12-04-Lookahead_Decode blog]].
- Retrieval decode: exploit the n-gram structure of a static or dynamic "word bank" to propose draft tokens without generating them with a model, e.g. Prompt-Lookup-Decode and REST.
- Encode (prompt, parallel) + decode (generative): use hints from the parallel prompt (prefill) mode to help the decode.
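To make the shared structure concrete, here is a minimal, framework-agnostic sketch of the propose-then-verify loop. This is my own illustration, not code from any of the libraries above: `greedy_next` and `propose_fn` are hypothetical callables standing in for the target model's parallel forward pass and for whichever draft mechanism is used (draft model, Medusa heads, Jacobi iteration, or n-gram lookup).

```python
def generate_with_verification(greedy_next, propose_fn, prompt_ids, max_new_tokens):
    """Sketch of parallel-verification decoding. Assumed interfaces:
      - greedy_next(ids): the target model's greedy next-token prediction for
        every position of `ids`, computed in ONE forward pass.
      - propose_fn(ids): a short list of cheap draft tokens (draft model,
        extra heads, Jacobi iteration, or n-gram lookup in the prompt).
    """
    ids = list(prompt_ids)
    start = len(ids)
    while len(ids) - start < max_new_tokens:
        draft = propose_fn(ids)
        pred = greedy_next(ids + draft)          # verify all drafts at once
        accepted = 0
        for i, tok in enumerate(draft):          # keep the longest matching prefix
            if pred[len(ids) - 1 + i] == tok:
                accepted += 1
            else:
                break
        ids += draft[:accepted]
        ids.append(pred[len(ids) - 1])           # one corrected token is always free
    return ids
```

The output is identical to plain greedy decoding; the speedup comes from the verification pass scoring several positions per memory-bound forward pass instead of only one.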
This post runs the Mistral-7B model on a Colab L4 GPU with 22 GB of DRAM.
Prompt Lookup Decode
Both conceptually and in implementation, this is the simplest yet still effective method. Because it is so simple, many model hubs have already integrated prompt lookup decode into their models.
HuggingFace Transformer
HuggingFace has already integrated prompt lookup decode into transformers as a `generate()` option, `prompt_lookup_num_tokens`, which is off by default:

`generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer, prompt_lookup_num_tokens=10)`
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_name_or_path = "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ"

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

# `article` is assumed to be defined earlier and to hold the raw text to summarize
chat = [
    {
        "role": "system",
        "content": "You are Hermes 2, an unbiased AI assistant, and you always answer to the best of your ability."
    },
    {
        "role": "user",
        "content": (
            "You are given a partial and unparsed scientific article, please read it carefully and complete the "
            f"request below.{article}Please summarize the article in 5 sentences."
        )
    },
]
processed_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(processed_chat, return_tensors='pt').to(model.device)

streamer = TextStreamer(tokenizer)

# CUDA events to time generation; peak memory stats give the memory numbers below.
# Each generate() variant is measured in its own run.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
torch.cuda.reset_peak_memory_stats(model.device)
torch.cuda.synchronize()
start_event.record()

# baseline without prompt lookup decode
generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer)

# with prompt lookup decode (swap in this call instead of the baseline to compare)
# generation_output = model.generate(**input_ids, do_sample=False, max_new_tokens=512, streamer=streamer, prompt_lookup_num_tokens=10)

end_event.record()
torch.cuda.synchronize()

max_memory = torch.cuda.max_memory_allocated(model.device)
print("Max memory (MB): ", max_memory * 1e-6)
new_tokens = generation_output.shape[1] - input_ids.input_ids.shape[1]
print("Throughput (tokens/sec): ", new_tokens / (start_event.elapsed_time(end_event) * 1.0e-3))
```
Model Size and GPU Configuration
- Mistral-7B: 4.15 GB (INT4/AWQ precision)
- L4 GPU: 22.5 GB VRAM
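As a back-of-the-envelope check on the 4.15 GB figure (my own estimate, not from the model card): a roughly 7.2B-parameter model at 4 bits per weight packs into about 3.6 GB, and the AWQ group scales/zero-points plus the unquantized fp16 layers plausibly account for the rest.

```python
params = 7.24e9                        # approximate Mistral-7B parameter count
packed_int4_gb = params * 4 / 8 / 1e9  # 4 bits per weight
print(f"packed INT4 weights: {packed_int4_gb:.2f} GB")   # ~3.62 GB
# quantization metadata + unquantized fp16 layers bring the total to ~4.15 GB
```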
| | w/o PLD | w/ PLD | Comment |
|---|---|---|---|
| Max Memory (MB) | 6386 | 6389 | essentially the same |
| Throughput (tokens/sec) | 10.56 | 26.44 | ~2.5X faster with PLD |
Llama.cpp
https://github.com/ggerganov/llama.cpp/tree/master/examples/lookup
Principle and Implementation
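The idea behind prompt lookup decoding, which both the transformers option above and llama.cpp's lookup example build on, is that in tasks such as summarization or document QA the continuation frequently repeats n-grams that already appear in the prompt. Below is a minimal pure-Python sketch of the candidate search, assuming token ids are plain lists; the real implementations operate on tensors and typically try longer n-grams before shorter ones.

```python
def find_candidate_tokens(input_ids, ngram_size=3, num_pred_tokens=10):
    """Look up the most recent `ngram_size` tokens earlier in the context and,
    if that n-gram re-occurs, return the tokens that followed it as draft
    candidates for a single verification forward pass."""
    tail = input_ids[-ngram_size:]
    for start in range(len(input_ids) - ngram_size):   # earlier occurrences only
        if input_ids[start:start + ngram_size] == tail:
            follow = input_ids[start + ngram_size:start + ngram_size + num_pred_tokens]
            if follow:
                return follow
    return []   # no match: fall back to ordinary one-token-at-a-time decoding
```

The drafts cost no extra model forward passes, which is why memory use stays flat while throughput improves whenever the match rate is high.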
Reference
- X: Prompt Lookup Decode demo: https://twitter.com/joao_gante/status/1747322413006643259
- HuggingFace generation strategies documentation: https://huggingface.co/docs/transformers/generation_strategies