Llama Quantization

Source

【transformers】Llama 量化-bitsandbytes - 知乎 (zhihu.com)

Llama Quantization

FP16 -> INT8 or FP8 -> INT4 or FP4
Attention vs. FFN

Attention 已經是必備的 core network. 相較於 CNN, attention 最大的問題是 memory bandwidth.

主要在計算 K, Q 的 correlation, 以及 softmax. 以下是 GPT1/2/3 的參數。

下圖應該畫錯了！ GPT 應該是 decoder only (右邊)。所以對應的方塊圖是沒有 encoder (左邊)，只有 decoder (右邊)。所以打叉的地方相反。BERT 纔是 encoder only (左邊)。不過兩者的架構非常類似。不過 decoder only 架構 output 會 shift right 再接回 input, 稱爲 auto-regression.

Transformer Llama Quantization

pip install accelerate

pip install bitsandbytes

研究一下 Transformers 中量化。

The bitsandbytes integration supports 8bit and 4bit precision data types, which are useful for loading large models because it saves memory (see the bitsandbytes integration guide to learn more). Add the load_in_8bit or load_in_4bit parameters to [~PreTrainedModel.from_pretrained] and set device_map="auto" to effectively distribute the model to your hardware https://huggingface.co/blog/zh/hf-bitsandbytes-integration

transformers 目前支持兩種量化方式：bitsandbytes 和 autogptq，這裏我們先看下bitsandbytes

1、PreTrainedModel 基類

代碼位置：transformers/src/transformers/models/llama/modeling_llama.py

transformers 中的模型如果使用bitsandbytes量化，只需要在 from_pretrained() 中添加相應的字段，舉例子如下：

from transformers import AutoModelForCausalLM

model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)

這裏就分別使用 int8 和 int4 進行量化了。

我們看下類的繼承關係，如下：

OPTConfig -> PretrainedConfig

最終又回到了這個基類中，因此我看看下這個基類中 from_pretrained中關於量化的一些實現，這裏： 1. 因爲 from_pretrained() 很長，因此本章只關注了量化部分的代碼 2. 量化方法中先只能看bitsandbytes相關的代碼，對於最近的引入的 autogptq先不關注

2、初始化

load_in_8bit = kwargs.pop("load_in_8bit", False)
load_in_4bit = kwargs.pop("load_in_4bit", False)

首先嚐試從 kwargs 中獲取 load_in_8bit 和 load_in_4bit，默認爲 False

if quantization_config is None and (load_in_8bit or load_in_4bit):
    quantization_method_from_args = QuantizationMethod.BITS_AND_BYTES
    quantization_config, kwargs = BitsAndBytesConfig.from_dict(
        config_dict={"load_in_8bit": load_in_8bit, "load_in_4bit": load_in_4bit},
        return_unused_kwargs=True,
        **kwargs,
    )
elif quantization_method_from_args == QuantizationMethod.BITS_AND_BYTES:
    load_in_8bit = quantization_config.load_in_8bit
    load_in_4bit = quantization_config.load_in_4bit

    ... ...

然後在 quantization_config 爲 None 的情況下，通過 BitsAndBytesConfig 創建量化配置，要不然quantization_config 中獲取相關配置。

if load_in_8bit or load_in_4bit:
    if not (is_accelerate_available() and is_bitsandbytes_available()):
        raise ImportError(...)

    if torch_dtype is None:
        # We force the `dtype` to be float16, this is a requirement from `bitsandbytes`
        logger.info(
            f"Overriding torch_dtype={torch_dtype} with `torch_dtype=torch.float16` due to "
            "requirements of `bitsandbytes` to enable model loading in 8-bit or 4-bit. "
            "Pass your own torch_dtype to specify the dtype of the remaining non-linear layers or pass"
            " torch_dtype=torch.float16 to remove this warning."
        )
        torch_dtype = torch.float16

    if device_map is None:
        if torch.cuda.is_available():
            device_map = {"": torch.cuda.current_device()}
        else:
            raise RuntimeError("No GPU found. A GPU is needed for quantization.")
        logger.info(
            "The device_map was not initialized."
            "Setting device_map to {'':torch.cuda.current_device()}."
            "If you want to use the model for inference, please set device_map ='auto' "
        )
        if low_cpu_mem_usage is None:
            low_cpu_mem_usage = True

    if from_tf or from_flax:
        raise ValueError(...)

然後是一些簡單的判斷，其中：

會對 accelerate 和 bitsandbytes 包的存在進行判斷；
會對 torch_dtype 進行判斷，因爲 bitsandbytes 對類型有一定要求；
不支持 tf 和 flax ；
等等

3、實例化模型

# Instantiate model.
init_contexts = [no_init_weights(_enable=_fast_init)]

if is_deepspeed_zero3_enabled():
    import deepspeed

    logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
elif load_in_8bit or load_in_4bit or low_cpu_mem_usage:
    init_contexts.append(init_empty_weights())

with ContextManagers(init_contexts):
    model = cls(config, *model_args, **model_kwargs)

忽略 deepspeed_zero3，當使用 load_in_8bit or load_in_4bit 的時候，會調用 accelerate.init_empty_weights來創建一個空的、未初始化的模型權重。這個方法通常用於在加載預訓練模型之前初始化模型權重，以提高加載速度。

4、量化版網絡結構替換

4.1 第一部分

if load_in_8bit or load_in_4bit:
    from .utils.bitsandbytes import get_keys_to_not_convert, replace_with_bnb_linear

    llm_int8_skip_modules = quantization_config.llm_int8_skip_modules
    load_in_8bit_fp32_cpu_offload = quantization_config.llm_int8_enable_fp32_cpu_offload
    ... ...

    # We keep some modules such as the lm_head in their original dtype for numerical stability reasons
    if llm_int8_skip_modules is None:
        modules_to_not_convert = get_keys_to_not_convert(model)
    else:
        modules_to_not_convert = llm_int8_skip_modules

    if not isinstance(modules_to_not_convert, list):
        modules_to_not_convert = [modules_to_not_convert]

    modules_to_not_convert.extend(keep_in_fp32_modules)

這段代碼的主要目的是找到不需要轉化 int8 的module，其中： 1. 首先從 quantization_config 中查找； 2. 如果沒有對應的 value，則利用 get_keys_to_not_convert 這個方法來獲取

4.1.1 get_keys_to_not_convert

代碼位置：src/transformers/integrations/bitsandbytes.py

def get_keys_to_not_convert(model):
    # Create a copy of the model and tie the weights, then
    # check if it contains tied weights
    tied_model = deepcopy(model)  # this has 0 cost since it is done inside `init_empty_weights` context manager`
    tied_model.tie_weights()

    tied_params = find_tied_parameters(tied_model)
    # For compatibility with Accelerate < 0.18
    if isinstance(tied_params, dict):
        tied_keys = sum(list(tied_params.values()), []) + list(tied_params.keys())
    else:
        tied_keys = sum(tied_params, [])
    has_tied_params = len(tied_keys) > 0

    # If there is not tied weights, we want to keep the lm_head（output_embedding) in full precision
    if not has_tied_params:
        output_emb = model.get_output_embeddings()
        if output_emb is not None:
            list_last_module = [name for name, module in model.named_modules() if id(module) == id(output_emb)]
            return list_last_module

    # otherwise, no tied weights, no output embedding defined, simply keep the last module in full precision
    list_modules = list(model.named_parameters())
    list_last_module = [list_modules[-1][0]]
    # add last module together with tied weights
    intersection = set(list_last_module) - set(tied_keys)
    list_untouched = list(set(tied_keys)) + list(intersection)

    # remove ".weight" from the keys
    names_to_remove = [".weight", ".bias"]
    filtered_module_names = []
    for name in list_untouched:
        for name_to_remove in names_to_remove:
            if name_to_remove in name:
                name = name.replace(name_to_remove, "")
        filtered_module_names.append(name)

    return filtered_module_names

這段代碼的主要目的有兩個： 1. 第一個是出於數值穩定性的目的希望一些特殊層如lm_head 不被量化； 2. 第二個是使得綁定權重（tied weight）不被量化。

代碼流程主要如下：

首先，函數創建了模型的一個深拷貝 tied_model，然後嘗試通過 tie_weights() 方法將權重綁定。這是爲了檢查是否存在綁定（tied）的權重。這個方法同樣在 PreTrainedModel中有定義。主要實現的功能是在沒有特殊情況下，會對綁定 input_embeddings 和 output_embedding；如果有特殊設置，也會對其他層進行綁定；
1. 綁定權重是指多個模塊共享相同的權重參數
接着調用 find_tied_parameters()函數檢查是否存在綁定的參數（tied parameters），如果有綁定的參數，則將其對應的鍵（keys）存儲在 tied_keys 變量中；
如果沒有綁定的參數，函數會嘗試獲取模型的輸出嵌入lm_head(output_embedding)層。輸出嵌入通常用於語言模型的最後一層。如果輸出嵌入層存在，則返回它的鍵作爲要保留在全精度的模塊；
如果存在綁定的參數tied_keys（這塊代碼中的註釋有點問題），函數會選擇保留模型的最後一個模塊（通常是最後一層神經網絡層）在全精度。然後，它將最後一個模塊的鍵（keys）與tied_keys合併，以確定要在全精度的模塊；
最後，函數會移除模塊鍵中的 “.weight” 和 “.bias” 後綴，以保證返回的鍵不包含這些後綴。

4.1.2 find_tied_parameters

find_tied_parameters 是 accelerate 包中的函數，主要功能代碼如下面：

if named_parameters is None:
    named_parameters = {n: p for n, p in model.named_parameters()}
else:
    # A tied parameter will not be in the full `named_parameters` seen above but will be in the `named_parameters`
    # of the submodule it belongs to. So while recursing we track the names that are not in the initial
    # `named_parameters`.
    for name, parameter in model.named_parameters():
        full_name = name if prefix == "" else f"{prefix}.{name}"
        if full_name not in named_parameters:
            # When we find one, it has to be one of the existing parameters.
            for new_name, new_param in named_parameters.items():
                if new_param is parameter:
                    if new_name not in result:
                        result[new_name] = []
                    result[new_name].append(full_name)

for name, child in model.named_children():
    child_name = name if prefix == "" else f"{prefix}.{name}"
    find_tied_parameters(child, named_parameters=named_parameters, prefix=child_name, result=result)

如果 named_parameters 爲 None，則首先將模型的所有參數名稱和參數存儲在 named_parameters 字典中，其中鍵是參數名稱，值是參數本身；
接下來，函數開始遞歸地檢查模型的各個子模塊。對於每個子模塊，它會生成一個完整的名稱 full_name，並檢查該名稱是否在初始的 named_parameters 中。如果不在初始的 named_parameters 中，說明這是一個綁定的參數，需要將其添加到 result 中。

4.2 第二部分

model = replace_with_bnb_linear(
    model, modules_to_not_convert=modules_to_not_convert, quantization_config=quantization_config
)
# training in 8-bit is only available in 0.37.0+ but a major bug in 8-bit optimizers was fixed in 0.41.1
model._is_quantized_training_enabled = version.parse(
    importlib.metadata.version("bitsandbytes")
) >= version.parse("0.41.1")

model.config.quantization_config = quantization_config
model.is_8bit_serializable = is_8bit_serializable

這段代碼的目的是將網絡中所有的torch.nn.Linear 層替換成bitsandbytes中定義的量化版本bnb.nn.Linear8bit，從而可以進行 in8 混合精度。

int8 混合精度的實現是將矩陣乘積拆成兩部分進行： 1. 以 fp16 精度進行的離羣值矩陣乘積，計算量佔比（0.01%）； 2. 以 int8 精度進行的常規矩陣乘積，計算量佔比（99.9%）；

相應的方法可以看這個：https://101.dev/t/transformer-8-hugging-face-transformers-accelerate-bitsandbytes/975/1#matmul-6

4.2.1 _replace_with_bnb_linear

代碼位置：src/transformers/integrations/bitsandbytes.py

if (isinstance(module, nn.Linear) or isinstance(module, Conv1D)) and name not in modules_to_not_convert:
    # Check if the current key is not in the `modules_to_not_convert`
    if not any(key in ".".join(current_key_name) for key in modules_to_not_convert):
        with init_empty_weights():
            if isinstance(module, Conv1D):
                in_features, out_features = module.weight.shape
            else:
                in_features = module.in_features
                out_features = module.out_features

            if quantization_config.quantization_method() == "llm_int8":
                model._modules[name] = bnb.nn.Linear8bitLt(
                    in_features,
                    out_features,
                    module.bias is not None,
                    has_fp16_weights=quantization_config.llm_int8_has_fp16_weight,
                    threshold=quantization_config.llm_int8_threshold,
                )
                has_been_replaced = True
            else:
               ... ...
            # Store the module class in case we need to transpose the weight later
            model._modules[name].source_cls = type(module)
            # Force requires grad to False to avoid unexpected errors
            model._modules[name].requires_grad_(False)

核心代碼如上：

通過遞歸遍歷模型的各個子模塊來查找需要替換的模塊；
對於每個子模塊，方法檢查它是否是線性層（nn.Linear）或 1D 卷積層（Conv1D），並且該模塊不在 modules_to_not_convert 列表中。如果滿足條件，就需要替換該模塊；
替換後，將新的線性層添加到模型中，並設置相應的屬性，如是否有偏置、是否使用 16 位權重、閾值等

5、加載量化權重

(
    model,
    missing_keys,
    unexpected_keys,
    mismatched_keys,
    offload_index,
    error_msgs,
) = cls._load_pretrained_model(
    model,
    state_dict,
    ...
    is_quantized=(load_in_8bit or load_in_4bit),
    ...

代碼通過 _load_pretrained_model加載權重，入口參數通過 (load_in_8bit or load_in_4bit)賦值給 is_quantized。

如果 is_quantized=True，則內部在進行權重搬運到 device 的時候會調用 set_module_quantized_tensor_to_device 方法。該方法內部會對 int8 或 int4 量化產生不同的 params，如下：

if is_8bit:
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).to(device)
elif is_4bit:
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)

module._parameters[tensor_name] = new_value

6、總結

最早的大語言模型 (ChatGPT2/3, Llama) 的 context length 只有 768/1K/2K tokens. 在應用爲什麼大語言模型需要 long context (8K or longer)? 簡單說有兩點

處理長文本輸入。例如一篇長的文章做 summary.
對話的記憶長度。例如長的對話有上下文的 context. Question and Answering

因此實務上，long context (4K/8K or longer) 對於應用非常有用。

How to Make Long Context

Naïve Way

最簡單的方法就是設定長的 input token length, 就是以下的 $n_{ctx}$ 例如從 1K/2K 改成 4K/8K/16K/32K. 幾個問題：

整體的 parameter number 並沒有隨著 $n_{ctx}$ 而增加！只有在查字典和 position encoder 增加一些 parameter 。 -> good things 如果我們知道如何 fine-tune 原來的 model 就可以從 1K/2K to 4K-32K!!!!! 不過要修改 position encoder!!!
但是 internal matrix computation 隨著 $n_{ctx}$ 呈現綫性增加。
cache size (of activations) 隨著 $n_{ctx}$ 呈現綫性增加。

另外的問題是需要從新訓練 LLM 使用更長的 context. 例如從目前 Llama2-7B 只有 2K context, 如果要更長 context 就需要從新用更長的 text training. Big effort to train from scratch!!!

Ideal Goal

是否有方法

只要 fine-tune 原來的 1K/2K 的 LLM model parameter 就可以改成 4K-32K, 不過要修改 position encoder.
最好內部的計算和 activation 的 cache size 不需要增加？ too good to be true!!!!

Address Goal 1 (Fine-tune instead of pre-train)

1. Fine-tune use 32K

2. RoPE (Rotation PE) + flash attention : simpler than fine-tune

https://www.youtube.com/watch?v=UPYf3jxcMVY&ab_channel=1littlecoder

LLaMA2上下文長度暴漲至100萬tokens，只需調整1個超參數 (baidu.com)

目前的Transformer位置編碼方法，有絕對位置編碼（將位置信息融入到輸入）、相對位置編碼（將位置信息寫入attention分數計算）和旋轉位置編碼幾種。其中，最火熱的要屬旋轉位置編碼，也就是RoPE了。

RoPE通過絕對位置編碼的形式，實現了相對位置編碼的效果，但與相對位置編碼相比，又能更好地提升大模型的外推潛力。

如何進一步激發採用RoPE位置編碼的大模型的外推能力，也成爲了最近不少研究的新方向。

這些研究，又主要分爲限制注意力和調整旋轉角兩大流派。

限制注意力的代表研究包括ALiBi、xPos、BCA等。最近MIT提出的StreamingLLM，可以讓大模型實現無限的輸入長度（但並不增加上下文窗口長度），就屬於這一方向的研究類型。

調整旋轉角的工作則更多，典型代表如線性內插、Giraffe、Code LLaMA、LLaMA2 Long等都屬於這一類型的研究。

以Meta最近爆火的LLaMA2 Long研究爲例，它就提出了一個名叫RoPE ABF的方法，通過修改一個超參數，成功將大模型的上下文長度延長到3.2萬tokens。

這個超參數，正是Code LLaMA和LLaMA2 Long等研究找出的“開關”——

旋轉角底數（base）。

只需要微調它，就可以確保提升大模型的外推表現。

但無論是Code LLaMA還是LLaMA2 Long，都只是在特定的base和續訓長度上進行微調，使得其外推能力增強。

是否能找到一種規律，確保所有用了RoPE位置編碼的大模型，都能穩定提升外推表現？

來自復旦大學和上海AI研究院的研究人員，針對這一問題進行了實驗。

他們先是分析了影響RoPE外推能力的幾種參數，提出了一種名叫臨界維度（Critical Dimension）的概念，隨後基於這一概念，總結出了一套RoPE外推的縮放法則（Scaling Laws of RoPE-based Extrapolation）。

只需要應用這個規律，就能確保任意基於RoPE位置編碼大模型都能改善外推能力。

先來看看臨界維度是什麼。

對此論文認爲，旋轉角底數更小，能讓更多的維度感知到位置信息，旋轉角底數更大，則能表示出更長的位置信息。

基於這一規律，可以根據不同預訓練和續訓文本長度，來直接計算出大模型的外推表現，換言之就是預測大模型的支持的上下文長度。

反之利用這一法則，也能快速推導出如何最好地調整旋轉角底數，從而提升大模型外推表現。

作者針對這一系列任務進行了測試，發現實驗上目前輸入10萬、50萬甚至100萬tokens長度，都可以保證，無需額外注意力限制即可實現外推。

與此同時，包括Code LLaMA和LLaMA2 Long在內的大模型外推能力增強工作都證明瞭這一規律是確實合理有效的。

這樣一來，只需要根據這個規律“調個參”，就能輕鬆擴展基於RoPE的大模型上下文窗口長度、增強外推能力了。

Address Goal 2 (Reduce computation and activation)

1. Cache size optimization

就是使用 KV cache + flash decoder? to break the 32K into 2K + 2K …. chunks?

Llama Quantization

Source

Llama Quantization

Transformer Llama Quantization

1、PreTrainedModel 基類

2、初始化

3、實例化模型

4、量化版網絡結構替換

4.1 第一部分

4.1.1 get_keys_to_not_convert

4.1.2 find_tied_parameters

4.2 第二部分

4.2.1 _replace_with_bnb_linear

5、加載量化權重

6、總結

How to Make Long Context

Naïve Way

Ideal Goal

Address Goal 1 (Fine-tune instead of pre-train)

1. Fine-tune use 32K

2. RoPE (Rotation PE) + flash attention : simpler than fine-tune

Address Goal 2 (Reduce computation and activation)

1. Cache size optimization

2. MGQ (Multiple Group Query)

Flash Decoder

Appendix