MMLU on GPT

Source

[Running Ollama on Google Colab (Free Tier): A Step-by-Step Guide, Anoop Maurya, Medium, May 2024](https://medium.com/@mauryaanoop3/running-ollama-on-google-colab-free-tier-a-step-by-step-guide-9ef74b1f8f7a)

MMLU-Pro dataset: 'TIGER-Lab/MMLU-Pro', from [MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (arXiv:2406.01574)](https://arxiv.org/abs/2406.01574)

Introduction

The plan is to build a large-model/small-model pairing: Llama3-8B or Mistral-7B as the small model, and gpt-4o (presumably > 100B parameters) as the large model.

The first step is to establish a baseline on an MMLU-Pro subset.

Notebooks: mmlu_pro_ollama_llama3.ipynb and mmlu_pro_gpt.ipynb

|      | Large model | Small model |
|------|-------------|-------------|
| MMLU | gpt-4o      |             |

Baseline

|                          | GPT-3.5 | GPT-4o | Llama3-8B | Mistral-7B |
|--------------------------|---------|--------|-----------|------------|
| History (381 questions)  | 44%     | 43%    | 27%-31%   | 15%-28%    |

Colab program: mmlu_gpt4o.ipynb

MMLU Pro Dataset

The first step is to load the dataset. Here we use TIGER-Lab/MMLU-Pro, a more robust and challenging extension of MMLU.

!pip install datasets
import datasets

# Load the MMLU-Pro dataset from the Hugging Face Hub
dataset = datasets.load_dataset('TIGER-Lab/MMLU-Pro')

The structure of the dataset is as follows:

DatasetDict({
    test: Dataset({
        features: ['question_id', 'question', 'options', 'answer', 'answer_index', 'cot_content', 'category', 'src'],
        num_rows: 12032
    })
    validation: Dataset({
        features: ['question_id', 'question', 'options', 'answer', 'answer_index', 'cot_content', 'category', 'src'],
        num_rows: 70
    })
})

The simplified category list for the full MMLU-Pro set is:

['computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 'health', 'physics', 'business', 'philosophy', 'economics', 'other', 'psychology', 'history']
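
The baseline table above was run on a single category. A minimal sketch of how such a per-category subset can be selected (filtering on the category field is my assumption; the original notebook may slice the data differently):

# Build the History-only subset of the test split (illustrative; not from the original notebook)
history_subset = dataset['test'].filter(lambda x: x['category'] == 'history')
print(len(history_subset))  # 381 questions, per the baseline table above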

A single record looks like this:

print(dataset['test']['question_id'][0])
print(dataset['test']['question'][0])
print(dataset['test']['options'][0])
print(dataset['test']['answer'][0])
print(dataset['test']['answer_index'][0])
print(dataset['test']['category'][0])
#############################################
70 # id
Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence.
['Safe practices, Fear, Jealousy, Trivial', 'Unsafe practices, Distress, Joy, Trivial', 'Safe practices, Wants, Jealousy, Trivial', 'Safe practices, Distress, Fear, Trivial', 'Unsafe practices, Wants, Jealousy, Serious', 'Safe practices, Distress, Jealousy, Serious', 'Safe practices, Wants, Fear, Serious', 'Unsafe practices, Wants, Fear, Trivial', 'Unsafe practices, Distress, Fear, Serious']
I # answer
8 # answer_index
business # category
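
Note that answer is simply the letter form of the 0-based answer_index (here 'I' is index 8). A quick check, written for this note rather than taken from the original notebook:

# Verify that 'answer' is the letter at position 'answer_index'
letters = [chr(ord('A') + i) for i in range(10)]  # MMLU-Pro questions have up to 10 options
rec = dataset['test'][0]
assert rec['answer'] == letters[rec['answer_index']]  # 'I' == letters[8]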

OpenAI API

Pricing of GPT-4o versus GPT-3.5 Turbo as of 2024/6/19.

GPT-3.5 Turbo is roughly 10x cheaper, but its accuracy is also lower.

(Figure: OpenAI API pricing for GPT-4o and GPT-3.5 Turbo, captured 2024/6/19.)
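
As a rough worked example of that 10x gap (the per-1M-token prices below are the June 2024 list prices as I recall them, and the token counts per question are pure assumptions, so treat the result as an order-of-magnitude estimate):

# Back-of-the-envelope cost estimate for the 381-question History subset
# Prices in USD per 1M tokens (assumed June-2024 list prices); token counts are guesses
PRICES = {
    'gpt-4o':        {'input': 5.00, 'output': 15.00},
    'gpt-3.5-turbo': {'input': 0.50, 'output': 1.50},
}
N_QUESTIONS = 381
IN_TOKENS_PER_Q = 1200   # 4-shot prefix + question + options (assumption)
OUT_TOKENS_PER_Q = 300   # chain-of-thought style answer (assumption)

for model, p in PRICES.items():
    cost = N_QUESTIONS * (IN_TOKENS_PER_Q * p['input'] + OUT_TOKENS_PER_Q * p['output']) / 1e6
    print(f'{model}: ~${cost:.2f}')
# -> gpt-4o: ~$4.00, gpt-3.5-turbo: ~$0.40 under these assumptions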

GPT Setup

The second step is to call the GPT API. Key points:

  • Model configuration:
    • model = "gpt-4o" or "gpt-3.5-turbo-0125"
    • temperature: T = 0 is greedy decoding; T = 0.1 still favors stable, near-deterministic answers, while T = 1 favors creativity.
    • max_tokens: the maximum number of tokens in the generated completion (not the full context length or KV cache size).
    • top_p, frequency_penalty, presence_penalty: left at their defaults (1, 0, 0) here.
  • A structured message list as the input prompt:
    • role: "system" sets the persona; "user" carries the actual question.
    • content: the text of the question (the actual input). Image content can presumably be added as well for multimodal models.
  • The returned response mirrors the message structure:

    • message.content: the answer text.

    • choices: the API can return several alternative completions when the n parameter is greater than 1; here only choices[0] is used.

!pip install openai
from openai import OpenAI
client = OpenAI(api_key='put the key')  # replace with your actual API key

def run_one_question(question: str):
    response = client.chat.completions.create(
        model="gpt-4o",  # or "gpt-3.5-turbo"
        messages=[
            {
                "role": "system",
                "content": "You are a knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as `The answer is ...`."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        temperature=0.1,   # near-greedy decoding for stable answers
        max_tokens=4096,   # cap on the completion length
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    return response.choices[0].message.content
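
Before looping over the whole benchmark, a quick smoke test of the call helps (the toy question below is made up and assumes the API key above is valid):

# One-off smoke test of run_one_question (hypothetical toy question)
demo_q = ('Q: Which planet is known as the Red Planet?\n'
          'Options are:\n(A): Venus\n(B): Mars\n(C): Jupiter\n(D): Saturn\n')
print(run_one_question(demo_q))  # expect a reply ending in "The answer is (B)"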
import re
import random
import json
from tqdm import tqdm


def form_options(options: list):
    # Turn the raw option list into an "(A): ... (B): ..." block
    option_str = 'Options are:\n'
    opts = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
    for opt, o in zip(options, opts):
        option_str += f'({o}): {opt}' + '\n'
    return option_str


def get_prediction(output):
    # Extract the letter from "The answer is (X)"; fall back to a random guess
    pattern = r"answer is \(?([ABCDEFGHIJ])\)?"
    match = re.search(pattern, output)
    if match:
        return match.group(1)
    else:
        print("extraction failed, do a random guess")
        return random.choice(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])


# Bookkeeping (prompts is assumed to hold the 4-shot prefix per category,
# built from the validation split; its construction is not shown here)
answers = []
success, fail = 0, 0
per_category_accuracy = {c: [0, 0] for c in dataset['test'].unique('category')}
file = open('mmlu_pro_results.jsonl', 'w')  # output file name is an assumption

print('----------------- Start Answering -------------------')
pbar = tqdm(dataset['test'], desc='Processing', unit='question')
for entry in pbar:
    prefix = prompts[entry['category']]  # prefix consists of examples, 4 shots
    query = prefix + 'Q: ' + entry['question'] + '\n' + form_options(entry['options']) + '\n'
    answer = run_one_question(query)  # answer is from GPT
    entry['solution'] = answer  # store the GPT answer under 'solution'; entry['answer'] already holds the ground truth
    answers.append(entry)
    prediction = get_prediction(answer)  # extract the letter A, B, ... from the GPT answer
    if entry["answer"] == prediction:    # compare ground truth with GPT answer
        success += 1
        per_category_accuracy[entry['category']][0] += 1
    else:
        fail += 1
        per_category_accuracy[entry['category']][1] += 1

    json_string = json.dumps(entry)
    file.write(json_string + '\n')

    success_rate = success / (success + fail)
    pbar.set_description(f'Processing (Success rate: {success_rate:.4f})')

for k, v in per_category_accuracy.items():
    if v[0] + v[1] > 0:
        print('accuracy: ', k, v[0] / (v[0] + v[1]))
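
Because every entry (including the raw GPT answer under 'solution') is appended to the JSONL file, a finished run can be re-scored offline without calling the API again. A minimal sketch, assuming the mmlu_pro_results.jsonl file name used above and reusing get_prediction:

# Re-score a saved run from the JSONL output (close the output file first: file.close())
import json
from collections import defaultdict

per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, wrong]
with open('mmlu_pro_results.jsonl') as f:
    for line in f:
        entry = json.loads(line)
        pred = get_prediction(entry['solution'])  # extraction helper defined above
        per_cat[entry['category']][0 if pred == entry['answer'] else 1] += 1

for cat, (ok, bad) in sorted(per_cat.items()):
    print(f'{cat}: {ok / (ok + bad):.3f} over {ok + bad} questions')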