LLM Coding

Tao Zou

2025-07-05

Basic Configuration

nvidia-smi

  • shows NVIDIA GPU information, including the driver version and CUDA version.
nvidia-smi

Create conda env

conda create -n llm_env python=3.10.12 -y

my env

  • requirements.txt
torch==2.5.1+cu121
transformers==4.46.3
peft==0.13.2
datasets==3.2.0
numpy==1.22.2

HPC env

torch==2.3.1
transformers==4.46.3
accelerate==0.34.0
peft==0.13.2
trl==0.14.0
datasets==3.2.0
wandb==0.19.6
numpy==1.23.5

Keep GPU running!

import torch
import time


# create two tensors on the GPU and keep multiplying them so the GPU stays busy
device = torch.device('cuda')
tensor1 = torch.randn(1024, 1024, device=device)
tensor2 = torch.randn(1024, 1024, device=device)
while True:
    tensor3 = tensor1 @ tensor2
    time.sleep(2)
    print(tensor3.device)

VSCode connect to HPC

Make sure the Python REPL extension is installed; it is used for running Python code interactively. It is also suggested that .vscode/settings.json contains the following content:

{
    "python.defaultInterpreterPath": "/opt/miniconda3/envs/pytorch/bin/python"
}

Model

Download model and tokenizer from Huggingface

Before downloading any model from Hugging Face, activate your Hugging Face user access token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B'
model_download_path = '/hpc2hdd/home/tzou317/models'

model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.float32, force_download=True,  cache_dir=model_download_path)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_download_path, force_download=True)
  • torch_dtype: can be set to torch.bfloat16 or torch.float32.

Load model and tokenizer

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = '/hpc2hdd/home/tzou317/models/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/ebf7e8d03db3d86a442d22d30d499abb7ec27bea'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Make sure the directory referenced by model_name contains files like:

config.json
generation_config.json
model-00001-of-000002.safetensors
model-00002-of-000002.safetensors
model.safetensors.index.json
tokenizer.json
tokenizer_config.json

Show model structure:

print(model)
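
A quick follow-up check is the total parameter count (a small sketch, assuming model has been loaded as above):

total = sum(p.numel() for p in model.parameters())
print(f'total parameters: {total / 1e9:.2f}B')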

Tokenizer

from transformers import AutoTokenizer

model_name = '/hpc2hdd/home/tzou317/models/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/ebf7e8d03db3d86a442d22d30d499abb7ec27bea'
tokenizer = AutoTokenizer.from_pretrained(model_name)

Make sure the directory referenced by model_name contains files like:

tokenizer.json
tokenizer_config.json

Tokens

Special Tokens

print(f'bos_token is: {tokenizer.bos_token}, bos_id is: {tokenizer.bos_token_id}')
print(f'eos_token is: {tokenizer.eos_token}, eos_id is: {tokenizer.eos_token_id}')
print(f'pad_token is: {tokenizer.pad_token}, pad_id is: {tokenizer.pad_token_id}')
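
If pad_token prints as None (common for causal LMs), a frequently used workaround, not required by the original setup, is to reuse the eos token:

if tokenizer.pad_token is None:
    # reuse eos as pad so that generate() and batching have a valid pad_token_id
    tokenizer.pad_token = tokenizer.eos_token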

Common operations

# find a token by token_id
mytoken = tokenizer.convert_ids_to_tokens(128011)
print(mytoken)
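
A round-trip between text, token ids, and back is also handy (a minimal sketch using the tokenizer loaded above; the sample text and token string are arbitrary):

# text -> ids -> text
ids = tokenizer.encode('hello world', add_special_tokens=True)
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))

# find a token_id by token string
print(tokenizer.convert_tokens_to_ids('<|User|>'))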

apply_chat_template()

tokenizer.chat_template

python code

print(tokenizer.chat_template)
  1. Variable initialization: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}

  2. Extract the system message: {%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}

  3. Add the beginning-of-sequence token and the system prompt: {{bos_token}}{{ns.system_prompt}}

  4. Handle user messages: {%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}

  5. Handle assistant messages whose content is empty: {%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}

If the assistant's content is empty, the tool-call information is processed instead.

  6. Handle assistant messages whose content is not empty: {%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}

If the assistant's content is not empty, the message is handled according to the state of the ns.is_tool flag. This appears to be related to DeepSeek-R1's </think> tag. The assistant message is wrapped with the <|Assistant|> tag in front and the <|end▁of▁sentence|> tag at the end.

  7. Final part: {%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}

If ns.is_tool is True, the <|tool▁outputs▁end|> tag is appended; if add_generation_prompt is True and ns.is_tool is False, the <|Assistant|> tag is appended.

output

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}
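
To see what the template actually produces, it can be rendered on a toy conversation (the messages below are made up for illustration; the expected form in the comment follows the template above):

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello!'}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# expected form: <bos_token>You are a helpful assistant.<|User|>Hello!<|Assistant|>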

AutoModelForCausalLM

Model Inference

single-turn inference

model.eval()
user_input = "在中文中应该用“随心所欲的创作”还是“所心所欲地创作”"
message = [{'role': 'user', 'content': user_input}]
model_input = tokenizer.apply_chat_template(message, tokenize=True, return_tensors='pt', add_generation_prompt=True).to(model.device)

response = model.generate(
    model_input,
    max_new_tokens=1000,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
output = tokenizer.decode(response[0], skip_special_tokens=True)
print(output)
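
If more varied responses are wanted, sampling can be turned on; this reuses model_input from the snippet above, and the temperature and top_p values are illustrative choices, not taken from the original setup:

response = model.generate(
    model_input,
    max_new_tokens=1000,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(response[0], skip_special_tokens=True))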

multi-turn inference

model.eval()
messages = []
for idx, user_input in enumerate(iter(lambda: input("请输入你的问题:"), "")):
    messages.append({'role': 'user', 'content': user_input})
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to("cuda" if torch.cuda.is_available() else "cpu")
    generated_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
        num_return_sequences=1,
        do_sample=True,
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    response = tokenizer.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(response)
    messages.append({'role': 'assistant', 'content': response})  # keep the assistant reply in the history for the next turn
    

PEFT (LoRA)

LoRA fine-tune

Suppose \(r=16\); then the number of trainable parameters is reduced to \(\frac{4096\times16\times2}{4096\times4096}\approx0.8\%\) of the original.

\[\boldsymbol{\Delta W}_{4096\times4096}=\boldsymbol{B}_{4096\times16}\boldsymbol{A}_{16\times4096}\]

\(\boldsymbol{A}\) is initialized from \(\mathcal{N}(0, 1)\), and \(\boldsymbol{B}\) is initialized to \(\boldsymbol{0}\). The different initialization methods are a kind of balance between random search and stable training.
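
The shapes and the parameter ratio can be checked with a few lines of plain PyTorch (a minimal sketch that mirrors the formula above):

import torch

d, r = 4096, 16
A = torch.randn(r, d)    # A initialized from N(0, 1)
B = torch.zeros(d, r)    # B initialized to zeros, so delta_W starts at zero
delta_W = B @ A          # shape (4096, 4096), the low-rank update
print(delta_W.shape)
print((A.numel() + B.numel()) / (d * d))  # 0.0078..., i.e. about 0.8%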

Code

different datasource

From csv

from peft import LoraConfig, TaskType, PeftModel, LoftQConfig, get_peft_model
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments
import wandb

wandb.login(key="...")

dataset = load_dataset('csv', data_files={
    'train': '../train_dataset.csv',
    'validation': '../dev_dataset.csv'
})
peft_output_dir = '../fine-tune/LoRA_Layer'

def formatting_func(dataset):
    formatted_texts = []
    for i in range(len(dataset['prompts'])):  # careful: formatting_func receives a batch (a dict of columns), so iterate by index
        system_message = '下面是小明与小红两个人之间的对话,你需要模仿小红的讲话风格和内容然后与小明进行聊天。\n'
        user_message = dataset['prompts'][i]
        assistant_message = dataset['responses'][i]
        message = [
            {'role': 'system', 'content': system_message},
            {'role': 'user', 'content': user_message},
            {'role': 'assistant', 'content': assistant_message}
        ]
        text = tokenizer.apply_chat_template(
            message, 
            tokenize=False, 
            add_generation_prompt=False,
            bos_token=tokenizer.bos_token,
            eos_token=tokenizer.eos_token)
        formatted_texts.append(text)
    return formatted_texts
# print(formatting_func(dataset['train'])[0])

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=64,
    lora_alpha=32,
    use_rslora=True,
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head"
    ]
    # init_lora_weights='loftq',
    # loftq_config=loftq_config
)
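
# Optional check (not part of the original script): preview how many parameters LoRA
# would train. Note that get_peft_model() injects adapters into `model` in place, so if
# you run these two lines, do not pass peft_config to SFTTrainer again below.
# peft_model = get_peft_model(model, peft_config)
# peft_model.print_trainable_parameters()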

training_arguments = SFTConfig(
    output_dir=peft_output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1,
    load_best_model_at_end=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    eval_strategy="steps",
    eval_steps=0.15,
    max_grad_norm=0.3,
    auto_find_batch_size=False,
    save_total_limit=3,
    gradient_accumulation_steps=16,
    save_steps=50,
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    bf16=False,
    warmup_ratio=0.01,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="wandb",
    neftune_noise_alpha=5,
    max_seq_length=3000,
    packing=False
)

instruction_template = '<|User|>'
response_template = '<|Assistant|>'
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    formatting_func=formatting_func,
    data_collator=collator,
    tokenizer=tokenizer,
    args=training_arguments
)
trainer.train()

trainer.model.save_pretrained(peft_output_dir)

When setting tokenizer.add_bos_token = False, the training data returned by train_dataloader = trainer.get_train_dataloader() has only one bos_token at the beginning of each text. Without tokenizer.add_bos_token = False, each text ends up with two bos_tokens; I don't know why, but I suspect the cause lies in SFTTrainer().
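
A quick way to check this (a small sketch that assumes the trainer and tokenizer from the CSV example above, with tokenizer.add_bos_token set before the trainer is built):

batch = next(iter(trainer.get_train_dataloader()))
decoded = tokenizer.decode(batch['input_ids'][0], skip_special_tokens=False)
# count how many bos tokens appear in the first sample
print(decoded.count(tokenizer.bos_token))  # expect 1 with add_bos_token = False, 2 otherwise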

From jsonl

The .jsonl file format looks like:

{"conversations": [{"role": "user", "content": "你好,你是谁"}, {"role": "assistant", "content": "我是由华为公司开发的大模型"}]}
{"conversations": [{"role": "user", "content": "你是什么大模型?"}, {"role": "assistant", "content": "我是由华为公司开发的大模型"}]}
{"conversations": [{"role": "user", "content": "你是由谁开发的?"}, {"role": "assistant", "content": "我是由华为公司开发的大模型"}]}
...
from datasets import load_dataset

dataset = load_dataset("json", data_files="myjsonl.jsonl", split="train")  # 文本未预先被分割,暂时就默认将其整个当作训练集
dataset = dataset.rename_column("conversations", "messages")
dataset = dataset.train_test_split(test_size=0.1)
train_data = dataset["train"]
valid_data = dataset["test"]

peft_config = LoraConfig(...)

training_arguments = SFTConfig(...)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    train_dataset=train_data,
    eval_dataset=valid_data,
    data_collator=collator,
    tokenizer=tokenizer,
    args=training_arguments
)

trainer.train()

Verify the tokenizer chat template:

sample = dataset["train"][0]["messages"]
print(sample)
print(tokenizer.apply_chat_template(sample, tokenize=False))

finetune on multi-GPU

  1. remove device_map="auto" from AutoModelForCausalLM.from_pretrained().

  2. accelerate config

  3. accelerate launch --multi_gpu train.py

data_collator

trainer.get_train_dataloader()

import itertools

train_dataloader = trainer.get_train_dataloader()
sample = next(itertools.islice(train_dataloader, 100, 101))

# The three tensors all have the same shape: (batch_size, seq_length)
print(sample['input_ids'].shape, sample['attention_mask'].shape, sample['labels'].shape, sep='\n')

first_sample = {k: v[0] for k, v in sample.items()}
input_text = tokenizer.decode(
    first_sample["input_ids"], 
    skip_special_tokens=False
)
print(input_text)

wandb

The API key for wandb.login() can be found at https://wandb.ai/site.

The training process will be reported to wandb.
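
Optionally, the run can be named explicitly before trainer.train() is called (the project and run names below are hypothetical):

import wandb

wandb.init(project="llm-lora", name="deepseek-r1-distill-lora-sft")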

LoRA Layer merge

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, "path/to/peft/checkpoint")
model = model.merge_and_unload()
print("Lora layer merged successfully")

Adapter & Prefix Tuning

Adapter tuning adds adapter layers as trainable parameters after the self-attention module (before the residual connection) and after the MLP module (before the residual connection).
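
A minimal sketch of such an adapter block in PyTorch (the bottleneck size, module names, and example shapes are illustrative; this is not the fine-tuning code used above):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> nonlinearity -> up-project, added back to the input."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # the adapter output is added to hidden_states, i.e. inserted before the block's residual connection
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# usage: place one Adapter after self-attention and one after the MLP in each transformer block
adapter = Adapter(hidden_size=4096)
x = torch.randn(1, 8, 4096)
print(adapter(x).shape)   # torch.Size([1, 8, 4096])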