Install dependencies

uv add datasets peft trl accelerate

uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
uv pip install "unsloth[cu130] @ git+https://github.com/unslothai/unsloth.git"
  • Datasets: a library for easily accessing and sharing audio, computer vision, and NLP datasets; it can load and process very large datasets with a single line of code.
  • PEFT (Parameter-Efficient Fine-Tuning): enables fine-tuning without updating all model parameters (e.g. via LoRA), training only a tiny fraction of the weights while matching the quality of full fine-tuning.
  • TRL (Transformer Reinforcement Learning): a full-stack library for training Transformer models with reinforcement learning, used mainly for alignment methods such as RLHF (reinforcement learning from human feedback) and DPO.
  • Accelerate: simplifies distributed and mixed-precision training, letting the same PyTorch training script run on CPU, a single GPU, multiple GPUs, or TPUs with minimal code changes.
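
A quick sanity check after installing (a minimal, optional sketch): print the resolved versions and confirm CUDA is visible before going further.

import datasets, peft, trl, accelerate, torch

# Surface environment problems early by printing what was actually installed.
for lib in (datasets, peft, trl, accelerate, torch):
    print(f"{lib.__name__}: {lib.__version__}")
print("CUDA available:", torch.cuda.is_available())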

Model and dataset

Reference: https://colab.research.google.com/drive/1D3g2KPFnEZ7ZF9IP-vxwjQIb1md_o5_X#scrollTo=53bfef85-3657-4f51-bba7-78abff394117

https://huggingface.co/DavidAU/Qwen3.5-2B-GPT-5.1-HighIQ-INSTRUCT
https://huggingface.co/datasets/maomao88/anime-waifu-personality-chat-with-questions

Installation

%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth # Do this in local & cloud setups
else:
    import torch; v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth
    !pip install transformers==5.3.0
    !pip install --no-deps trl==0.22.2

Unsloth

from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in an 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit", # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit", # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit", # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit", # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3.5-4B",
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.3.5: Fast Qwen3_5 patching. Transformers: 5.3.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for qwen3_5 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.

The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d

Configure and load LoRA (Low-Rank Adaptation) fine-tuning parameters for the vision-language model (VLM)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True, # False if not finetuning vision layers
    finetune_language_layers = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules = True, # False if not finetuning MLP layers

    r = 16, # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16, # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)
Unsloth: Making `model.base_model.model.model.visual` require gradients
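
Before training, it is worth confirming how small the LoRA footprint actually is. Assuming the Unsloth wrapper keeps the standard PEFT interface, print_trainable_parameters() reports it directly:

model.print_trainable_parameters()
# e.g. trainable params: 38,756,352 || all params: 4,578,021,888 || trainable%: 0.85
# (these figures match the training banner further below)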

Data preparation

We fine-tune the VL model on a LaTeX OCR dataset: unsloth/LaTeX_OCR

from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")

Let's take a quick look at the dataset. We'll view the third image and the caption it carries.

dataset
Dataset({
    features: ['image', 'text'],
    num_rows: 68686
})
dataset[2]["image"]

[image output: the rendered equation]

dataset[2]["text"]
'H ^ { \\prime } = \\beta N \\int d \\lambda \\biggl \\{ \\frac { 1 } { 2 \\beta ^ { 2 } N ^ { 2 } } \\partial _ { \\lambda } \\zeta ^ { \\dagger } \\partial _ { \\lambda } \\zeta + V ( \\lambda ) \\zeta ^ { \\dagger } \\zeta \\biggr \\} \\ .'
from IPython.display import display, Math, Latex

latex = dataset[2]["text"]
display(Math(latex))

$$H^{\prime} = \beta N \int d\lambda \,\biggl\{ \frac{1}{2\beta^{2}N^{2}}\, \partial_{\lambda}\zeta^{\dagger}\, \partial_{\lambda}\zeta + V(\lambda)\, \zeta^{\dagger}\zeta \biggr\}\,.$$

The dataset needs to be converted into question-answer pairs with the following structure:

[
    { "role": "user",
      "content": [{"type": "text", "text": Q}, {"type": "image", "image": image}]
    },
    { "role": "assistant",
      "content": [{"type": "text", "text": A}]
    },
]
import time

instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]} ]
        },
        { "role": "assistant",
          "content": [
            {"type": "text", "text": sample["text"]} ]
        },
    ]
    return { "messages": conversation }

total = len(dataset)
converted_dataset = []
start_time = time.time()

print(f"🚀 Converting dataset, {total} examples in total")

for i, sample in enumerate(dataset):
    converted_dataset.append(convert_to_conversation(sample))

    # Log progress every 10%
    if (i + 1) % (total // 10) == 0 or (i + 1) == total:
        percent = (i + 1) / total * 100
        elapsed = time.time() - start_time
        print(f"✅ Done: {percent:>3.0f}% | progress: {i+1}/{total} | elapsed: {elapsed:.1f}s")

print("✨ Dataset conversion complete!")
🚀 Converting dataset, 68686 examples in total
✅ Done:  10% | progress: 6868/68686 | elapsed: 2.0s
✅ Done:  20% | progress: 13736/68686 | elapsed: 4.1s
✅ Done:  30% | progress: 20604/68686 | elapsed: 6.9s
✅ Done:  40% | progress: 27472/68686 | elapsed: 9.3s
✅ Done:  50% | progress: 34340/68686 | elapsed: 11.8s
✅ Done:  60% | progress: 41208/68686 | elapsed: 13.8s
✅ Done:  70% | progress: 48076/68686 | elapsed: 16.6s
✅ Done:  80% | progress: 54944/68686 | elapsed: 18.8s
✅ Done:  90% | progress: 61812/68686 | elapsed: 21.1s
✅ Done: 100% | progress: 68680/68686 | elapsed: 24.1s
✅ Done: 100% | progress: 68686/68686 | elapsed: 24.1s
✨ Dataset conversion complete!
converted_dataset[2]
{'messages': [{'role': 'user',
   'content': [{'type': 'text',
     'text': 'Write the LaTeX representation for this image.'},
    {'type': 'image',
     'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=320x50>}]},
  {'role': 'assistant',
   'content': [{'type': 'text',
     'text': 'H ^ { \\prime } = \\beta N \\int d \\lambda \\biggl \\{ \\frac { 1 } { 2 \\beta ^ { 2 } N ^ { 2 } } \\partial _ { \\lambda } \\zeta ^ { \\dagger } \\partial _ { \\lambda } \\zeta + V ( \\lambda ) \\zeta ^ { \\dagger } \\zeta \\biggr \\} \\ .'}]}]}

Testing before fine-tuning

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
 $$H ^ { \prime } = \beta N \int d \lambda \Bigl \{ \frac { 1 } { 2 \beta ^ { 2 } N ^ { 2 } } \partial _ { \lambda } \zeta ^ { \dagger } \partial _ { \lambda } \zeta + V ( \lambda ) \zeta ^ { \dagger } \zeta \Bigr \} \; .$$<|im_end|>
<|endoftext|>

Train the model

Now let's train our model. We do 30 steps to speed things up, but you can set num_train_epochs=1 for a full run and set max_steps=None. We also support DPOTrainer and GRPOTrainer for reinforcement learning!

We use our new UnslothVisionDataCollator which will help in our vision finetuning setup.

from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # For Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    ),
)


# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Unsloth: Model does not have a default image size - using 512
Unsloth: Switching to float32 training since model cannot work with float16
GPU = Tesla T4. Max memory = 14.563 GB.
9.68 GB of memory reserved.
trainer_stats = trainer.train()
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 248046}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 68,686 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 38,756,352 of 4,578,021,888 (0.85% trained)


Unsloth: Will smartly offload gradients to save VRAM!




[30/30 05:33, Epoch 0/1]

Step | Training Loss
-----|--------------
   1 | 0.699990
   2 | 0.861260
   3 | 0.662023
   4 | 0.451824
   5 | 0.364678
   6 | 0.347653
   7 | 0.242224
   8 | 0.133970
   9 | 0.087778
  10 | 0.092887
  11 | 0.052814
  12 | 0.073216
  13 | 0.065354
  14 | 0.026245
  15 | 0.037974
  16 | 0.035780
  17 | 0.047666
  18 | 0.035650
  19 | 0.038626
  20 | 0.059324
  21 | 0.035856
  22 | 0.037750
  23 | 0.022070
  24 | 0.032624
  25 | 0.077887
  26 | 0.059872
  27 | 0.043020
  28 | 0.183679
  29 | 0.044639
  30 | 0.078759
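
The loss falls quickly over the first ten steps and then flattens out. To plot the curve, the underlying Hugging Face Trainer records per-step logs in trainer.state.log_history; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

# With logging_steps = 1 above, every step appends a dict containing "loss" and "step".
history = [h for h in trainer.state.log_history if "loss" in h]
plt.plot([h["step"] for h in history], [h["loss"] for h in history])
plt.xlabel("step")
plt.ylabel("training loss")
plt.show()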

# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
412.4279 seconds used for training.
6.87 minutes used for training.
Peak reserved memory = 9.922 GB.
Peak reserved memory for training = 0.242 GB.
Peak reserved memory % of max memory = 68.132 %.
Peak reserved memory for training % of max memory = 1.662 %.
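
These numbers also give a rough throughput estimate: 30 steps at an effective batch size of 8 (2 per device × 4 gradient-accumulation steps) is 240 examples in about 412 seconds.

# Rough throughput for this run: 240 / 412.4 ≈ 0.58 examples/sec on the T4.
examples_seen = 30 * 8  # max_steps * per_device_train_batch_size * gradient_accumulation_steps
print(f"{examples_seen / trainer_stats.metrics['train_runtime']:.2f} examples/sec")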

Inference

Let’s run the model! You can change the instruction and input - leave the output blank!

We use min_p = 0.1 and temperature = 1.5. Read this Tweet for more information on why.

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
H ^ { \prime } = \beta N \int d \lambda \biggl \{ \frac { 1 } { 2 \beta ^ { 2 } N ^ { 2 } } \partial _ { \lambda } \zeta ^ { \dagger } \partial _ { \lambda } \zeta + V ( \lambda ) \zeta ^ { \dagger } \zeta \biggr \} \ .<|im_end|>
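
The output now matches the reference string from dataset[2]["text"] exactly. For a less anecdotal check, one option is a small exact-match evaluation on the held-out test split. A minimal sketch, assuming the processor returned by Unsloth exposes decode() the way Qwen2-VL-style processors do; note that string-level exact match is a strict metric for LaTeX, since different strings can render identically:

from datasets import load_dataset

test_ds = load_dataset("unsloth/LaTeX_OCR", split = "test")
n, correct = 20, 0  # keep it small; the full test split has 7,632 examples
for sample in test_ds.select(range(n)):
    text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)  # reuses `messages` from above
    batch = tokenizer(sample["image"], text, add_special_tokens = False, return_tensors = "pt").to("cuda")
    out = model.generate(**batch, max_new_tokens = 128, use_cache = True, temperature = 1.5, min_p = 0.1)
    pred = tokenizer.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens = True)
    correct += int(pred.strip() == sample["text"].strip())
print(f"exact match: {correct}/{n}")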

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face’s push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

YOUR_HF_TOKEN = ""
model.save_pretrained("qwen3.5_4B_lora_latex")  # Local saving
tokenizer.save_pretrained("qwen3.5_4B_lora_latex")
model.push_to_hub("Weidows/qwen3.5_4B_lora_latex", token = YOUR_HF_TOKEN) # Online saving
tokenizer.push_to_hub("Weidows/qwen3.5_4B_lora_latex", token = YOUR_HF_TOKEN) # Online saving
Saved model to https://huggingface.co/Weidows/qwen3.5_4B_lora_latex

import torch
import gc

# 1. Explicitly delete the model, tokenizer, and trainer
try:
    del model
    del tokenizer
    del trainer # if you created a trainer, it must be deleted too
except NameError:
    pass

# 2. Force garbage collection
gc.collect()

# 3. Empty the CUDA cache
torch.cuda.empty_cache()

# 4. Unsloth-specific: reset peak-memory tracking (can sometimes trigger system-level reclamation)
torch.cuda.reset_peak_memory_stats()
# torch.cuda.reset_accumulated_stats() # Removed this line as it caused an AttributeError

print(f"Allocated after cleanup: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
Allocated after cleanup: 0.00 MB

Now we load the fine-tuned model and tokenizer from the locally saved path.

from unsloth import FastVisionModel

# Load the fine-tuned model from the local path
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "qwen3.5_4B_lora_latex", # locally saved model path
    load_in_4bit = True, # choose the loading mode as needed; 4bit here
)

# Set the model to inference mode
FastVisionModel.for_inference(model)

print("Fine-tuned model loaded and set to inference mode.")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.3.5: Fast Qwen3_5 patching. Transformers: 5.3.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for qwen3_5 won't work! Using float32.


The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d

Fine-tuned model loaded and set to inference mode.
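
With the adapters reloaded, we can spot-check the model by reusing the inference pattern from earlier (the dataset object is still in memory):

image = dataset[2]["image"]
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Write the LaTeX representation for this image."},
]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(image, input_text, add_special_tokens = False, return_tensors = "pt").to("cuda")
output = model.generate(**inputs, max_new_tokens = 128, use_cache = True, temperature = 1.5, min_p = 0.1)
print(tokenizer.decode(output[0], skip_special_tokens = True))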

Saving to float16 for VLLM

We also support saving to float16 directly. Select merged_16bit for float16. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.

# Select ONLY 1 to save! (Both not needed!)

# Save locally to 16bit
if True: model.save_pretrained_merged("Qwen3.5-4B-LaTeX", tokenizer,)

# To export and save to your Hugging Face account
if True: model.push_to_hub_merged("Weidows/Qwen3.5-4B-LaTeX", tokenizer, token = YOUR_HF_TOKEN)
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Unsloth: Copying 2 files from cache to `Qwen3.5-4B-LaTeX`: 100%|██████████| 2/2 [02:21<00:00, 70.88s/it]
Successfully copied all 2 files from cache to `Qwen3.5-4B-LaTeX`
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [02:45<00:00, 82.63s/it]
Unsloth: Merge process complete. Saved to `/content/Qwen3.5-4B-LaTeX`

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Unsloth: Copying 2 files from cache to `Weidows/Qwen3.5-4B-LaTeX`: 100%|██████████| 2/2 [01:46<00:00, 53.28s/it]
Successfully copied all 2 files from cache to `Weidows/Qwen3.5-4B-LaTeX`
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [04:55<00:00, 147.86s/it]
Unsloth: Merge process complete. Saved to `/content/Weidows/Qwen3.5-4B-LaTeX`
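
The merged 16-bit folder can be served with vLLM. A minimal sketch, assuming your vLLM build supports this model architecture; image inputs go through vLLM's multimodal API, which varies across versions, so only loading and text-side sampling are shown:

from vllm import LLM, SamplingParams

llm = LLM(model = "Qwen3.5-4B-LaTeX")  # path to the merged 16-bit folder saved above
params = SamplingParams(temperature = 1.5, min_p = 0.1, max_tokens = 128)
out = llm.generate(["Write the LaTeX representation for this image."], params)
print(out[0].outputs[0].text)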

GGUF / llama.cpp Conversion

To save to GGUF / llama.cpp, we support it natively now! We clone llama.cpp and default to saving as q8_0. All methods such as q4_k_m are allowed. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our docs page):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.
  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[NEW] To finetune and auto export to Ollama, try our Ollama notebook

# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("Qwen3.5-4B-LaTeX", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if True: model.push_to_hub_gguf("Weidows/Qwen3.5-4B-LaTeX", tokenizer, token = YOUR_HF_TOKEN)

# Save to 16bit GGUF
if True: model.save_pretrained_gguf("Qwen3.5-4B-LaTeX", tokenizer, quantization_method = "f16")
if True: model.push_to_hub_gguf("Weidows/Qwen3.5-4B-LaTeX", tokenizer, quantization_method = "f16", token = YOUR_HF_TOKEN)

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("Qwen3.5-4B-LaTeX", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("Weidows/Qwen3.5-4B-LaTeX", tokenizer, quantization_method = "q4_k_m", token = YOUR_HF_TOKEN)

# Save to multiple GGUF options - much faster if you want multiple!
if True:
    model.push_to_hub_gguf(
        "Weidows/Qwen3.5-4B-LaTeX", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = YOUR_HF_TOKEN,
    )
Unsloth: Merging model weights to 16-bit format...
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Unsloth: Copying 2 files from cache to `Qwen3.5-4B-LaTeX`: 100%|██████████| 2/2 [01:45<00:00, 52.66s/it]
Successfully copied all 2 files from cache to `Qwen3.5-4B-LaTeX`
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [01:58<00:00, 59.14s/it]
Unsloth: Merge process complete. Saved to `/content/Qwen3.5-4B-LaTeX`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Cloning llama.cpp repository...
Unsloth: Building llama.cpp - please wait 1 to 3 minutes
Unsloth: Successfully installed llama.cpp!
WARNING:unsloth_zoo.llama_cpp:Unsloth: Qwen2MoE num_experts patch target not found.
Unsloth: [1] Converting model into f16 GGUF format. This might take 3 minutes...
Unsloth: [2] Converting GGUF f16 into q8_0. This might take 10 minutes...
Unsloth: All GGUF conversions completed successfully!
Generated files: ['Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.Q8_0.gguf', 'Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16-mmproj.gguf']
Unsloth: No Ollama template mapping found for model 'unsloth/Qwen3.5-4B'. Skipping Ollama Modelfile
Unsloth: example usage for Multimodal LLMs: /root/.unsloth/llama.cpp/llama-mtmd-cli -m Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.Q8_0.gguf --mmproj Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16-mmproj.gguf
Unsloth: load image inside llama.cpp runner: /image test_image.jpg
Unsloth: Prompt model to describe the image

The remaining exports repeat the same merge-and-convert sequence ("Unsloth: llama.cpp found in the system. Skipping installation."); their distinctive output follows.

Unsloth: Uploading GGUF to Huggingface Hub...
Uploading Qwen3.5-4B.Q8_0.gguf...
Uploading Qwen3.5-4B.F16-mmproj.gguf...
Uploading config.json...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/Weidows/Qwen3.5-4B-LaTeX
Unsloth: Cleaning up temporary files...

Generated files: ['Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16.gguf', 'Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16-mmproj.gguf']
Uploading Qwen3.5-4B.F16.gguf...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/Weidows/Qwen3.5-4B-LaTeX

Unsloth: [2] Converting GGUF f16 into q4_k_m. This might take 10 minutes...
Generated files: ['Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.Q4_K_M.gguf', 'Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16-mmproj.gguf']
Uploading Qwen3.5-4B.Q4_K_M.gguf...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/Weidows/Qwen3.5-4B-LaTeX

Unsloth: [2] Converting GGUF f16 into q4_k_m. This might take 10 minutes...
Unsloth: [2] Converting GGUF f16 into q8_0. This might take 10 minutes...
Unsloth: [2] Converting GGUF f16 into q5_k_m. This might take 10 minutes...
Generated files: ['/tmp/unsloth_gguf_99z_hskz_gguf/Qwen3.5-4B.Q5_K_M.gguf', '/tmp/unsloth_gguf_99z_hskz_gguf/Qwen3.5-4B.Q8_0.gguf', '/tmp/unsloth_gguf_99z_hskz_gguf/Qwen3.5-4B.Q4_K_M.gguf', '/tmp/unsloth_gguf_99z_hskz_gguf/Qwen3.5-4B.F16-mmproj.gguf']
Uploading Qwen3.5-4B.Q5_K_M.gguf...
Uploading Qwen3.5-4B.Q8_0.gguf...
Uploading Qwen3.5-4B.Q4_K_M.gguf...
Uploading Qwen3.5-4B.F16-mmproj.gguf...
Unsloth: Successfully uploaded GGUF to https://huggingface.co/Weidows/Qwen3.5-4B-LaTeX
Unsloth: Cleaning up temporary files...
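
To try a quantized export locally, use the command the conversion log prints above:

/root/.unsloth/llama.cpp/llama-mtmd-cli -m Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.Q8_0.gguf --mmproj Qwen3.5-4B-LaTeX_gguf/Qwen3.5-4B.F16-mmproj.gguf

Inside the runner, /image test_image.jpg loads an image, after which a plain text prompt (for example the same "Write the LaTeX representation for this image.") asks the model to transcribe it.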