Saturday, September 13, 2025

What is the Transformers library

 

🤖 What is the Transformers library

Transformers is an open-source Python library by Hugging Face that provides:

  • Pre-trained transformer models

  • Easy APIs to load, train, and use them

  • Support for tasks like text, vision, audio, and multi-modal AI

It is the most widely used library for working with LLMs (Large Language Models).


⚙️ What it Contains

Here’s what the transformers library gives you:

🧠 Pre-trained models

  • 1000+ ready-to-use models like:

    • GPT, BERT, RoBERTa, T5, LLaMA, Falcon, Mistral, BLOOM, etc.

  • Downloaded automatically from the Hugging Face Hub

⚒️ Model classes

  • AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM, etc.

  • These automatically select the right architecture class for a model
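For example, an Auto class inspects the checkpoint's config and instantiates the matching architecture. A minimal sketch (using the small gpt2 checkpoint so it downloads quickly):

from transformers import AutoModelForCausalLM, AutoTokenizer

# AutoModelForCausalLM reads the checkpoint's config.json and builds the
# matching architecture class behind the scenes (here, GPT2LMHeadModel)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(type(model).__name__)  # GPT2LMHeadModel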

📄 Tokenizers

  • Converts text ↔ tokens (numbers) for the model

  • Very fast (often implemented in Rust)
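As a quick illustration, here is a sketch of the round trip with the gpt2 tokenizer (any checkpoint works the same way; the exact IDs depend on the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# text -> token IDs
ids = tokenizer("Hello world")["input_ids"]
print(ids)                    # e.g. [15496, 995]

# token IDs -> text
print(tokenizer.decode(ids))  # "Hello world"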

📦 Pipelines

  • High-level API to run tasks quickly, for example:

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    print(generator("Once upon a time"))

🏋️ Training utilities

  • Trainer and TrainingArguments for fine-tuning

  • Works with PyTorch, TensorFlow, and JAX
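A minimal fine-tuning sketch with Trainer, assuming the datasets library is installed and a PyTorch backend; the checkpoint and dataset here are just illustrative choices:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "distilbert-base-uncased"              # illustrative checkpoint
dataset = load_dataset("imdb", split="train[:1%]")  # tiny slice, just for demonstration

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()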


📊 Supported Tasks

Task                 | Example
---------------------|------------------------------
Text Generation      | Chatbots, storytelling
Text Classification  | Spam detection, sentiment
Question Answering   | QA bots
Translation          | English → French
Summarization        | Summarizing articles
Token Classification | Named entity recognition
Vision/Multimodal    | Image captioning, VQA

💡 Why It’s Popular

  • Huge model zoo (open weights)

  • Unified interface across models

  • Active community and documentation

  • Compatible with Hugging Face ecosystem: Datasets, Accelerate, PEFT (LoRA)


📌 Summary

transformers is the go-to library for using and fine-tuning state-of-the-art AI models — especially large language models — with just a few lines of code.

What is LoRA (Low-Rank Adaptation)

 


LoRA is a parameter-efficient fine-tuning technique used to adapt large language models (LLMs) like LLaMA, GPT, etc., to new tasks without retraining the entire model.

Instead of updating all the billions of parameters, LoRA:

  • Freezes the original model weights (keeps them unchanged)

  • Inserts small trainable low-rank matrices into certain layers (usually attention layers)

  • Only trains these small matrices, which are much smaller than the full model


⚙️ How LoRA Works (Simplified)

Imagine an LLM has a large weight matrix W (like 4096×4096).

Normally, fine-tuning means updating all entries in W → which is huge.

With LoRA:

  1. Keep W frozen.

  2. Add two small matrices:

    • A (size 4096×r)

    • B (size r×4096) — where r is small (like 8 or 16)

  3. Train only A and B.

  4. At inference time, the effective weight becomes:

    W' = W + A × B

This drastically reduces the number of trainable parameters.
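Here is a minimal sketch of the idea in PyTorch (not the actual PEFT implementation; the alpha/r scaling factor from the LoRA paper is included for completeness):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update A @ B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # 1. keep W (and bias) frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)  # 2. small matrices
        self.B = nn.Parameter(torch.zeros(r, base.out_features))        #    B starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # effective weight is W + A @ B (scaled), but it is never materialized
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
full = 4096 * 4096                                   # ~16.8M params in W
lora = sum(p.numel() for p in [layer.A, layer.B])    # 2 * 4096 * 8 = 65,536
print(f"trainable fraction: {lora / full:.2%}")      # ~0.39%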


📊 Why LoRA is Useful

Aspect             | Full Fine-Tune        | LoRA Fine-Tune
-------------------|-----------------------|------------------------------
Parameters updated | All (billions)        | Few million (<<1%)
GPU memory need    | Very high             | Very low
Training speed     | Slow                  | Fast
Sharing            | Must share full model | Just share small LoRA weights

This makes LoRA ideal when:

  • You want to customize a big model on a small dataset

  • You have limited GPU resources

  • You want to train multiple variants of the same base model


📦 Common Uses

  • Domain-specific tuning (medical, legal, finance text)

  • Instruction tuning or chat-like behavior

  • Personalizing models for specific companies or users

  • Combining with PEFT (Parameter-Efficient Fine-Tuning) frameworks (a short code sketch follows this list), like:

    • 🤗 Hugging Face PEFT

    • 🤖 bitsandbytes (quantization, commonly combined with LoRA as QLoRA)

    • 🦙 LLaMA + LoRA (common combo)
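For example, attaching LoRA adapters with the Hugging Face PEFT library looks roughly like this. The target_modules names are architecture-specific assumptions; q_proj/v_proj are typical for LLaMA/Mistral-style models:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_config = LoraConfig(
    r=8,                                  # rank of the A/B matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers get adapters (architecture-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # e.g. "trainable params: ~3.4M || all params: ~7B"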


📝 Summary

LoRA = a lightweight way to fine-tune large models by training only tiny "adapter" layers (low-rank matrices) while keeping original weights frozen.
It dramatically reduces cost, time, and storage needs for customizing LLMs.

Friday, September 5, 2025

ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set llm_int8_enable_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

 

That error usually pops up when you try to load a quantized model (like 4-bit QLoRA) with Hugging Face’s transformers and bitsandbytes, but your GPU doesn’t have enough VRAM to hold all the model layers. By default, Hugging Face tries to put everything on the GPU, and when it can’t, it raises this ValueError.


🔹 Why the Error Happens

  • The model is too large for your GPU (e.g., a 13B model on 8GB VRAM).

  • Some layers got “dispatched” (offloaded) onto CPU or disk.

  • You didn’t explicitly allow CPU offloading, so transformers raised a ValueError.


🔹 Solutions

✅ Option 1: Enable CPU Offload (Recommended if low VRAM)

You can allow the model to split between GPU + CPU:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    llm_int8_enable_fp32_cpu_offload=True  # ✅ allows CPU fallback in FP32 (set on the config, not on from_pretrained)
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"  # automatically spread across GPU + CPU
)

✅ Option 2: Force Specific Device Map

If you want more control, specify which parts go where:

device_map = {
    "transformer.wte": "cuda:0",  # embeddings on GPU
    "transformer.h": "cpu",       # some layers on CPU
    "lm_head": "cuda:0"
}
# Note: module names are architecture-specific (these are GPT-2-style names;
# a Mistral model uses names like "model.embed_tokens", "model.layers", "lm_head").

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # llm_int8_enable_fp32_cpu_offload is already set on the config
    device_map=device_map
)

✅ Option 3: Use a Smaller Model

If you only have ~8–12 GB VRAM, try:

  • A genuinely smaller model, such as Gemma-2B

  • Note that LLaMA-2-7B and Mistral-7B-Instruct-v0.1 are the same 7B size as v0.2, so switching between 7B checkpoints alone won't reduce VRAM needs (a 7B model in 4-bit usually fits in ~8 GB, though)


✅ Option 4: Use accelerate for Better Device Placement

pip install accelerate

Then run:

# device_map="auto" relies on the accelerate library to place layers on GPU/CPU/disk
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

This lets accelerate decide where to put layers across GPU/CPU/Disk.
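You can also cap how much memory each device is allowed to use with the max_memory argument, which accelerate uses when building the device map. A sketch (the sizes below are placeholders, adjust them to your hardware):

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},  # GPU 0 capped at 10 GiB, the rest offloaded to CPU
)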


🔹 Key Takeaway

  • If GPU VRAM < model size → must offload to CPU/disk.

  • Set llm_int8_enable_fp32_cpu_offload=True in your BitsAndBytesConfig and pass device_map="auto".

  • Or use a smaller model to fit fully in GPU.
