🔹 1. LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning method for large language models (LLMs).
🧠 The Core Idea
- Instead of updating all parameters of a huge LLM (billions of weights), LoRA inserts small trainable matrices (low-rank adapters) into certain layers, usually the attention and/or feed-forward layers.
- During fine-tuning:
  - Base model weights stay frozen (unchanged).
  - Only the small adapter weights are trained (see the sketch after this list).
- This massively reduces:
  - Memory usage 💾
  - Compute cost ⚡
  - Training time ⏱️
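To make the frozen-base / trainable-adapter split concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. It is an illustrative example, not the exact implementation from the LoRA paper or the `peft` library, and the rank and scaling values are just example choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # base model weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(in_f, r) * 0.01)  # (in_features x r)
        self.B = nn.Parameter(torch.zeros(r, out_f))         # (r x out_features), zero-init so training starts from the base model
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path + small trainable low-rank path: base(x) + (x A) B * scale
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# Wrap one 4096 x 4096 layer: only A and B receive gradients.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 adapter parameters vs ~16.8M in the frozen base weight
```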
🔹 LoRA Example
If a weight matrix W is, say, 4096 × 4096, then instead of fine-tuning all ~16.8M parameters, LoRA trains two small matrices:
- A (4096 × r) and B (r × 4096), where r is the rank (say 8 or 16).
- The effective update is ΔW = A·B, so the adapted weight is W' = W + A·B.

So you only train tens of thousands of parameters instead of millions, as the quick count below shows.
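A quick back-of-the-envelope check of those numbers, using the 4096 × 4096 layer and the ranks from the example above:

```python
d = 4096
full = d * d                      # full fine-tuning of this one matrix
for r in (8, 16):
    lora = d * r + r * d          # parameters in A (4096 x r) plus B (r x 4096)
    print(f"r={r}: {lora:,} LoRA params vs {full:,} full params "
          f"({100 * lora / full:.2f}% of the original)")
# r=8:  65,536  vs 16,777,216  (~0.39%)
# r=16: 131,072 vs 16,777,216  (~0.78%)
```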
🔹 2. QLoRA (Quantized LoRA)
QLoRA takes LoRA one step further by adding quantization.
🧠 The Core Idea
- Quantization = compressing the model weights into fewer bits (e.g., 16-bit → 4-bit). This saves GPU memory and makes training possible on smaller hardware.
- QLoRA fine-tunes the quantized model with LoRA adapters on top.

So the recipe is (see the sketch after this list):
- Base model → 4-bit quantized (efficient storage + inference).
- Train only LoRA adapters (small rank matrices).
- Combine them for the final fine-tuned model.
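A typical QLoRA setup with the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries looks roughly like the sketch below. The model name, rank, and target modules are illustrative choices (target module names vary by architecture), not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice of base model

# 1. Base model -> 4-bit quantized (NF4) for efficient storage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# 2. Train only LoRA adapters (small rank matrices) on top of the frozen, quantized base.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers get adapters: architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here the model can be trained with a standard training loop or `Trainer`; the final fine-tuned model is the quantized base plus the learned adapters.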
🔹 Why QLoRA is Powerful
- You can fine-tune 13B+ parameter models on a single consumer GPU (24 GB VRAM); a rough memory estimate follows below.
- Example: Guanaco (from the QLoRA paper) was trained this way, and many community Alpaca- and Vicuna-style fine-tunes use LoRA or QLoRA.
- It democratizes fine-tuning: people without supercomputers can fine-tune LLMs.
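To see why the 24 GB figure works out, here is a back-of-the-envelope estimate of how much memory the base-model weights alone need at different precisions. It ignores activations, adapter gradients, optimizer state, and framework overhead:

```python
def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """GiB needed just to store the model weights at a given precision."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(f"13B @ 16-bit: {weight_gib(13, 16):.1f} GiB")  # ~24.2 GiB: weights alone already overflow a 24 GB card
print(f"13B @  4-bit: {weight_gib(13, 4):.1f} GiB")   # ~6.1 GiB: leaves room for adapters, activations, optimizer state
```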
🔹 LoRA vs QLoRA (Quick Comparison)

| Feature | LoRA | QLoRA |
|---|---|---|
| Base model | Full precision (16-bit/32-bit) | Quantized (4-bit/8-bit) |
| Memory usage | Medium (needs a decent GPU) | Very low (fits big models on consumer GPUs) |
| Training | Adapter training only | Adapter training only (on a quantized model) |
| Speed | Fast | Similar or slightly slower per step (dequantization overhead), but fits where LoRA cannot |
| Trade-off | Slightly more accurate | Possible small accuracy drop due to quantization |
🔹 Visual Analogy
- LoRA = adding small “adjustment knobs” to a giant machine, instead of rebuilding the whole machine.
- QLoRA = compressing the giant machine first, then adding the small adjustment knobs.
✅ In practice:
- Use LoRA if you have strong GPU resources.
- Use QLoRA if you want to fine-tune big models (7B–65B) on a single GPU such as an RTX 3090, RTX 4090, or A100 40GB.