LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique used to adapt large language models (LLMs) such as LLaMA or GPT to new tasks without retraining the entire model.
Instead of updating all the billions of parameters, LoRA:
- Freezes the original model weights (keeps them unchanged)
- Inserts small trainable low-rank matrices into certain layers (usually the attention layers)
- Only trains these small matrices, which are much smaller than the full model
⚙️ How LoRA Works (Simplified)
Imagine an LLM has a large weight matrix W (like 4096×4096). Normally, fine-tuning means updating every entry in W, which is huge.
With LoRA:
- Keep W frozen.
- Add two small matrices:
  - A (size 4096×r)
  - B (size r×4096), where r is small (like 8 or 16)
- Train only A and B.
- At inference time, the effective weight becomes W + A·B.
This drastically reduces the number of trainable parameters.
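To make the mechanics concrete, here is a minimal PyTorch sketch of the idea. The `LoRALinear` class name is purely illustrative (not any particular library's implementation); the α/r scaling factor follows the original LoRA paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the original weights W
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        # A: (in_f x r), B: (r x out_f); only these two matrices are trained
        self.A = nn.Parameter(torch.randn(in_f, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, out_f))  # zero init so W_eff = W at start
        self.scale = alpha / r

    def forward(self, x):
        # effective weight W + A @ B, applied as two small matmuls on top of the frozen base
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# Example: wrap a 4096x4096 projection; only A and B receive gradients
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
```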
📊 Why LoRA is Useful
| Aspect | Full Fine-Tune | LoRA Fine-Tune |
|---|---|---|
| Parameters updated | All (billions) | A few million (<1%) |
| GPU memory needed | Very high | Much lower |
| Training speed | Slow | Fast |
| Sharing | Must share the full model | Just share the small LoRA weights |
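The parameter row follows directly from the matrix shapes above. As a quick back-of-the-envelope check for a single 4096×4096 layer:

```python
# Trainable parameters for one 4096x4096 projection
d, r = 4096, 8
full = d * d                  # 16,777,216 weights updated in full fine-tuning
lora = d * r + r * d          # 65,536 weights updated with LoRA (A plus B)
print(f"LoRA trains {lora / full:.2%} of this layer's parameters")  # ~0.39%
```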
This makes LoRA ideal when:
- You want to customize a big model on a small dataset
- You have limited GPU resources
- You want to train multiple variants of the same base model
📦 Common Uses
- Domain-specific tuning (medical, legal, finance text)
- Instruction tuning or chat-like behavior
- Personalizing models for specific companies or users
- Combining with PEFT (Parameter-Efficient Fine-Tuning) frameworks (see the sketch after this list), such as:
  - 🤗 Hugging Face PEFT
  - 🤖 bitsandbytes
  - 🦙 LLaMA + LoRA (common combo)
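As a rough illustration of how this looks with the 🤗 PEFT library. The model name and `target_modules` below are examples only; they vary by architecture and checkpoint:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Example base model; substitute whatever checkpoint you are adapting
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of A and B
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)      # freezes base weights, injects A/B adapters
model.print_trainable_parameters()        # typically well under 1% of all parameters
```

Training then proceeds as usual (e.g. with the Hugging Face `Trainer`); only the adapter weights are saved, which is why sharing a LoRA fine-tune takes megabytes rather than gigabytes.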
📝 Summary
LoRA = a lightweight way to fine-tune large models by training only tiny "adapter" layers (low-rank matrices) while keeping original weights frozen.
It dramatically reduces cost, time, and storage needs for customizing LLMs.