Friday, September 5, 2025

LoRA vs QLoRA

 

🔹 1. LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning method for large language models (LLMs).

🧠 The Core Idea

  • Instead of updating all parameters of a huge LLM (billions of weights), LoRA inserts small trainable matrices (low-rank adapters) into certain layers (usually attention and/or feed-forward layers).

  • During fine-tuning:

    • Base model weights stay frozen (unchanged).

    • Only the small adapter weights are trained.

This massively reduces:

  • Memory usage 💾

  • Compute cost

  • Training time ⏱️


🔹 LoRA Example

If a weight matrix is W (say 4096 × 4096), instead of fine-tuning all ~16M parameters, LoRA trains two small matrices:

  • A (4096 × r) and B (r × 4096), where r is the rank (say 8 or 16).

  • The effective update is:

    W' = W + A × B

So you only train about 65K parameters (2 × 4096 × 8 for r = 8) instead of roughly 16 million.
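As a concrete illustration, here is a minimal PyTorch sketch of that update: a frozen base layer plus the two small trainable matrices. The class name LoRALinear and the α/r scaling factor are illustrative details added here, not part of the example above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update A @ B (sketch)."""
    def __init__(self, in_features=4096, out_features=4096, r=8, alpha=16):
        super().__init__()
        # Stand-in for a pretrained layer; its weights stay frozen
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False
        # Trainable adapters: A (in x r), B (r x out); B starts at zero so W' = W initially
        self.A = nn.Parameter(torch.randn(in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, out_features))
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + A @ B, applied without materializing W'
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 adapter parameters vs ~16.8M in the frozen W
```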


🔹 2. QLoRA (Quantized LoRA)

QLoRA takes LoRA one step further by adding quantization.

🧠 The Core Idea

  • Quantization = compressing model weights into fewer bits (e.g., 16-bit → 4-bit).
    This saves GPU memory and makes training possible on smaller hardware.

  • QLoRA keeps the quantized base model frozen and fine-tunes LoRA adapters (kept in 16-bit) on top of it.

So, putting it together (a code sketch follows these steps):

  1. Base model → 4-bit quantized (efficient storage + inference).

  2. Train only LoRA adapters (small rank matrices).

  3. Combine for final fine-tuned model.
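In practice this recipe maps onto the Hugging Face stack (transformers + bitsandbytes for 4-bit loading, peft for the adapters). The sketch below is a minimal, non-authoritative example: the model id, rank, and target modules are placeholders, and exact arguments can vary between library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Load the base model quantized to 4-bit (NF4) via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# 2. Attach small LoRA adapters; only these are trained
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of the model

# 3. Train with your usual Trainer loop, then save or merge the adapters
#    to get the final fine-tuned model.
```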


🔹 Why QLoRA is Powerful

  • You can fine-tune 13B+ parameter models on a single consumer GPU (24GB VRAM): a 13B model needs ~26 GB just for weights in fp16, but only ~6.5 GB in 4-bit.

  • Example: Guanaco was trained with QLoRA, and many community Alpaca/Vicuna-style fine-tunes use LoRA or QLoRA.

  • Enables democratization → people without supercomputers can fine-tune LLMs.


🔹 LoRA vs QLoRA (Quick Comparison)

Feature       | LoRA                              | QLoRA
Base Model    | Full precision (16-bit/32-bit)    | Quantized (4-bit/8-bit)
Memory Usage  | Medium (needs a decent GPU)       | Very low (fits big models on consumer GPUs)
Training      | Adapter training only             | Adapter training only (on the quantized base)
Speed         | Fast                              | Each step can be slightly slower (dequantization overhead), but far less memory is needed
Trade-off     | Slightly more accurate            | Small accuracy drop possible due to quantization
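To make the memory and parameter numbers above concrete, here is a rough back-of-the-envelope sketch. The 7B model size, hidden size 4096, rank 16, and 64 adapted projection matrices are illustrative assumptions, not figures from the table.

```python
# Rough numbers behind the comparison above (illustrative assumptions).
n_params = 7e9
print(f"fp16 base weights : {n_params * 2 / 1e9:.1f} GB")    # ~14.0 GB
print(f"4-bit base weights: {n_params * 0.5 / 1e9:.1f} GB")   # ~3.5 GB

d, r, n_adapted = 4096, 16, 64
adapter_params = 2 * d * r * n_adapted   # A (d x r) + B (r x d) per adapted matrix
print(f"LoRA adapter params: {adapter_params / 1e6:.1f} M")   # ~8.4 M, ~0.1% of 7B
```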

🔹 Visual Analogy

  • LoRA = Adding small “adjustment knobs” to a giant machine, instead of rebuilding the whole machine.

  • QLoRA = Compressing the giant machine first, then adding the small adjustment knobs.


In practice:

  • Use LoRA if you have enough GPU memory to hold the base model in 16-bit precision.

  • Use QLoRA if you want to fine-tune big models (7B–65B) on a single GPU, such as an RTX 3090/4090 (24 GB) or an A100 40GB.
