Deep Dive into LoRA: A Practical Exploration

26 min read · 31 Aug 2025

Introduction#

LoRA, or Low-Rank Adaptation, is a technique introduced to fine-tune large language models efficiently. To put it in numbers: when adapting GPT-3 175B, this technique can reduce the number of trainable parameters by 10,000x, all while keeping performance on par with or better than a fully fine-tuned model. In this blog I try to condense multiple resources and the nuances that come with this technique.

Refresher on Rank of a Matrix#

Some basics first before diving into the concept. The number of linearly independent rows or columns in a matrix is known as the rank of the matrix. I like to think of this in a 3D system, where each row is a vector.

Visual guide to rank(M)#

Example 1:

What is the simplest 3D system that would be linearly independent? The identity matrix $I_3$:

$$I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Though this matrix is made of 9 numbers, its information lies across 3 dimensions. In the following representation we can clearly see that each vector is linearly independent of the others.

  • Rank of this matrix is 3
  • No row can be expressed as a linear combination of the other rows

Example 2:

Similarly, a set of linearly dependent vectors looks like this:

$$M_1 = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

  • Rank of this matrix is 1
  • Row $r_2$ is 3 times row $r_1$

Example 3:

Extending the same concept further, we can have a matrix with rank 2:

$$M_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

  • Rank of this matrix is 2
  • Only the first two rows are linearly independent; the third row is zero

Key Properties of Matrix Rank:

| Property | Mathematical Expression | Description |
| --- | --- | --- |
| Upper bound | $\text{rank}(A) \leq \min(m, n)$ | Rank cannot exceed the smallest dimension |
| Full vs low rank | Full: $\text{rank}(A) = \min(m, n)$ | When rank equals the smallest dimension |
| Multiplication | $\text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B))$ | Product rank is bounded by the minimum of the factors' ranks |
| Addition | $\text{rank}(A + B) \leq \text{rank}(A) + \text{rank}(B)$ | Sum rank is bounded by the sum of ranks |
| Subtraction | $\text{rank}(A - B) \leq \text{rank}(A) + \text{rank}(B)$ | Difference rank follows the same bound as addition |
| Transpose | $\text{rank}(A^T) = \text{rank}(A)$ | Rank is preserved under transpose |
| Zero rank | $\text{rank}(A) = 0 \iff A = 0$ | Only the zero matrix has rank zero |
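To make this concrete, here is a small NumPy sketch that checks the ranks of the three example matrices above and the multiplication bound (the matrices are exactly the examples from this section):

```python
import numpy as np

I3 = np.eye(3)                                    # Example 1: identity, rank 3
M1 = np.array([[1, 0, 0], [3, 0, 0], [0, 0, 0]])  # Example 2: row 2 = 3 * row 1
M2 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])  # Example 3: two independent rows

print(np.linalg.matrix_rank(I3))  # 3
print(np.linalg.matrix_rank(M1))  # 1
print(np.linalg.matrix_rank(M2))  # 2

# Multiplication bound: rank(AB) <= min(rank(A), rank(B))
print(np.linalg.matrix_rank(M2 @ M1))  # 1 == min(2, 1)
```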

What is a Low-Rank Adapter?#

Brief#

LoRA adds a low-rank update to frozen pre-trained weights. Instead of updating the original weight matrix $W_0$ directly, LoRA keeps it frozen and learns a low-rank decomposition $\Delta W = BA$ to adapt the model:

$$W = W_0 + \Delta W = W_0 + BA$$

Where:

$$B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times d}, \quad W_0 \in \mathbb{R}^{d \times d}$$

With $r \ll d$ ($r$ much smaller than $d$), making the adaptation parameter-efficient.

Essentially, we are trying to learn what $\Delta W$ should be added to the existing weights of a model so it adapts to a new task.
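As a minimal sketch of this idea (the class name `LoRALinear` and the initialization details are my own, following the common convention of zero-initializing $B$ so that $BA = 0$ at the start of training):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W0 plus a trainable low-rank update BA (illustrative sketch)."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d, d), requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # r x d, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, trainable, zero-init so BA = 0 initially

    def forward(self, x):
        # h = (W0 + BA) x, computed as W0 x + B(Ax) so the d x d update is never materialized
        return x @ self.W0.T + (x @ self.A.T) @ self.B.T
```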

Visually, it looks like the frozen weight $W_0$ with a parallel low-rank branch: the input passes through $A$ (projecting down to dimension $r$) and then $B$ (projecting back up to $d$), and the two paths are summed.

Where does rank fit in?

  • Notice $r$, i.e. the rank, in the above diagram
  • You choose it as a parameter, generally $r \ll d$
  • Typically $r \in \{8, 16, 32, 64\}$, and it is decided empirically

But how does this make training parameter-efficient?

  • Without LoRA, the learnable weights were $d^2$
  • With LoRA, we only need to learn $2dr = (d \times r + r \times d)$ weights
  • As $r \ll d$, this reduces the number of trainable parameters drastically (see the quick arithmetic below)
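For a feel of the numbers, a quick back-of-the-envelope calculation ($d = 4096$ here is just an illustrative hidden size, not tied to any particular model):

```python
d, r = 4096, 16
full = d * d          # trainable weights without LoRA
lora = d * r + r * d  # trainable weights with LoRA (B and A)
print(full, lora, full / lora)  # 16777216, 131072, 128.0 -> ~128x fewer trainable weights
```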

Is Rank calculated during this process?

  • NO! The rank of a matrix is not calculated during LoRA training or inference
  • When you choose the hyperparameter $r$, you are essentially saying that the matrices $B$ and $A$ will have lower rank than the original matrix $W_0$
  • Upon multiplying matrices $B$ and $A$, the resultant rank is at most $\min(\text{rank}(A), \text{rank}(B)) \leq r$, which is smaller than $\text{rank}(W_0)$
  • Key takeaway: you impose a rank constraint, you do NOT compute the rank (the quick check below illustrates this)
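A quick check of that constraint (the sizes here are arbitrary): the product $BA$ is a full $d \times d$ matrix, but by construction its rank can never exceed $r$, and nothing in training ever computes it.

```python
import torch

d, r = 64, 4
B = torch.randn(d, r)
A = torch.randn(r, d)
delta_W = B @ A  # d x d update built from rank-<=r factors

print(torch.linalg.matrix_rank(delta_W))  # at most 4, purely by construction
```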

What does the rank $r$ determine?

  • How much compression you get
  • How much expressivity you retain
  • How much computation you save

Training: LoRA#

During training:

  • $W_0$ remains frozen (gradients blocked)
  • Only the $A$ and $B$ matrices are updated via backpropagation
  • Forward pass: $h = (W_0 + BA)x$, where $x$ is the input

Visual guide to training LoRA#

[Diagram: in the forward pass, input $x$ goes through $W_0$ (frozen, no gradients) and through the trainable matrices $A$ and $B$; the combination $W_0 + BA$ produces output $h$. In the backward pass, gradients $\nabla A$ and $\nabla B$ flow to the trainable matrices, while $\nabla W_0 = 0$ is blocked (shown as a dotted arrow).]

Key insight: $\nabla W_0 = 0$ means gradients for $W_0$ are blocked/not calculated. $W_0$ remains completely frozen during training; only the small matrices $A$ and $B$ receive gradient updates. The dotted arrow shows this blocked gradient flow.
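Reusing the `LoRALinear` sketch from earlier (an assumption of this example, not a library class), we can observe this gradient behaviour directly:

```python
import torch

layer = LoRALinear(d=16, r=2)   # the sketch class defined above
x = torch.randn(8, 16)
layer(x).sum().backward()

print(layer.W0.grad)            # None -> frozen, no gradient computed
print(layer.A.grad.shape)       # torch.Size([2, 16]) -> receives updates
print(layer.B.grad.shape)       # torch.Size([16, 2]) -> receives updates
```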

Inference: LoRA#

During inference:

  • Option 1: Compute $W_0 + BA$ once and use the merged weights
  • Option 2: Keep them separate and compute $W_0 x + BAx$
  • Adapter swapping: Replace $BA$ with different adapters for different tasks

Option 1: Merged Weights#

Pre-compute the combined weights once, then use standard inference

[Diagram: a setup phase merges $W = W_0 + BA$ once; the inference phase then runs the standard forward pass $Wx$ on input $x$ to produce output $h$.]
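A minimal sketch of the merge with plain tensors (the sizes are made up for illustration, and this is not any specific library's API):

```python
import torch

d, r = 64, 8
W0 = torch.randn(d, d)   # frozen base weight
B = torch.randn(d, r)
A = torch.randn(r, d)
x = torch.randn(1, d)

W = W0 + B @ A           # one-time merge: same shape as W0
h = x @ W.T              # standard forward pass afterwards, no extra latency
```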

Option 2: Separate Computation#

Compute base and LoRA paths separately, then add their outputs

[Diagram: input $x$ flows through two parallel paths, the base path $W_0 x$ and the LoRA path $BAx$; their sum $W_0 x + BAx$ gives the output $h$.]
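Continuing with the tensors from the Option 1 sketch, the separate-path version looks like this; note the multiplication order $B(Ax)$, which keeps the cost to two thin matmuls and never materializes the $d \times d$ update:

```python
h_separate = x @ W0.T + (x @ A.T) @ B.T      # W0 x + B(Ax), base and adapter kept separate
h_merged = x @ (W0 + B @ A).T                # same result as Option 1
print(torch.allclose(h_separate, h_merged, atol=1e-5))  # True, up to float error
```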

Option 3: Adapter Swapping#

Keep multiple task-specific LoRA adapters and swap them dynamically

[Diagram: a frozen base $W_0$ plus a pool of task-specific adapters $B_1A_1$, $B_2A_2$, $B_3A_3$; at request time an adapter $B_iA_i$ is selected dynamically and the model computes $W_0 + B_iA_i$ on input $x$ to produce output $h$.]
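A sketch of adapter swapping with the same plain-tensor setup (the task names and the helper `forward_with_adapter` are made up for illustration):

```python
adapters = {
    "summarize": (torch.randn(d, r), torch.randn(r, d)),  # (B_1, A_1)
    "translate": (torch.randn(d, r), torch.randn(r, d)),  # (B_2, A_2)
}

def forward_with_adapter(x, task):
    B_i, A_i = adapters[task]
    return x @ W0.T + (x @ A_i.T) @ B_i.T  # same frozen W0, different low-rank branch

h = forward_with_adapter(x, "translate")   # switch tasks without touching W0
```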

todo:
  • write about the scaling factor and why it is important
  • how it is initialized

What does LoRA unlock?#

LoRA's flexible inference approaches enable several powerful capabilities:

No Additional Latency#

(Option 1: Merged Weights)

  • Once the weights are merged ($W_0 + BA$), inference runs at the original speed
  • No additional latency introduced during inference
  • Perfect for production deployment where speed matters

Massive Storage Savings#

  • Store one base model + multiple small adapters instead of full fine-tuned models
  • Example: Instead of storing 10 different 7B models (70GB total), store 1 base model (7GB) + 10 LoRA adapters (~50MB each = 500MB)
  • Result: 70GB → 7.5GB (90% storage reduction)

Dynamic Task Switching#

(Option 3: Adapter Swapping)

  • Keep multiple LoRA adapters in memory simultaneously
  • Switch between tasks instantly without reloading models
  • Diffusion model example:
    • Same base Stable Diffusion model
    • Adapter 1: Anime character style
    • Adapter 2: Oil painting style
    • Adapter 3: Photorealistic portraits
    • Switch styles in real-time based on user preference

LoRA and a small neural network#

Let's look at LoRA in action with real experiments on the Llama 3.2 1B model. I conducted systematic experiments to demonstrate LoRA's effectiveness compared to traditional fine-tuning approaches.

Experimental Setup#

  • Model: meta-llama/Llama-3.2-1B (1.24B parameters)
  • Task: Sentiment analysis on IMDB dataset
  • Hardware: NVIDIA A10 (23GB VRAM)
  • Comparison: Baseline vs LoRA fine-tuning

The Reality of Training Large Models#

Baseline Approach (Frozen Layers):

# Had to freeze first 12/22 layers to fit in memory
for param in model.model.layers[:12].parameters():
    param.requires_grad = False
 
Total parameters: 1,235,818,498
Trainable parameters: 505,960,450 (40.9%)
Batch size: 16 (maximum possible)
Test Accuracy: 87.16%

LoRA Approach:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank of the low-rank update
    lora_alpha=32,           # Scaling factor
    lora_dropout=0.1,        # Regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"]
)
 
Total parameters: 1,247,090,690
Trainable parameters: 11,276,290 (0.9%)
Batch size: 16 (same as baseline, but could go higher)
Test Accuracy: 93.84%
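For completeness, the config above is then attached to the base model before training; a short sketch assuming the Hugging Face peft library and that `model` is the loaded Llama 3.2 1B model:

```python
from peft import get_peft_model

model = get_peft_model(model, lora_config)  # wraps the base model with LoRA adapters
model.print_trainable_parameters()          # reports the trainable share (~0.9% here)
```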

The Dramatic Difference#

| Metric | Baseline | LoRA | Improvement |
| --- | --- | --- | --- |
| Trainable Params | 505M (40.9%) | 11M (0.9%) | 97.8% reduction |
| Test Accuracy | 87.16% | 93.84% | +6.68% |
| Parameter Efficiency (accuracy per M trainable params) | 0.0017/M | 0.0832/M | 48.3x better |

Memory Reality Check#

When I tried true full fine-tuning (all 1.24B parameters):

⚠️ CUDA OutOfMemoryError: Tried to allocate 64.00 MiB
   Even with batch size 2 - FAILED on 23GB GPU!

Meanwhile, LoRA succeeded with 8x larger batch size (16):

✅ Peak Memory: 20.57GB
✅ Training successful with much larger batches

Key insight: LoRA doesn't just make training more efficient - it makes training possible where full fine-tuning fails.

Choosing the Right Rank#

The rank $r$ is a crucial hyperparameter that controls the parameter-performance trade-off.

| Rank | Use Case | Trade-off |
| --- | --- | --- |
| $r = 8$ | Simple tasks, maximum efficiency | May underfit complex tasks |
| $r = 16$ | Sweet spot for most tasks | Good balance |
| $r = 32$ | Complex tasks requiring more capacity | More parameters, still efficient |
| $r = 64$ | Very complex tasks | Approaching diminishing returns |

Rule of thumb: Start with $r = 16$; increase if underfitting, decrease if overfitting or if you need more efficiency.

End note#

This brings us to the end of our deep dive into LoRA, from mathematical foundations to hands-on experiments.

But why did I cover this topic? While I was Claude-ing around to research LoRA, the model passed along an interesting statement:

"The Manifold Hypothesis states that real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space."

It got me thinking, where else can we see this principle in action?

  • Facial landmarks: 68 key points capturing the essence of infinite facial expressions
  • Image embeddings: Millions of pixels compressed into meaningful feature vectors
  • LoRA adapters: Complex model adaptations expressed through low-rank decompositions

All of these suggest that meaningful changes often happen in lower-dimensional spaces embedded within high-dimensional ones. LoRA's effectiveness might be capturing this fundamental property of how neural networks actually adapt and learn. Beautiful, isn't it?

All the experiments discussed in this blog have been open-sourced at https://github.com/sagarsrc/lora-experiments/ for you to reproduce and explore further.

I hope you enjoyed this blog. Have fun learning!

References#

Written by Sagar Sarkale