QLoRA (Efficient Fine-tuning of Quantized LLMs)

What is QLoRA?

To make fine-tuning of huge LLMs (such as a 65-billion-parameter model) possible with far less memory, researchers from the University of Washington developed QLoRA: Efficient Fine-tuning of Quantized LLMs. It cuts the memory needed to fine-tune a 65B model from over 780 GB to less than 48 GB, so the job fits on a single 48 GB GPU, while matching the performance of regular 16-bit fine-tuning and without slowing training down. The core idea is quantization: the pretrained model is frozen and stored in 4-bit precision, and gradients are backpropagated through it into small trainable Low-Rank Adapter (LoRA) layers.

How does QLoRA work?

Quantization reduces the precision of numerical data by converting it to a smaller format, for example turning 32-bit floats into 8-bit integers. Less detail is stored, but the values stay close to the originals: each value is rescaled so that it fits into the narrower range of the new data type.
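As a concrete illustration, here is a minimal sketch of the simplest such scheme, absmax quantization to 8-bit integers. This shows the general idea described above, not QLoRA's exact NF4 scheme:

```python
import numpy as np

def quantize_int8(x):
    # Absmax quantization: rescale float32 values so the largest magnitude maps to 127.
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)   # 8 bits per value instead of 32
    return q, scale                           # the scale is the "quantization constant"

def dequantize_int8(q, scale):
    # Recover an approximation of the original float32 values.
    return q.astype(np.float32) / scale

weights = np.random.randn(8).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights)
print(dequantize_int8(q, scale))  # close to the original, at a quarter of the storage
```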

QLoRA combines a few memory-saving techniques, each addressing a different part of the fine-tuning memory footprint:

  • 4-bit NormalFloat (NF4) Quantization: NF4 is a 4-bit data type tailored to the normally distributed weights of pretrained networks, so the frozen base model can be stored in roughly 4 bits per parameter while staying close to its original values. Think of it like resizing a high-resolution image: much smaller, but without losing too much quality. (See the configuration sketch after this list.)

  • Double Quantization: A second step that shrinks memory further by quantizing the quantization constants themselves, i.e. the per-block scaling factors produced by the first quantization pass. This saves an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model) without losing accuracy.

“On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64 · 256) = 0.127 bits, a reduction of 0.373 bits per parameter.” – from the original QLoRA paper.

  • Paged Optimizers: Smart memory management for optimizer states. Using NVIDIA unified memory, optimizer states are automatically paged out to CPU RAM when the GPU runs short of memory and paged back in when they are needed for the update step. This prevents out-of-memory crashes during occasional memory spikes, such as when processing long sequences. (A usage sketch follows below.)
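
To make the first two techniques concrete, here is a minimal sketch of how 4-bit NF4 loading with double quantization is typically configured through the Hugging Face transformers and bitsandbytes libraries. The model name is only a placeholder, and argument names may vary slightly between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base model in 4-bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for forward/backward passes
)

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```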
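
Continuing from the snippet above, here is a sketch of the remaining pieces: attaching LoRA adapters with the peft library and selecting a paged optimizer through the optim argument of transformers' TrainingArguments. The hyperparameter values are illustrative, not prescriptions from the paper:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# Attach small trainable LoRA adapters on top of the frozen 4-bit base model.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# "paged_adamw_32bit" selects a paged optimizer: optimizer states spill to CPU RAM
# via unified memory during GPU memory spikes instead of causing an out-of-memory error.
training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_32bit",
)
```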