GPTQ + LoRA Finetuning from Scratch
Efficient Reasoning LLM is a minimal, transparent, and fully hand-crafted implementation of quantization-aware LoRA finetuning on top of LLaMA 2–7B using GPTQ.
This project was born out of a desire to deeply understand what happens under the hood when you combine low-rank adaptation with post-training quantization. Over a weekend, I built every component of the pipeline by hand in PyTorch, without using any external quantization or LoRA libraries.
Key Components:
- Manual GPTQ Quantization:
  - Implemented block-wise GPTQ quantization from scratch using Hessian approximations and backsolve error correction (see the sketch below).
  - Supported both int8 and int4 precision, with optional fused 4-bit packing for memory savings.
  - Saved quantized weights with per-channel `scale` and `zero` points for accurate dequantization.
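To make the quantization step concrete, here is a minimal sketch of the core GPTQ update, processing one column at a time (the block-wise version batches columns for speed). It assumes per-channel symmetric quantization and a float Hessian; the function name and signature are illustrative, not the project's actual API:

```python
import torch

def gptq_quantize(W, H, bits=4, damp=0.01):
    """Quantize W (out_features x in_features) one column at a time,
    using the inverse Hessian from calibration data to push each
    column's quantization error onto the not-yet-quantized columns."""
    W = W.clone().float()
    cols = W.shape[1]
    qmax = 2 ** (bits - 1) - 1

    # Per-output-channel symmetric scale (zero point is 0 in this sketch).
    scale = (W.abs().amax(dim=1) / qmax).clamp(min=1e-8)

    # Dampen the Hessian diagonal for stability, then invert via Cholesky.
    H = H.float()
    H = H + damp * H.diagonal().mean() * torch.eye(cols)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))

    Q = torch.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        Q[:, j] = q
        # Backsolve-style correction: spread this column's rounding error
        # over the remaining columns, weighted by the inverse Hessian row.
        err = (w - q * scale) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q.to(torch.int8), scale
```

The Hessian comes from calibration: for a linear layer with inputs stacked as rows of `X`, `H ≈ 2 * X.T @ X` over a few hundred samples, and running the same routine with `bits=8` gives the int8 path.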
- Custom LoRA Layer Injection:
  - Developed a custom `LoRALinear` module that can be fused with quantized layers (see the sketch below).
  - Injected LoRA adapters into the `q_proj`, `k_proj`, `v_proj`, and `o_proj` layers only.
  - Ensured a stable forward pass even when the base weights are quantized.
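Here is a minimal sketch of that combination: a LoRA adapter wrapped around a frozen base layer, plus an injection helper. The module and function names are my own for illustration, and the base layer is assumed to handle dequantization inside its own forward:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Trainable low-rank update on top of a frozen base layer (any module
    exposing in_features/out_features, e.g. a dequantizing quantized linear)."""
    def __init__(self, base, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B stays zero
        self.scaling = alpha / rank
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapters train

    def forward(self, x):
        # Quantized base output plus the low-rank correction.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

def inject_lora(model, targets=("q_proj", "k_proj", "v_proj", "o_proj")):
    """Swap the attention projections for LoRA-wrapped versions in place."""
    for module in model.modules():
        for name, child in list(module.named_children()):
            if name in targets:
                setattr(module, name, LoRALinear(child))
```

Because `lora_B` starts at zero, the wrapped model is numerically identical to the quantized base at step 0, which helps keep early training stable.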
- Finetuning and Evaluation:
  - Fine-tuned the model on the GSM8K dataset for 3 epochs of supervised training.
  - Achieved 73.01% exact-match accuracy, competitive with existing methods that combine 4-bit quantization and LoRA.
  - Built an evaluation script that parses the model's answers, compares them to ground truth, and computes final accuracy (see the sketch below).
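The evaluation logic boils down to pulling the final number out of each completion and comparing it against the gold answer. A minimal sketch, assuming GSM8K's `#### <answer>` convention for references; the regex and helper names are illustrative:

```python
import re

def extract_answer(text):
    """Pull the final number from a completion or a GSM8K reference."""
    if "####" in text:                 # GSM8K gold answers end in "#### <num>"
        text = text.split("####")[-1]
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

def exact_match(predictions, references):
    """Fraction of examples whose parsed final answers agree."""
    hits = sum(extract_answer(p) == extract_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```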
Results:
- Demonstrated that quantization-aware LoRA training can reach ~73% accuracy on GSM8K even when only a small fraction of the model is trained.
- Validated the effectiveness of training on quantized weights without full finetuning or mixed-precision hacks.
- Memory-efficient: quantized layers shrank to as little as ~16MB (from 80MB+ originally) without significant performance loss.
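As a rough illustration of where those savings come from, fused 4-bit packing stores two int4 weights per byte, a 4x reduction over fp16 before counting the per-channel scale/zero metadata. A sketch of the packing, with illustrative names:

```python
import torch

def pack_int4(q):
    """Pack int4 values (stored in int8, range [-8, 7]) two per byte."""
    q = (q.flatten() + 8).to(torch.uint8)       # shift to [0, 15]
    if q.numel() % 2:
        q = torch.cat([q, q.new_zeros(1)])      # pad to an even count
    return (q[0::2] << 4) | q[1::2]             # high nibble | low nibble

def unpack_int4(packed, numel):
    """Invert pack_int4, recovering the original flat int8 tensor."""
    high = (packed >> 4).to(torch.int8) - 8
    low = (packed & 0xF).to(torch.int8) - 8
    return torch.stack([high, low], dim=1).flatten()[:numel]
```

At load time, `unpack_int4` restores the int4 values before dequantizing with the stored scales.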
Future Directions:
- Implement self-consistency sampling to boost accuracy during inference.
- Add support for fused int4 kernels, and extend quantization beyond the MLP layers to attention.
- Explore finetuning on larger math reasoning datasets like MetaMathQA or GSM-Hard.
- Extend this project to support verifier-based reranking and CoT reranking pipelines.
This project has been a hands-on deep dive into efficient LLM adaptation — no fluff, just real math, code, and learning.
You can explore the full codebase here:
GitHub – Efficient Reasoning LLM