Fine-tuning Small Language Models (SLMs) is an essential step in adapting them to specific tasks and improving their performance. However, the challenge lies not solely in the fine-tuning itself, but in doing so efficiently, especially when computational resources are limited and cost-effectiveness is a key consideration.
To explore the nuances of fine-tuning SLMs, the Encora Generative AI team ran more than 100 training sessions covering the following Parameter-Efficient Fine-Tuning (PEFT) techniques: LoRA, QLoRA, DoRA, and QDoRA. Our goal was to optimize fine-tuning while keeping resource usage in check.
This article shares some of the insights gained from those experiments, focusing on the process, resources, and results of fine-tuning an SLM within a resource-constrained environment.
Overview of Fine-Tuning Techniques
LoRA (Low-Rank Adaptation) fine-tunes small low-rank matrices instead of the entire model, reducing the number of trainable parameters. DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each weight matrix into magnitude and direction components and applies the low-rank update to the direction, which often improves performance over LoRA. QLoRA and QDoRA take this further by quantizing the base model's weights, saving even more memory.
For a deeper dive into the theory and intuition behind these techniques, please see our previous post: "Comparing Fine-tuning Optimization Techniques (LoRA, QLoRA, DoRA, and QDoRA)." It provides a solid foundation for understanding the principles that guide these fine-tuning methods.
Experiment Setup
To gain deeper insights into the application of these optimization techniques, we conducted a series of experiments using the Phi-3-Mini-4k-Instruct model and the Hugging Face ecosystem.
Our goal was to evaluate the effectiveness of LoRA, DoRA, QLoRA, and QDoRA in a resource-constrained environment. The experiments were performed on an NVIDIA Tesla V100 16GB GPU, and we focused on metrics such as memory usage, training time, and model performance.
The following findings highlight our observations on the effectiveness of these techniques under these practical limitations.
Results and Insights
Implementation Details
During the experiments, we found that all of the techniques were straightforward to implement thanks to their compatibility with existing frameworks. With Hugging Face's PEFT library, developers can leave most of the complexity behind the scenes and focus only on configuration.
This makes introducing LoRA to a training run as simple as adding a configuration object, and with that as a foundation, implementing QLoRA, DoRA, and QDoRA requires only adding a few extra pieces to your code. The process mirrors the nested design of Russian Matryoshka dolls: each layer builds seamlessly on top of the previous one.
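To make this concrete, here is a minimal sketch of the innermost doll: a plain LoRA setup with the PEFT library. The rank, alpha, and target module names are illustrative choices, not the exact values from our training runs.

```python
# Minimal LoRA sketch with Hugging Face PEFT (illustrative hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank matrices
    lora_alpha=32,                          # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # module names are model-specific
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # shows the drop in trainable parameters
```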
Our findings:
- The ease of LoRA implementation comes from Hugging Face's PEFT library.
- In theory, implementing QLoRA is slightly more complex than standard LoRA (which, as previously mentioned, is not difficult to implement) due to the additional quantization steps. However, the BitsAndBytesConfig class, which exposes the bitsandbytes quantization options, makes it fairly straightforward: all developers really need to set up is the quantization configuration (see the sketch after this list).
- By the time we conducted these training sessions, DoRA was already integrated into the PEFT library, so we did not have to dive into the original repository as we initially thought. The adaptation is as simple as changing a couple of settings in the LoRA configuration objects.
- QDoRA adds that quantization-enabling object back into the mix, combining the quantization configuration with the DoRA setting, as sketched below.
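The sketch below shows how the outer layers stack on top of the same LoRA configuration: a BitsAndBytesConfig enables the quantized variants (QLoRA/QDoRA), and a single flag in the LoRA configuration switches DoRA on. As before, the specific values are illustrative rather than our exact training settings.

```python
# Quantization (QLoRA/QDoRA) plus the DoRA flag layered onto the LoRA setup.
# Illustrative values; use_dora requires a recent version of the peft library.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(        # the quantization-enabling object
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,     # drop this argument for plain LoRA/DoRA
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,                      # the DoRA switch; False gives QLoRA
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```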
The hyperparameters used have a major impact on the resources required for training. Therefore, the real challenge when working in an environment with limited resources is choosing them well. During that process, developers are likely to encounter memory errors, especially if the GPU is undersized.
This leads us to the inevitable question: What size GPU is needed?
Memory
Determining the ideal GPU size is not trivial. There is no one-size-fits-all equation, and there are several factors to consider.
A 4B-parameter model in FP16 format (two bytes per parameter) needs a minimum of 8 GB of memory just to store its weights, and that 8 GB is only the bare minimum.
In practice, the actual minimum memory is one that can accommodate three key components:
- The model itself (8 GB in this case)
- The gradients (which typically match the size of the model, so another 8 GB)
- A batch of input data (size varies depending on the batch size and input dimensions)
Therefore, a more realistic minimum memory requirement would be at least 16 GB for full fine-tuning. By utilizing LoRA or its variants, the amount of memory required for gradients can be significantly reduced, which alleviates some of the memory pressure.
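The back-of-envelope arithmetic looks like this (a sketch only: it deliberately ignores optimizer states and activations, which add further overhead in practice):

```python
# Rough memory estimate for fully fine-tuning a 4B-parameter model in FP16.
params = 4e9
bytes_per_param = 2                              # FP16 = 2 bytes per parameter

weights_gb = params * bytes_per_param / 1e9      # ~8 GB for the weights
gradients_gb = weights_gb                        # gradients mirror the weights, ~8 GB

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {gradients_gb:.0f} GB")
print(f"minimum before batch data: {weights_gb + gradients_gb:.0f} GB")
```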
One of the first lessons we learned when conducting these training sessions was that obtaining the recommended GPUs (NVIDIA A100, NVIDIA A6000, NVIDIA H100) proved to be a challenge due to the high commercial demand and limited availability. These GPUs, while superior in performance, are often difficult to access on cloud platforms like Azure and Google Colab.
We ended up using an NVIDIA Tesla V100 16GB, which was the best option available to us in Azure Machine Learning Studio. The V100, while powerful, has less memory and computational capacity than the A100. This decision played a central role throughout the whole project, influencing our choices and strategies at every stage.
Working with a 3.8B parameter model, we faced significant limitations in our testing due to memory constraints. A V100 is insufficient for fully fine-tuning a 4B model, making PEFT techniques not just beneficial but necessary.
Through our experiments, we frequently ran into Out-Of-Memory (OOM) issues. We discovered that the hyperparameters that most heavily impact memory usage, regardless of technique, are:
- Batch size: the number of training examples processed in one iteration. Larger batch sizes consume more memory because more data is processed simultaneously.
- An initial test with a batch size of 1 provides a useful baseline for understanding the training memory demands.
- Context window: the amount of previous text the model can consider at once. A larger context window allows the model to take more information into account but requires additional memory to store this data.
- A key strategy for managing OOM errors on the V100 was reducing the context window.
- Knowing the average token length of the data helps here: if examples tokenize to an average of roughly 700 tokens, a 2048-token context window is unnecessary, and reducing it to 1024 tokens frees up a significant amount of memory. (A sketch of both knobs follows this list.)
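Here is a minimal sketch of how these two knobs appear in a typical Hugging Face training script. The dataset field name and the trainer setup are generic placeholders, not our exact code.

```python
# Batch size and context window: the two settings that dominated memory usage.
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # start at 1 to get a memory baseline
    gradient_accumulation_steps=2,   # effective batch size of 2
    num_train_epochs=1,
)

MAX_LENGTH = 1024  # reduced from 2048; our examples averaged roughly 700 tokens

def tokenize(example, tokenizer):
    # "text" is a placeholder field name for the training examples
    return tokenizer(example["text"], truncation=True, max_length=MAX_LENGTH)

# After a short run, inspect the peak memory that was actually needed:
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```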
Here are some results focusing on the memory metrics from our experiments. We fine-tuned with an effective batch size of 2, one epoch, and nearly 20K examples from the Open-Platypus dataset:
Memory metrics in V100 experiments
Main observations:
- All techniques made effective use of the GPU, showing high utilization.
- QLoRA had the highest memory allocation at 11.78 GB, followed by QDoRA. Surprisingly, both techniques, despite being designed to optimize memory usage through quantization, consumed more memory than anticipated.
- While the theoretically expected order was DoRA, LoRA, QDoRA, QLoRA (from highest to lowest memory usage), the actual results on the V100 GPU revealed a different order: QLoRA, QDoRA, DoRA, LoRA, challenging initial expectations.
As noted in the table, the quantized versions often showed higher memory usage than expected. This counterintuitive observation underscores the need for close attention to the practical realities of working with limited hardware.
The quantization process, while ultimately saving memory, requires an initial overhead that can be significant, especially on small and already saturated GPUs. While memory-saving techniques like QLoRA and QDoRA are valuable, they add layers of complexity to the training process.
However, it’s important to note that the initial memory overhead experienced would likely be more than offset by the overall memory savings on GPUs with greater capacity, allowing for the handling of much larger models or increased batch sizes.
Overall, with regard to memory, the main takeaways are:
- A key strategy for managing OOM errors on the V100 was to reduce the context window.
- Unexpectedly, the quantized versions consumed more memory, which we attribute to the memory occupied by the adapters and the quantization overhead.
- Hyperparameter adjustments had minimal effect on memory usage compared to context window changes.
- For optimal performance, one must always prioritize GPUs with more memory over those with faster processing speeds but limited memory. Memory capacity is crucial, significantly affecting the context window size and batch size.
- GPUs with less than 24 GB of memory are likely to run into the memory-related issues listed above.
Trainable Parameters
As discussed in our previous blog, we anticipated significant savings in the number of trainable parameters compared to full fine-tuning, which updates all of the model's parameters, and our pilot study confirmed those savings.
Training Time
Running time for the experiments using the V100
The expectation based on theory was that the order from fastest to slowest should be: LoRA > QLoRA > DoRA > QDoRA. However, the V100 experiments surprised us, revealing QLoRA as the fastest technique, outperforming even LoRA.
This unexpected result highlights how specific hardware constraints, like those of the V100, can influence the performance of different fine-tuning techniques.
As the results were surprising, we secured access to an A100 GPU on Google Colab Pro+ for the final set of experiments, allowing us to compare its performance with the V100.
Running time for the experiments using the A100
Here, the order matched the theoretical expectation precisely: LoRA > QLoRA > DoRA > QDoRA.
LoRA outperformed the other techniques, completing training approximately 15% faster than QLoRA, 20% faster than DoRA, and 29% faster than QDoRA. This performance boost highlights the efficiency of LoRA on more powerful hardware.
Main observations from contrasting A100 vs V100:
- The time differences between techniques were more pronounced on the A100 compared to the V100, highlighting the impact of advanced hardware on optimizing fine-tuning processes.
- According to the ratio calculations, all techniques demonstrate that the A100 GPU is 3 to 4 times faster than the V100:
- LoRA: 609 min (V100) / 153 min (A100) ≈ 3.98
- QLoRA: 563 min (V100) / 180 min (A100) ≈ 3.13
- DoRA: 631 min (V100) / 191 min (A100) ≈ 3.30
- QDoRA: 713 min (V100) / 216 min (A100) ≈ 3.30
- The ratios are fairly consistent across techniques, suggesting that each technique scales well with increased GPU power.
- The fact that QLoRA was faster than LoRA on the V100 but slower on the A100 suggests that quantization overhead has a more noticeable impact when computational resources are limited (V100). This overhead is less of an issue on the A100, where LoRA’s simplicity allows it to take the lead.
Evaluation Time
We consider it valuable to include the evaluation time across different benchmarks. It’s important to note that all evaluation times were measured using the A100 GPU. This decision was made after observing time estimates in the logs. Given our earlier finding that the V100 is approximately three times slower than the A100, evaluation on the V100 would have led to impractically long sessions.
Evaluation time in benchmarks using an A100
- We can see that QDoRA had the longest total evaluation time at 153.7 minutes.
- HellaSwag was consistently the most time-consuming benchmark across all techniques. MMLU also required considerable time.
- LoRA had the fastest total evaluation time at 70.3 minutes, demonstrating its efficiency not only in training but also during evaluation.
- QLoRA's total evaluation time was only slightly longer than LoRA's and still relatively efficient. The differences between the two were minor, reinforcing that while QLoRA introduces some overhead, it remains competitive in evaluation time.
Overall, regarding speed, the main findings were:
- Training speed hierarchy: LoRA (fastest) > QLoRA > DoRA > QDoRA (slowest). LoRA was approximately 15% faster than QLoRA, 20% faster than DoRA, and 29% faster than QDoRA.
- Training duration varies significantly with model size: smaller models take hours to half a day, while larger models extend to multiple days.
- LoRA and QLoRA had faster evaluation times compared to QDoRA and DoRA. In the HellaSwag benchmark, which had the longest evaluation time, LoRA's evaluation was approximately 59.95% faster.
- Evaluating models using general benchmarks can sometimes be almost as resource-intensive and time-consuming as the training process itself.
Cost
With the training and evaluation times established, and knowing that GPU usage is billed on an hourly basis, we can now calculate the costs associated with each technique. By comparing the costs, we gain a clearer understanding of the financial implications of using different fine-tuning methods on various GPUs.
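The arithmetic is simple: convert minutes of GPU time to hours and multiply by the hourly rate. The rates below are approximations consistent with the tables that follow (roughly $3.06/hour for our V100 instance and roughly $3.67/hour for the A100), so treat them as illustrative rather than official pricing.

```python
# Cost = GPU time in hours x hourly rate (rates are approximate).
def training_cost(minutes: float, hourly_rate: float) -> float:
    return round(minutes / 60 * hourly_rate, 2)

print(training_cost(563, 3.06))  # QLoRA on the V100 -> ~28.71 (table: $28.72)
print(training_cost(713, 3.06))  # QDoRA on the V100 -> ~36.36 (table: $36.38)
print(training_cost(153, 3.67))  # LoRA on the A100  -> ~9.36  (table: $9.37)
```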
Training Costs
Training costs for the experiments using the V100
Observations:
- QLoRA had the lowest total cost at $28.72, which correlates with its shorter training time.
- QDoRA had the highest total cost at $36.38, due to its longer training time.
- LoRA remains competitive, with only slightly higher costs than QLoRA, suggesting that while it’s slightly more expensive, it may still be a good balance between cost and simplicity.
Training costs for the experiments using the A100
Observations:
- LoRA had the lowest total cost at $9.37, making it the most cost-effective technique on the A100.
- QDoRA had the highest total cost at $13.19, due to its longer training time. Similar to the V100 results, QDoRA remains the most expensive technique.
Main observations from contrasting A100 vs V100:
- LoRA’s efficiency on the A100 is even more pronounced compared to the V100, with the lowest costs across all metrics, making it a clear choice for cost-conscious scenarios.
- QLoRA and DoRA both offer moderate cost savings, with QLoRA slightly more cost-effective than DoRA, which aligns with their respective training times.
- The cost savings are more substantial on the A100 across all techniques, reflecting the better performance and lower costs per unit of work on this more advanced hardware.
Evaluation costs
Using the A100 GPU, we calculated the evaluation costs for each technique across the MMLU, TruthfulQA, and HellaSwag benchmarks.
Evaluation costs using an A100
Observations:
- LoRA had the lowest total evaluation cost at $4.30, correlating with its shortest total evaluation time (70.3 minutes).
- QDoRA incurred the highest evaluation cost at $9.40, reflecting its longer total evaluation time (153.7 minutes).
- HellaSwag was the most time-consuming benchmark across all techniques, particularly for QDoRA (76.0 minutes) and DoRA (75.6 minutes), which contributed significantly to their higher overall evaluation costs.
We then calculated the corresponding evaluation costs if they were to be run on a V100 GPU, applying the finding from our training speed analysis that the A100 is 3 to 4 times faster than the V100. For these calculations, we used a factor of 3.5 to estimate the costs.
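The projection itself is a one-liner: scale the A100 evaluation time by 3.5 and price the result at the V100's hourly rate (again assuming roughly $3.06/hour):

```python
# Project V100 evaluation cost from measured A100 evaluation time.
A100_TO_V100_FACTOR = 3.5
V100_HOURLY_RATE = 3.06  # approximate rate, as above

def projected_v100_cost(a100_minutes: float) -> float:
    v100_minutes = a100_minutes * A100_TO_V100_FACTOR
    return round(v100_minutes / 60 * V100_HOURLY_RATE, 2)

print(projected_v100_cost(70.3))  # LoRA's total A100 evaluation time -> ~$12.55
```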
Projection: Evaluation costs using a V100
This projection highlights how leveraging more advanced hardware like the A100 can lead to substantial savings in both time and cost.
Cost Relationship Based on Data Percentage
We conducted experiments using the LoRA technique exclusively, varying the percentage of data from 5% to 100%. The goal was to understand how scaling the amount of data impacts the overall cost, particularly when running on the V100 GPU.
Cost Analysis Based on Data Percentage on V100
This analysis shows that the costs scale predictably with the amount of data used. This linearity was first detected during our initial training experiments, where we recorded training times.
We strongly recommend starting with small training runs, as there is no predefined formula to calculate these costs. Starting with a smaller data percentage provides valuable guidance for projecting the time and costs of larger-scale training sessions, allowing for more informed planning and budgeting.
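A minimal sketch of that recommendation: time a run on a small slice of the data and project linearly to the full dataset. The 30-minute figure and the hourly rate below are hypothetical placeholders.

```python
# Project full-run time and cost from a short run on a slice of the data.
def project_full_run(minutes_on_slice, slice_fraction, hourly_rate=3.06):
    full_minutes = minutes_on_slice / slice_fraction
    full_cost = full_minutes / 60 * hourly_rate
    return full_minutes, full_cost

# e.g., if 5% of the data trains in 30 minutes on the V100:
minutes, cost = project_full_run(30, 0.05)
print(f"projected full run: {minutes:.0f} min, ~${cost:.2f}")
```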
Performance
Benchmark Evaluation
We assessed the models' performance across several benchmarks, focusing on MMLU, TruthfulQA, and HellaSwag. These benchmarks were chosen to represent different aspects of model understanding and reasoning.
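One common way to run these benchmarks is EleutherAI's lm-evaluation-harness; the sketch below illustrates that route rather than our exact evaluation setup, and the adapter path is a placeholder.

```python
# Running MMLU, TruthfulQA, and HellaSwag with lm-evaluation-harness (illustrative).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-3-mini-4k-instruct,peft=path/to/lora-adapter",
    tasks=["mmlu", "truthfulqa_mc2", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```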
Benchmark evaluation scores
- All techniques achieved the same MMLU score of 0.679. The choice of technique did not influence performance on this benchmark.
- On TruthfulQA, LoRA stands out with the highest score of 47.63, outperforming the other techniques by a noticeable margin.
- LoRA also achieved the highest HellaSwag score of 0.581, slightly ahead of DoRA (0.573).
- On HellaSwag, QLoRA and QDoRA both had lower scores (0.561), suggesting that quantization might slightly affect performance on this benchmark.
Moreover, these generic benchmarks may not always accurately reflect performance in specific use cases. Focusing on use case-specific evaluation ensures relevance and precision and is more resource-efficient.
Perplexity
We analyzed perplexity scores, which measure how well the models predict the next word in a sequence. Lower perplexity indicates higher confidence and accuracy.
Perplexity scores
Analysis: perplexity vs. benchmark scores
- LoRA has the highest TruthfulQA score (47.63) and also the lowest eval/loss (0.66094) and perplexity (1.9364). This suggests that a lower eval/loss correlates with better performance on the TruthfulQA benchmark.
- LoRA again has the highest HellaSwag score (0.581) and the lowest eval/loss. This further supports the observation that lower eval/loss is associated with better benchmark performance.
Comparing the perplexity across different experiments, we observed a consistent pattern where the order of performance follows: LoRA > DoRA > QLoRA > QDoRA. This pattern is also reflected in the benchmarks, indicating that for this experiment, this ordering reflects the effectiveness of each method in reducing model uncertainty and improving prediction accuracy.
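As a quick sanity check, perplexity is simply the exponential of the evaluation loss, which is easy to verify against the LoRA numbers above:

```python
# Perplexity = exp(evaluation loss); checked against LoRA's reported values.
import math

eval_loss = 0.66094
print(math.exp(eval_loss))  # ~1.937, matching the reported perplexity of 1.9364
```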
Key Takeaways
These experiments highlighted the essential trade-offs between memory efficiency, training speed, and model performance. LoRA and DoRA consistently demonstrated strong performance across metrics, making them the most reliable options. Their quantized counterparts, QLoRA and QDoRA, however, revealed overhead challenges in this resource-constrained environment.
Our findings also confirmed that while the V100 GPU is functional, it is significantly less efficient than the A100 GPU in terms of both time and cost. This efficiency gap emphasizes the importance of selecting the right hardware.
As developers move forward with fine-tuning small language models, we recommend starting with smaller datasets. This approach not only allows for better projections of time and costs but also makes it possible to scale up efficiently and confidently as models and resources are refined.
Furthermore, while benchmarks and perplexity metrics provide valuable insights, they may not always accurately reflect performance in specific use cases. Focusing on use case-specific evaluations ensures that your model is more relevant for your end-user.
The choice of technique should, therefore, be carefully aligned with the specific needs of the task, available resources, and desired outcomes.