In “Choosing the right LLM” we explored how to select an LLM based on model architecture and type. Depending on the use case, a specific architecture and type may be better suited than others, mainly in terms of output quality.
Beyond the quality of results, another major criterion when selecting an LLM is cost. In this article we delve into the cost considerations of selecting LLMs and explore the differences between running them in the cloud and on-premises, with special emphasis on CPU versus GPU and their effects on inference speed and latency.
The cost of inference
The choice between running LLMs in the cloud or on-premises can significantly impact the cost and performance of inference. Cloud providers such as Azure, AWS, and GCP, and model vendors such as Anthropic and OpenAI, offer managed services for Generative AI that simplify provisioning and running models. By leveraging cloud infrastructure, companies can quickly deploy LLMs and access a curated set of models for their tasks.
On the other hand, on-premises deployment provides more control over the infrastructure and may be preferred when dealing with sensitive data that requires strict data privacy measures. However, setting up and maintaining the infrastructure can be resource-intensive and may not be as cost-effective as utilizing cloud services.
We can find two main types of cost structures when working with LLMs: pay-as-you-go and cost of infrastructure.
Pay-as-you-go (Managed Service)
Cloud service providers typically offer pricing models for LLMs based on usage, which can include factors such as the number of inferences, model complexity, and data transfer. The pay-as-you-go approach allows businesses to scale their usage based on demand, making it a flexible and cost-effective option for many applications. However, it is crucial to carefully analyze the pricing structure of each cloud provider to avoid unexpected costs and ensure cost optimization.
Cloud LLMs (OpenAI API, Anthropic Claude API, etc.) are offered as managed services and are therefore accessed through REST APIs. Usage, and therefore cost, is generally based on the number of tokens. Tokens are pieces of words, and how text is split into them depends on the chosen tokenization scheme (a rule of thumb is to treat a token as about 4 characters of English text).
Figure 1. OpenAI Tokenizer – Example of text and corresponding tokens.
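The 4-characters-per-token rule of thumb can be turned into a quick estimator. This is only a rough sketch: the function name `estimate_tokens` is hypothetical, and a real tokenizer will return different counts depending on the text and the model.

```python
import math

# Rule of thumb from the text: ~4 characters per token for English.
# This is an approximation, not an actual tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Roughly estimate the token count of `text`, rounding up."""
    return math.ceil(len(text) / chars_per_token)


prompt = "Summarize the attached quarterly report in three bullet points."
print(f"~{estimate_tokens(prompt)} tokens for {len(prompt)} characters")
```

An estimate like this is useful for budgeting before a request is sent; for billing-grade numbers, the provider's own tokenizer is the reference.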
As an example, the OpenAI API charges one rate for input tokens and another rate for output tokens. The input and the generated output together make up a completion, and the cost of a request equals the cost of that completion. Anthropic’s Claude API follows the same pattern, with separate rates for input and output tokens.
On-Premises / Cloud Infrastructure
Open-source models, as well as some proprietary ones, may be free to use, but they require infrastructure to run. An effective, low-latency on-premises architecture would ideally run on GPUs. A rule of thumb at 16-bit precision is 2 bytes of VRAM per parameter, so a 7B model requires approximately 14GB of VRAM (quantization can drive this requirement down). Keep in mind that a single one of the popular A100 chips costs roughly $10,000, and a query/inference may take 5 seconds (with proper batching of LLM requests, throughput can be improved to at least 2.5 queries per second). These numbers oversimplify costs and latency, and can change rapidly based on chip availability, underlying architecture, and optimizations.
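The VRAM rule of thumb is simple arithmetic: parameters times bytes per parameter. The sketch below assumes it covers model weights only; activations, the KV cache, and framework overhead add to the real footprint. The function name and the precision table are illustrative, not part of any library.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Estimate VRAM for the weights, in GB.

    1 billion parameters at 1 byte each is roughly 1 GB, so the estimate
    is simply billions of parameters times bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("fp16", "int8", "int4"):
    print(f"7B model at {precision}: ~{estimate_vram_gb(7, precision):g} GB")
```

At 16-bit precision this reproduces the 14GB figure for a 7B model, and it shows how lower-precision formats shrink the requirement.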
When it comes to speed of inference and latency, the choice between CPU and GPU can be crucial. LLMs often require significant computational power, and GPUs excel in parallel processing, making them ideal for accelerating inference tasks. The use of GPUs can lead to faster results and reduced latency compared to CPU-based inference.
However, GPUs are generally more expensive to operate than CPUs, which can significantly impact the cost of inference, especially when dealing with large-scale language models and high-throughput applications. Balancing the performance benefits of GPUs with the cost implications is essential to optimize the overall efficiency of the system.
OpenAI API Cost Estimation
Although a managed service can help organizations quickly build LLM-powered applications, scaling usage to thousands or millions of requests can become a major expense. Let’s run a quick example using the OpenAI API.
For this example we will focus on a Retrieval Augmented Generation (RAG) use case, as zero-shot prompting may not be suited to enterprise applications that require proprietary information or data the model was not trained on.
The OpenAI API is priced by usage. At the time of this writing, the GPT-4 8K-context model costs $0.03 / 1K input tokens and $0.06 / 1K output tokens. As per the OpenAI documentation, pricing is calculated on completion requests, which means that the total cost of a request is based on the number of input tokens plus the number of output tokens returned by the API.
A small RAG request may contain 2,000 input tokens and 1,500 output tokens. This translates to a cost of $0.06 for the input and $0.09 for the output, making the completion cost $0.06 + $0.09 = $0.15. If you are processing a small number of transactions per day this may not add up to much, but the cost scales linearly with the number of requests.
Figure 2. OpenAI API cost estimation.
If the numbers shown are interpreted as the daily usage you can imagine how costs can quickly escalate when considering a yearly expense, and this only considers querying the LLM hosted by OpenAI!
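The per-request arithmetic and its linear scaling can be sketched in a few lines. The prices are the ones quoted in this article and may be out of date; the daily request volumes are hypothetical values chosen only to illustrate the scaling, not the figures from Figure 2.

```python
# Prices quoted in the article (USD per 1K tokens); check current pricing.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06


def completion_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request: input cost plus output cost."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (
        output_tokens / 1000
    ) * OUTPUT_PRICE_PER_1K


per_request = completion_cost(input_tokens=2000, output_tokens=1500)
print(f"Cost per request: ${per_request:.2f}")

# Hypothetical daily volumes, for illustration only.
for requests_per_day in (100, 1_000, 10_000):
    daily = per_request * requests_per_day
    print(
        f"{requests_per_day:>6,} requests/day: "
        f"${daily:,.2f} per day, ${daily * 365:,.2f} per year"
    )
```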
AWS Cost Estimation
Now let’s consider hosting the Falcon-40B model in AWS. Based on the rule of thumb given earlier, we need roughly 2x the number of parameters (in billions) in GB of VRAM, or at least 80GB. In practice, the Falcon-40B model requires approximately 90GB of GPU memory. The recommended instance is ml.g5.24xlarge, which provides 4x NVIDIA A10G GPUs with 24 GiB each, for a total of 96 GiB of GPU memory.
The ml.g5.24xlarge instance costs $7.09 per hour in the us-east-1 region, and one instance can serve 6 requests per second. At that rate a single instance can handle ~518k requests daily (6 × 86,400 seconds), so a load of 1 request per second (86,400 requests daily) is served easily.
Figure 3. Falcon-40B AWS single instance cost estimation.
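The single-instance figures follow from the hourly price and throughput quoted above. The sketch below assumes the instance runs 24 hours a day at the quoted price and sustains the quoted throughput; both are simplifications, and the function names are illustrative.

```python
# Figures quoted in the article for ml.g5.24xlarge in us-east-1.
HOURLY_PRICE_USD = 7.09
REQUESTS_PER_SECOND = 6
SECONDS_PER_DAY = 24 * 60 * 60


def daily_instance_cost(instances: int = 1) -> float:
    """Cost in USD of running `instances` around the clock for one day."""
    return HOURLY_PRICE_USD * 24 * instances


def daily_capacity(instances: int = 1) -> int:
    """Maximum requests per day at the quoted sustained throughput."""
    return REQUESTS_PER_SECOND * SECONDS_PER_DAY * instances


cost = daily_instance_cost()
capacity = daily_capacity()
print(f"Daily cost: ${cost:.2f}")
print(f"Daily capacity: {capacity:,} requests")
print(f"Cost per request at full load: ${cost / capacity:.6f}")
```

Note that the cost per request only approaches this minimum when the instance is fully utilized; at low traffic the same daily cost is spread over far fewer requests.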
In real-world settings, it is recommended to deploy at least 2 instances to provide redundancy and reliability. The second instance can be provisioned on demand, so that it runs only when needed, which still achieves high reliability while reducing costs significantly.
Optimization: Speed vs. Cost
When hosting an LLM, either on-premises or in the cloud, optimizing for specific use cases involves finding the right balance between speed and cost. In applications where low latency is critical, prioritizing GPU-based inference can lead to improved user experience and higher performance. However, this may come at a higher cost due to GPU usage fees.
For less time-sensitive tasks or cost-sensitive scenarios, CPU-based inference or on-premises deployment might be more suitable, as they can offer a more economical solution without sacrificing significant performance.
In enterprise settings, quantization techniques can play a vital role in optimizing cost and performance. Quantization involves reducing the precision of model weights and activations, which can significantly reduce the memory and computational requirements during inference. This optimization technique can lead to faster inference and lower operating costs, making it particularly beneficial for resource-constrained environments.
Example use cases for quantization include chatbots, language translation, and text summarization, where real-time performance is crucial, and cost optimization is essential to achieve scalability.
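To make the idea of reducing precision concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a handful of weights. It is a toy illustration in pure Python under simplifying assumptions (one scale for the whole list, made-up weight values); production libraries use more elaborate schemes.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration only.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each value now fits in 1 byte instead of 4 (fp32) or 2 (fp16),
# at the price of a small rounding error bounded by scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", quantized)
print("max rounding error:", max_error)
```

The trade-off is visible directly: storage per weight drops, while each restored value differs from the original by at most half a quantization step.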
Choosing the right LLM for a specific use case involves considering various factors beyond the quality of results. The cost of inference, latency, and hosting infrastructure all play a significant role in the decision-making process. Cloud deployment offers convenience and scalability but may come with higher costs, especially when utilizing GPUs for inference. On-premises deployment provides more control over data privacy but requires careful management of infrastructure and resources.
The examples in this article suggest that roughly 1,000 requests a day is a reasonable threshold when deciding between a managed service such as the OpenAI API and a self-hosted LLM. Managed-service costs scale linearly with usage, whereas the cost of a hosted LLM stays flat until the instance reaches capacity. This allows organizations to keep expenses under control while retaining some degree of control over the model and the data used to run inferences.
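The threshold can be checked with a back-of-the-envelope break-even calculation using the figures quoted earlier ($0.15 per managed request, $7.09 per hour for one hosted instance). This is a sketch under strong assumptions: it ignores engineering and operations effort, redundancy, and any difference in output quality between the models.

```python
# Figures quoted in the article; both are assumptions that may be outdated.
MANAGED_COST_PER_REQUEST = 0.15  # USD, 2,000 input + 1,500 output tokens
HOSTED_COST_PER_HOUR = 7.09      # USD, one ml.g5.24xlarge instance


def break_even_requests_per_day(instances: int = 1) -> float:
    """Daily request volume at which hosting costs the same as the API."""
    hosted_daily = HOSTED_COST_PER_HOUR * 24 * instances
    return hosted_daily / MANAGED_COST_PER_REQUEST


for instances in (1, 2):
    threshold = break_even_requests_per_day(instances)
    print(f"{instances} instance(s): break-even at ~{threshold:,.0f} requests/day")
```

With one instance the break-even lands a little above 1,000 requests a day, and with the recommended two instances it roughly doubles, which is consistent with treating 1,000 requests a day as the point where self-hosting starts to be worth evaluating.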
By carefully evaluating the trade-offs between quality, speed, data privacy, and cost, businesses can leverage LLMs effectively to achieve their objectives. Furthermore, optimization techniques such as quantization can further enhance performance and cost-efficiency in enterprise settings. As the landscape of LLMs continues to evolve, understanding these considerations will be crucial for successful and cost-effective implementations.
Rodrigo Vargas is a leader of Encora’s Data Science & Engineering Technology Practice. In addition, he oversees project delivery for the Central & South America division, guaranteeing customer satisfaction through best-in-class services.
Rodrigo has 20 years of experience in the IT industry, going from a passionate Software Engineer to an even more passionate, insightful, and disruptive leader in areas that span different technology specializations that include Enterprise Software Architecture, Digital Transformation, Systems Modernization, Product Development, IoT, AI/ML, and Data Science, among others.
Passionate about data, human behavior and how technology can help make our lives better, Rodrigo’s focus is on finding innovative, disruptive ways to better understand how data democratization can deliver best-in-class services that drive better decision making. Rodrigo has a BS degree in Computer Science from Tecnológico de Costa Rica and a MSc in Control Engineering from Tecnológico de Monterrey.