- #Artificial Intelligence
• 35 min read
LLLMs: Local Large Language Models
This article discusses the evolution of Local Large Language Models (LLMs) and their potential applications. In recent years, neural network development has progressed significantly, enabling these models to be used for a variety of tasks. Foundation models, as referred to by the Center for Research on Foundation Models, are trained on broad data and can be fine-tuned for various tasks. The article reviews the LLaMA family of LLMs developed by Meta AI, the RedPajama and Alpaca variants, MPT family by MosaicML, and GPT-J by EleutherAI. It further elucidates concepts such as quantization which allows running these models on consumer-grade devices, discusses potential issues in running these models, and compares solutions like renting cloud power or building one's own server to calculate the cost of using a prompt. The article concludes with potential integration possibilities of these models.
Developing neural networks (NNs) is an ongoing process, and in the last 70+ years, we have seen a lot of progress.
The first and simplest NNs of the 1950s — the perceptrons — could only do some simple binary classification. As more complex architectures of neural networks were proposed, the technology applications have expanded to natural text processing and image recognition, like handwritten postal ZIP code numbers recognition (1989).
Around mid 2000s the long short-term memory networks could already tackle speech and handwriting recognition and text-to-speech synthesis. Another breakthrough came about 10 years later with autoencoders and generative adversarial networks, now able not only to classify input text or images but produce them too.
However, until 2017 and the transformer architecture that allowed the use of orders of magnitude larger datasets for NN training, the dominating approach was basically "one network, one application." Large language models based on the transformer architecture changed that, and now, instead of building a specialized tool (that can only do 1 task), we take a generic tool and adapt it to the current task (but the same tool can do more tasks).
In August 2021, the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) introduced the term "foundation model." This term refers to any model that undergoes broad data training, typically through self-supervision at scale, and can be adapted or fine-tuned for a wide range of downstream tasks. The concept of a foundation model encompasses its ability to serve as a versatile and adaptable tool across various applications.
As of writing this (summer 2023), many LLMs are fine-tuned to be instructional or conversational — primarily for human-computer interaction improvement. Fine-tuning is collecting task- or domain-specific data and using it to train the foundational model to make it more specialized for a task or allow for outputs of a particular format.
The weights of the model and its number of layers might change to incorporate this additional data.
For instance, ChatGPT is a fine-tuned chat model based on the foundational GPT-3.5 model and allows for prompting in the form of a dialogue. Instructional-tuned models allow to exploit the property of the LLMs of being a few-shot learner system for transfer learning: it is enough to show several input-output examples in the prompt, and the model will follow the desired output format for the next inputs.
Here are some examples of typical prompts to a foundation/chat/instructional models, left to right:
Most of the 2022 LLM race was "who has the largest model," with model sizes reaching hundreds of billions of parameters:
Things changed in early 2023 when Meta presented LLaMA — a much smaller LLM (1 order of magnitude less in the number of parameters compared to GPT-3 and later OpenAI models). Most importantly, LLaMA could run on a single GPU with comparable performance to GPT-3.
LLaMA official release was without the model weights and only included the code to run the model and reproduce experiments and examples. Anyone interested in getting model weights needed to join the waitlist. However, the weights were leaked on 4chan, and within a few days, tech enthusiasts posted tutorials on how to run LLaMA locally on a Windows PC, a M1 Mac, a Google Pixel smartphone, and a Raspberry Pi.
The need for a commercially-usable solution gave birth to several variants of local LLMs, which we briefly review below.
Meta AI develops the LLaMA family of LLMs. The key assumption behind this model was that more training data — and not a larger model in terms of number of parameters — is a key to better LLM performance. It proved correct, and LLaMA could outperform models almost x100 of its size.
Meta released the complete LLaMA 2 family of models with weights about 6 months later, and this version is available for free for research and commercial use. LLaMA 2 most closely matches GPT-3.5 if we compare the benchmark performance:
As the development team notes: "LLaMA 2 models are trained on 2 trillion tokens and have double the context length of LLaMA 1. LLaMA-2-chat models have additionally been trained on over 1 million new human annotations."
Apart from the double context size, the biggest change from LLaMA is a very permissive license of LLaMA 2 that enables commercial use immediately.
- USD 500 went towards synthetic conversational data generation with OpenAIs models, and
- USD 100 towards cloud GPU rental for the actual training
Basically, Alpaca to LLaMA 1 is what ChatGPT is to GPT-3.5.
In parallel to Alpaca development, a collaboration of several open-source AI community member organizations (Ontocord.ai, ETH DS3Lab, AAI CERC, Université de Montréal, MILA - Québec AI Institute, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group, and LAION) led by Together.ai successfully reproduced LLaMA from scratch.
The first major independent contribution was a 5TB dataset of over 1.2 trillion tokens, which others have used to train even more models like MPT, OpenLLaMA, or OpenAlpaca.
The collaboration also presented 6 models — the foundation and instruction/chat fine-tuned options in the sizes 3B and 7B parameters.
As the collaboration notes: "... the training was done on 3,072 V100 GPUs provided as part of the INCITE 2023 project on Scalable Foundation Models for Transferrable Generalist AI, awarded to MILA, LAION, and EleutherAI in fall 2022, with support from the Oak Ridge Leadership Computing Facility (OLCF) and INCITE program."
Together.ai also provides an interactive fine-tuning cost estimate calculator on their website, now also listing the LLaMA 2 models.
OpenLLaMA is an LLM open-sourcing effort from Berkeley AI Research that relies on the RedPajama dataset and also replicates the LLaMA training procedure.
The v1 models are trained on the RedPajama dataset, while the v2 models are trained on a combination of the Falcon refined-web dataset, the StarCoder dataset, and the RedPajama dataset's portions from Wikipedia, arXiv, books, and StackExchange. The datasets appear to consist of approximately 1 trillion tokens in size.
OpenLLaMA exhibits comparable performance to the original LLaMA and comes in 3B, 7B, and 13B sizes.
It is open source, available for commercial use, and MPT-7B matches the quality of LLaMA-7B and was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k.
The competitive advantage of this model is the large context size of 65k+ tokens, which, as of writing this, is not available within the latest GPT-4 model of OpenAI, which has 32k tokens context.
MPT is developed by MosaicML, an open-source startup with neural network expertise that has built a platform for organizations to train large language models and deploy generative AI tools based on them. The company said that its latest release, MPT-30B, "has showcased how organizations can quickly build and train their own state-of-the-art models using their data in a cost-effective way." The company was recently acquired by Databricks.
GPT-J (and Dolly)
It is developed by EleutherAI, a non-profit AI research lab that focuses on interpretability and alignment of large models.
The competitive advantage of this model is its size — with only 6B parameters, it is still competitive with much larger models like GPT-3-175B.
As Wikipedia puts it, "GPT-J-6B performs almost as well as the 6.7 billion parameter GPT-3 (Curie) on a variety of tasks... It even outperforms the 175 billion parameter GPT-3 (Davinci) on code generation tasks... With fine-tuning, it outperforms an untuned GPT-3 (Davinci) on a number of tasks."
In March 2023, Databricks released Dolly, an Apache-licensed, instruction-following model created by fine-tuning GPT-J on the Stanford Alpaca dataset.
|LLM||Model type||Model sizes||Training corpus size, tokens||License||Commercial application|
|LLaMA||Foundation||7B, 13B, 33B, 65B||1T||LLaMA LICENSE AGREEMENT||🚫|
|LLaMA-2||Foundation, Chat||7B, 13B, 70B||2T||LLaMA 2 License Agreement||✅|
|Alpaca||Chat||7B||-||LLaMA LICENSE AGREEMENT||🚫|
|RedPajama||Foundation, Chat||3B, 7B||1.2T||Apache 2.0||✅|
|OpenLLaMA||Foundation||3B, 7B, 13B||1T||Apache 2.0||✅|
|MPT||Foundation, Instruct, Chat, StoryWriter (up to 84k context)||7B, 30B||1T||Apache 2.0||✅ (except the Chat models)|
LLaMA 2, as of now, is one of the most powerful language models, available for free for research and commercial use. So, we take it as a baseline in our practical part and will compare it with a few open LLMs in terms of generation quality. We will also compare it with OpenAI's models for computation costs.
Before we proceed, let's review some crucial concepts for the practical use of such models on consumer-grade electronics.
It is important to remember that under the hood, it is all huge matrix multiplication, so we care most about the efficiency of basic arithmetic operations and memory space to store all those numbers.
The size of a model is determined by the number of its parameters and their precision, typically represented as
To calculate the model size in bytes, we multiply the number of parameters by the chosen precision's size in bytes.
For instance, with the
bfloat16 version of the BLOOM-176B model, we have 176 billion parameters, resulting in
Quantization in machine learning refers to the conversion of data from floating point 16 or 32 bits to a lower precision format like integer 8 bit. This process involves performing critical operations, such as Convolution, in integer precision, and then converting the lower precision output back to higher precision in floating point representation. By optimizing the precision of the data, quantization helps improve efficiency and performance in ML models.
Utilizing lower-bit quantized data minimizes data movement, both on-chip and off-chip, leading to reduced memory bandwidth and significant energy savings. By employing lower-precision mathematical operations, such as an 8-bit integer multiply instead of a 32-bit floating point multiply, energy consumption is reduced, and compute efficiency is increased, resulting in lower power consumption. Additionally, reducing the bit representation of the neural network's parameters results in decreased memory storage requirements.
Modern quantization methods allow to minimize LLM output quality degradation while allowing to run them on consumer-grade devices. In some cases (like below), we can do so without using the parallel processing capabilities of GPUs and vRAM, and relying only on CPU and RAM.
LLM playground options
Apart from the Huggingface hub, there are some dedicated web playgrounds to test out open LLMs:
There are several ways to run LLaMA (and other) models with zero or minimal configuration, just out-of-the-box:
And there are several options for a minimal build-and-run workflow:
- ollama if you are interested in a web interface to run models from LLaMA family
- text-generation-webui, openplayground are web interfaces that support building and running different model backends
- llama.cpp and its forked/related setups, to run it interactively in the Terminal
- pyllama hacked version of LLaMA based on original Facebook's implementation but more convenient to run on a single consumer grade GPU
- alpaca-lora if you want to build and run instruct-tuned LLaMA on consumer hardware
Local CPU inference
Here is a brief summary of comparing generation quality for a few open LLLMs that we did:
|LLaMA||7B||llama.cpp||CPU||int4||Apple M1 Pro 16G RAM|
|LLaMA 2||7B||llama.cpp||CPU||int4, f16||Apple M1 Pro 16G RAM|
|LLaMA 2||70B||llama.cpp||CPU, GPU||int4||Apple M1 Pro 16G RAM, Apple M2 Pro 32G RAM|
|RedPajama-INCITE-Base||7B||redpajama.cpp||CPU||int4||Apple M1 Pro 16G RAM|
|MPT||7B||ggml||CPU||int4||Apple M1 Pro 16G RAM|
|GPT-J||6B||ggml||CPU||int4, f16||Apple M1 Pro 16G RAM|
This local zoo takes 225G of the disk space if for each model you store the full and the quantized model versions.
You can also investigate
ggml repository from the same author.
It's a tensor library for machine learning that written in C and optimized for Apple Silicon, and a foundation for llama.cpp.
So, let's finally run some models?
MacBook Pro M1 16GB, macOS Ventura
LLaMA 2, 7B
LLaMA 2, 7B, F16
MacBook Pro M1 16GB, macOS Ventura
MacBook Pro M1 16GB, macOS entura
MacBook Pro M1 16GB, macOS Ventura
GPT-J 6B F16
CPU+RAM VS GPU+Metal: LLaMA 2 70B
Now let's switch to the MacBook Pro M2 Max with 32 GB RAM and 12 total cores (8 performance and 4 efficiency). If we run LLaMA 2 70B like other models, we get the error of failing to load the model. Below, we can see the Terminal window with the main execution command to run the LLaMA 2 70B model and the error as a result.
To fix this error and run model we need to add
-gqa 8 parameter to main execution command.
If you want to know more about it, you can dive deep into this issue on GitHub.
After updating our main execution command, we can see that Llama 2 70B is working. You can see below the corresponding screenshot.
We want to know which option is better for a local device to run Llama 2 70B: CPU+RAM or GPU+Metal. See the videos below for each configuration.
Llama 2 70B running using CPU+RAM
The original video was 22 minutes. We speed up it to 20x faster.
Llama 2 70B running using GPU+Metal
The original video was 25 minutes. We speed up it to 20x faster.
As you can see, using LLaMA 2 70B on MacBook Pro M2 Max is better with GPU+Metal configuration.
We will compare LLaMA 2 and OpenAI model performance using three prompts with generation, translation, and summary actions.
Generate a list of top technology trends for the next 5 years for a software development company.
Llama 2 7B
Llama 2 7B Chat
Llama 2 13B
Llama 2 13B Chat
Take this text "In Technological R&D we generate new knowledge and out-of-the-box ideas that the company can use to create new tech, products, or approaches or improve existing initiatives." and translate it into Ukrainian.
Llama 2 7B
Llama 2 7B Chat
Llama 2 13B
Llama 2 13B Chat
Summarize the following text "We’re working hard to be in the avant-garde of software technology and to be aware of the future. We don't limit ourselves by borders of any sort, not even our imagination. We want to see beyond our regular domain. When doing research, we don’t know what we may find in the end. Will it be a breakthrough or a disappointment? We never know. Thus, we’re also learning to fail and accept our failures. These failures then form the deep-rooted substratum that yields our future discoveries. While doing our work, we aspire to foster MacPaw’s innovation culture. Our insights can serve as proof of concept for our friends’ unconventional ideas to be later implemented in our products."
Llama 2 7B
Llama 2 7B Chat
Llama 2 13B
Llama 2 13B Chat
How much is the prompt?
It's exciting to calculate how much is the prompt cost. We must clarify that we will calculate the cost of 1K output tokens. In general, 1K tokens is about 750 words in English. We will compare a SaaS solution like using OpenAI API with one based on maintaining local LLMs. So, let's begin!
Reference: OpenAI API
We will use as a baseline the GPT-3.5 Turbo model because it has a similar context to LLaMA 2 models. GPT-4 model is much more powerful than others, and the comparison will not be adequate. You can find actual pricing for different OpenAI models here.
See below the current pricing for the GPT-3.5 Turbo model:
|4K context||0.0015 USD / 1K tokens||0.002 USD / 1K tokens|
|16K context||0.003 USD / 1K tokens||0.004 USD / 1K tokens|
For our investigation we used price for tokens on output:
|4K context||0.002 USD / 1K tokens|
Scenario 1: renting cloud power
Run a local LLM in the cloud.
We will use a M1 MacBook Pro with 16GB RAM as a proxy to run the model and ensure that it works correctly Then, we will find similar device configuration for renting cloud power and calculate cost of 1K tokens.
We have only 1 query at a time, nothing is parallelized.
We run LLaMA 2 7B with 4K context on CPU using RAM with llama.cpp.
You can see below resource use and generation speed:
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 4096 llama_print_timings: load time = 7529,09 ms llama_print_timings: sample time = 672,13 ms / 866 runs (0,78 ms per token, 1288,43 tokens per second) llama_print_timings: prompt eval time = 1878,48 ms / 15 tokens (125,23 ms per token, 7,99 tokens per second) llama_print_timings: eval time = 30151,76 ms / 865 runs (34,86 ms per token, 28,69 tokens per second) llama_print_timings: total time = 32795,35 ms
While running, the model used 12 GB RAM. Let’s see how much it will cost to rent a GPU with similar performance. We will use this resource for GPU Cloud Server Comparison. This website contains a table with different cloud server configurations and prices. For example:
A similar device configuration for the reference M1 MacBook Pro is AWS with K80 (12 GB) GPU Type. It consists of 1 GPU with 12 GB GPU RAM. Let's calculate the cost of 1K tokens.
There are 3600 seconds in an hour. Using our cloud server AWS K80 (12 GB) with 1 GPU (12 GB) costs 0.90 USD/3600s.
We will use 28 tokens per second as a default speed. Let's calculate how many tokens our model can produce per hour: 28 * 3600 = 100 800 tokens per hour (disregard prompt, output, any other time).
So, our model can produce 100800 t/hour, and renting cloud power is 0.90 USD/hour. Let's find out how much it costs to generate 1000 tokens:
Price of 1000 tokens = 0.90 USD/hour * 1000 t / (100800 t/hour) = 0.008928571429 USD.
Remember that we considered a LLaMA 2 7B model. We assume that running a larger Llama 2 70B model will not be cheaper, so our numbers serve as a lower bound on the cost.
To sum up, renting the cloud for the Llama 2 7B model is x4 more expensive than the OpenAI Saas solution, which costs $0.002 / 1K tokens.
Scenario 2: building our own server
Run a LLM on our own server.
First of all, we need to understand that the significant cost components for this scenario are:
- electricity consumption
- hardware (the graphic processor)
We want to build our own server and find a device configuration to run the LLaMA 2 70B model. Then, we will calculate the cost of 1K tokens.
- We have only 1 query at a time; nothing is parallelized;
- We disregard other costs (other hardware, maintenance, human ML expertise, web access);
- We consider running on GPU — so not through llama.cpp on CPU.
After surveying the practitioner experiences on thematic forums, we have found this post specifying that to want to run LLaMA 2 70B locally on a server, we need to use NVIDIA GeForce RTX 3090 (4090) graphic processors with GPU RAM 24 GB. The average price for it is 2000 USD. But we need two graphic processors at least to fit the model, so GPU hardware costs rise to about 4000 USD.
Let's find out how much it costs to generate 1000 tokens using our own server. In the same above post, the author said about the token generation speed: "Both cards are 24gb, I'm using 14gb on first card and 20 on second, and getting 9t/s".
We want to calculate how much time our server would need to generate 1K tokens. For simplicity, let's round numbers up as 9t/s ~ 10t/s, and then 1000t/100t/s = 100s ~ 2 minutes. Our server can generate 1K tokens per 2 minutes.
Next, electricity consumption for NVIDIA GeForce RTX 3090 is 360 W-h/h. Remember that our server consists of two graphic processors at least. So, total electricity consumption is 2*360 W-h/h. To generate 1K tokens, we will consume 0,72 kW-h/ 3600 s * 100 s = 0,02 kW-h.
What about electricity prices? We take average Kyiv region price 5000 UAH / MW-h or 5 UAH / kW-h. Let's convert it to USD. Assuming the exchange rate as 1 USD = 36.90 UAH, the electricity price is 0,136 USD / kW-h.
In this case, generating 1K tokens will cost 0,02 kW-h * 0,136 USD/kW-h = 0,00272 USD.
Solution price comparison
Let's sum up and put the results in a table:
|Solution||Model||Price of 1K tokens|
|OpenAI||GPT-3.5 Turbo||0.002 USD|
|Renting cloud power||LLaMa 2 7B||0.0089+ USD|
|Building own server||LLaMa 2 70B||0.0027+ USD|
We would like to remind you that it is correct to compare Meta’s LLaMA 2 70B with OpenAI’s GPT-3.5 Turbo. Meta’s LLaMA 2 7B model is simpler and much smaller (but runs on a modern laptop), so comparing it with OpenAI’s GPT-4 is not correct.
We came to the conclusion that with the current pricing policies, using OpenAI is cheaper than renting cloud power and building our own server.
Local language models (LLMs) hold great promise, particularly those with commercial licenses. The future of this technology looks bright, especially with the news about the potential use of local LLMs on smartphone devices, and industry interest in developing dedicated chips.
Currently, there is a surge in interest in this tech, resulting in high demand for hardware, which significantly affects prices. OpenAI's current pricing policy makes it an appealing choice for the short and medium term. However, for long-term considerations such as data privacy, local LLMs may soon become a more attractive and competitive option.
Any product that uses the OpenAI API could in theory use a self-hosted instance.
Large Language Models 101: History, Evolution and Future (image 1 source)
Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901. arXiv preprint
Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." (2023) arXiv preprint
Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." (2023) arXiv preprint