Size, quality and performance of LLMs: keys to their optimization
October 23, 2024
LLMs, short for large language models, have become one of the most important pillars of artificial intelligence. When properly trained, these models are capable of many things, such as writing text documents, generating images, summarizing content from text, image or video analysis, and also working as chatbots and customer service agents.
The size of an LLM usually has a major effect on its quality, because it determines the number of parameters the model works with. For example, an LLM with 1 billion parameters generally produces lower-quality results than an LLM with 70 billion parameters, but the former consumes far fewer resources than the latter. This leads us to a necessary distinction between models that can run locally and others that are limited to the cloud.
Running an LLM locally has important advantages: we do not need an internet connection, and we also enjoy the higher level of security that local execution provides. However, we will need hardware that meets the requirements of that model; otherwise it may not work at all, or it may run so slowly that the user experience suffers. Cloud models free us from these hardware demands.
LLM Optimization via GPU
If you want to improve performance and enjoy the best experience running an LLM locally, it is ideal to have a very powerful GPU with hardware specialized for AI acceleration and a large amount of graphics memory. The larger the number of parameters, the higher the quality of the LLM, but also the more compute power and graphics memory the model will require to run.
The amount of graphics memory required is usually the limiting factor: in many cases it prevents the local execution of large language models with billions of parameters. Fortunately, there is a solution to this problem if we want to run an LLM locally on our PC or laptop, and it is called LM Studio.
Imagine you have a PC with an NVIDIA GeForce RTX graphics card with 8 GB of graphics memory and you want to run an LLM on your computer, but you do not have enough graphics memory to load it. With LM Studio it is possible to split the model into smaller blocks that correspond to its different layers. These blocks are loaded onto and released from the GPU as needed, and we can choose how many of them we want to process on the GPU, as in the sketch below.
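As an illustration of this layer-splitting idea, here is a minimal sketch using llama-cpp-python, which exposes the same layer-offloading mechanism that LM Studio builds on. The model file name and the number of offloaded layers are illustrative assumptions, not values from the article.

```python
# Minimal sketch of partial GPU offloading with llama-cpp-python.
# The model path and layer count below are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-2-27b-it-Q4_K_M.gguf",  # any quantized GGUF model file
    n_gpu_layers=20,   # how many layers to keep on the GPU; the rest stay in system RAM
    n_ctx=4096,        # context window size
)

# Generate a short completion to verify the model runs with partial offloading.
output = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```

Raising `n_gpu_layers` speeds up inference as long as the offloaded layers still fit in graphics memory; lowering it trades speed for a smaller VRAM footprint.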
Let’s look at an example. Imagine we want to load an LLM like Gemma 2, which has 27 billion parameters. At 4-bit quantization this model requires about 13.5 GB of graphics memory, a figure to which we would have to add 1 GB to 5 GB of possible overhead derived from its operation. In total we would need roughly 19 GB of graphics memory to fully accelerate this model, which would limit it to a GeForce RTX 4090.
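To make the arithmetic explicit, here is a quick back-of-the-envelope estimate of those figures; the overhead varies in practice, so the 5 GB used below is simply the upper end of the range mentioned above.

```python
# Rough VRAM estimate for the Gemma 2 27B example above.
params = 27e9          # 27 billion parameters
bytes_per_param = 0.5  # 4-bit quantization = 0.5 bytes per weight

weights_gb = params * bytes_per_param / 1e9  # ~13.5 GB for the weights alone
total_gb = weights_gb + 5                    # plus the upper end of the overhead range

print(f"Weights: {weights_gb:.1f} GB, total with overhead: ~{total_gb:.1f} GB")
```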
Well, thanks to the GPU offload feature available in LM Studio, we can send only a specific part of Gemma 2 27B to the GPU and accelerate just that part. So if we have a graphics card with only 8 GB of graphics memory, we can still use it to speed up part of the model and improve performance. In the attached tables, we can see the maximum offload ratio achievable with different LLMs and NVIDIA graphics cards, as well as the resulting performance improvement.
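Once a partially offloaded model is loaded, LM Studio can also serve it through an OpenAI-compatible local server, so existing client code can talk to it. The sketch below assumes the server is running on its default local port and that the model identifier matches what LM Studio reports; adjust both to your own setup.

```python
# Querying a model served locally by LM Studio through its OpenAI-compatible endpoint.
# The base_url port and the model name are assumptions; check them in LM Studio.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="gemma-2-27b-it",
    messages=[{"role": "user", "content": "Summarize GPU offloading in two sentences."}],
)
print(response.choices[0].message.content)
```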