The role that NVIDIA plays in the evolution of artificial intelligence is difficult to quantify. Most of its solutions are aimed specifically at data centers and ultra-high-performance systems, used for a wide variety of purposes, from running chatbots to training complex estimation and prediction models of all kinds. In that professional arena, and even as other manufacturers try to design their own dedicated chips, NVIDIA reigns supreme today.
In our case, however, we are more interested in what the green giant can do to bring artificial intelligence closer to the end user (for professional solutions we have our friends at MuyComputerPRO). Although most of the AI-based services we use regularly still run in the cloud, more and more operations can be performed locally, thanks to the specialized AI computing power offered by components such as Intel Meteor Lake processors and NVIDIA GeForce RTX graphics cards. The most obvious examples are features like DLSS and apps like Broadcast, but the list keeps growing.
A few months ago, during Microsoft Build, Microsoft and NVIDIA announced an ambitious collaboration to bring AI computing to the end user. Today, at Microsoft Ignite, another key event on the Redmond calendar, we find a very important advance in that direction: one that promises to quadruple the performance of today's existing hardware with large language models (LLMs).
[Chart: tokens per second generated with Llama 2 on GeForce RTX 4060 and RTX 4090 GPUs]
In the chart above, you can see the number of tokens per second that systems with RTX 4060 and RTX 4090 GPUs can generate when running LLMs such as Meta's Llama 2. This jump in performance comes from the new version of TensorRT-LLM, a library we already told you about a few weeks ago and which, after initially being announced for data centers, has begun to make the jump to Windows as a result of the collaboration between Microsoft and NVIDIA, where it will play a key role in the deployment of local AI.
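To give a rough idea of what this looks like in practice, here is a minimal sketch of local inference using TensorRT-LLM's high-level Python API; the model name is illustrative, and the exact entry points vary between releases of the library:

```python
# Minimal sketch, assuming TensorRT-LLM's high-level Python API.
# The model name is illustrative; entry points vary between releases.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the model and runs it
# on the local GeForce RTX GPU.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What does TensorRT-LLM accelerate?"], params)
print(outputs[0].outputs[0].text)
```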
So, in the context of Microsoft Ignite, in addition to showing us such striking performance figures, NVIDIA made the following announcements:
- The new version of TensorRT-LLM adds support for additional LLMs, which can run on any GeForce RTX 30 or 40 series GPU with 8 GB of VRAM or more, even in laptops.
- TensorRT-LLM will also be compatible with the ChatGPT API, which will enable hundreds of developer projects and applications to run on TensorRT-LLM on RTX (see the sketch after this list).
- NVIDIA and Microsoft are launching DirectML improvements to accelerate Llama 2 and Stable Diffusion, giving developers more options for cross-vendor deployment (also sketched below).
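Since the announcement points to compatibility with the ChatGPT API, existing clients should in principle be able to target a local TensorRT-LLM backend simply by changing the endpoint they talk to. Here is a minimal sketch using the official `openai` Python client; the local URL, port, and model name are placeholders of my own, not values documented by NVIDIA:

```python
from openai import OpenAI

# Point the standard OpenAI client at a local, OpenAI-compatible
# TensorRT-LLM endpoint instead of api.openai.com.
# The URL and model name below are placeholders.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-locally",  # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="llama-2-13b-chat",
    messages=[{"role": "user", "content": "Summarize what TensorRT-LLM does."}],
)
print(response.choices[0].message.content)
```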
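As for DirectML, applications usually reach it through an execution provider such as the one shipped with ONNX Runtime, which is what makes the cross-vendor deployment mentioned above possible. Here is a minimal sketch, assuming a model already exported to ONNX (the file name is a placeholder) and the `onnxruntime-directml` package installed on Windows:

```python
import numpy as np
import onnxruntime as ort

# Request the DirectML execution provider, with a CPU fallback.
# "model.onnx" is a placeholder for any exported model.
session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

input_meta = session.get_inputs()[0]
# Replace dynamic dimensions (reported as strings) with 1 for a dummy run.
shape = [d if isinstance(d, int) else 1 for d in input_meta.shape]
dummy = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, {input_meta.name: dummy})
print("Output shapes:", [o.shape for o in outputs])
```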
Obviously, the chart above shows that NVIDIA's top-of-the-line GPU promises spectacular performance, but if one thing surprised me, and for the better, it is that TensorRT-LLM will allow the use of LLM-based solutions not only on the current GeForce RTX generation, but also on the previous one. This, of course, encourages game and application developers to adopt TensorRT-LLM in their projects, since the combined base of RTX 30 and RTX 40 users, desktop and mobile alike, is undoubtedly an extremely broad market that can benefit greatly from this technology.