Apple is in no hurry to join the race for general-purpose chatbots and next-generation artificial intelligence, but it is working in that direction. In particular, Apple is exploring the possibility of embedding large language models directly into users' mobile devices.
Apple believes on-device processing will serve users better than online access. However, large language models are called large for a reason: they require substantial computing resources and large amounts of RAM.
The Cupertino company's idea is to store the language models in flash memory, whose capacity is an order of magnitude or two larger than that of RAM. A technique called windowing has the model reuse some of the data it has already processed; this reduces the need to constantly fetch data from memory and speeds up the whole process (see the sketch below). A second technique, Row-Column Packing, groups data more efficiently so the model can read it from flash memory faster, accelerating inference (also sketched below).
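To make the windowing idea concrete, here is a minimal Python sketch under some assumptions: the `load_row_from_flash` helper and the `WindowedWeightCache` class are illustrative names, not Apple's implementation, and the cache tracks which model weights were active for the last few tokens so that overlapping requests are served from RAM instead of triggering new flash reads.

```python
# Hypothetical flash reader: a real system would issue an actual
# flash/SSD read here; this placeholder just stands in for that cost.
def load_row_from_flash(neuron_id):
    return f"weights[{neuron_id}]"  # placeholder for a weight row

class WindowedWeightCache:
    """Keep weight rows used by the last `window` tokens in RAM, so
    data shared between neighbouring tokens is reused rather than
    re-read from flash."""

    def __init__(self, window=5):
        self.window = window
        self.history = []   # active-neuron sets, one per recent token
        self.cache = {}     # neuron_id -> weight row held in RAM

    def fetch(self, active_neurons):
        loaded_from_flash = 0
        for nid in active_neurons:
            if nid not in self.cache:          # only misses touch flash
                self.cache[nid] = load_row_from_flash(nid)
                loaded_from_flash += 1
        # Slide the window: evict rows no recent token still needs.
        self.history.append(set(active_neurons))
        if len(self.history) > self.window:
            expired = self.history.pop(0)
            still_needed = set().union(*self.history)
            for nid in expired - still_needed:
                del self.cache[nid]
        return loaded_from_flash

cache = WindowedWeightCache(window=2)
print(cache.fetch({1, 2, 3}))   # 3 flash reads (cold cache)
print(cache.fetch({2, 3, 4}))   # 1 flash read; rows 2 and 3 are reused
```

Consecutive tokens tend to need much of the same data, which is why reuse alone cuts flash traffic noticeably.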
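Likewise, a rough sketch of the packing idea. The matrix names `W_up` and `W_down` and the toy dimensions are assumptions for illustration: the point is that storing two pieces of data that are always used together side by side turns two scattered flash reads into one larger sequential read, which flash memory serves far more efficiently.

```python
import numpy as np

# Toy feed-forward block: d_model x d_ff (hypothetical sizes).
d_model, d_ff = 4, 8
W_up = np.random.randn(d_ff, d_model).astype(np.float32)
W_down = np.random.randn(d_model, d_ff).astype(np.float32)

# Pack: place row i of W_up next to the matching column of W_down,
# so one contiguous read returns everything unit i needs.
bundled = np.concatenate([W_up, W_down.T], axis=1)  # shape (d_ff, 2*d_model)

def read_bundle(i):
    """A single sequential read per unit instead of two scattered ones."""
    chunk = bundled[i]
    return chunk[:d_model], chunk[d_model:]

up, down = read_bundle(3)
assert np.allclose(up, W_up[3]) and np.allclose(down, W_down[:, 3])
```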
Together, these methods should make it possible to speed up the model by up to five times when running on the CPU and up to 25 times when running on the GPU.