Running LLaMA on a MacBook
In a previous post I explained how you can get started with the LLaMA models from Meta (Facebook). But how can we run these models? In this article I will give a brief introduction on how to run them on a MacBook!
Note: I am a huge fan of Macs here because Apple Silicon uses unified memory, so the GPU can address large amounts of RAM (e.g., an M1 Ultra can be configured with up to 128 GB).
Getting Started
As speed is of the essence here, we look at code that has been optimized to run these models. Luckily for us, Georgi Gerganov did just that with llama.cpp!
Long live open-source! Contribute to those that make an impact.
So go ahead and clone the repository:
git clone https://github.com/ggerganov/llama.cpp.git
In this repository there is a models/ folder where we put the respective models that we downloaded earlier:
models/
  tokenizer_checklist.chk
  tokenizer.model
  7B/
  13B/
  30B/
  65B/
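Before converting anything, it is worth a quick sanity check that the weights and tokenizer ended up where the conversion script expects them. A minimal sketch (the checkpoint filenames, consolidated.00.pth and params.json, are the ones shipped with the original LLaMA download):
# The tokenizer files sit directly under models/
ls models/tokenizer.model models/tokenizer_checklist.chk
# Each size folder should contain its checkpoint shard(s) and params.json
ls models/7B/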
Then compile the code so it is ready for use and install the Python dependencies:
# Compile the code
cd llama.cpp
make
# Install Python dependencies
pip install torch numpy sentencepiece
Finally, optimize the model we want to use. Two optimization steps are applied:
- Conversion to the GGML FP16 format
- Quantization to 4 bits
# Convert to GGML FP16
python convert-pth-to-ggml.py models/7B/ 1
# Quantizes to 4 bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
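If you want to confirm that the quantization actually shrank the model, a quick look at the file sizes on disk (paths as in the commands above) should show the q4_0 file at roughly a quarter of the size of the FP16 one:
# Compare the FP16 and 4-bit model files on disk
ls -lh ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin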
Everything is now ready to go, so let's try it out!
Running an Example
To run an example, we can simply run (for the 7B model):
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 512 -p "Question"
Running this prints the generated completion for the prompt, followed by llama.cpp's timing statistics (load time and milliseconds per token).
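For reference, here is the same command again with each flag annotated (the flag meanings as I understand them; double-check against ./main --help on your own build):
# -m : path to the (quantized) model file
# -t : number of CPU threads to use
# -n : number of tokens to predict
# -p : the prompt to complete
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 512 -p "Question"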
Comparing Models
Now the models have been quantized to run with lower memory requirements (4 bits per weight instead of 16, roughly a 4x reduction; as a back-of-envelope check, 7 billion weights at 4 bits is about 3.5 GB, which lines up with the ~4 GB figure below). Let's look at the required memory for each model and a sample of the output we get for several inputs.
Example: "How many days are there in a year?"
Prompt (the --temp value was varied between 0.1 and 0.9 for the runs below):
./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p "How many days are there in a year in one sentence:" --temp 0.9
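The two tables below compare the same prompt at temperatures 0.1 and 0.9. If you want to reproduce them, a small loop along these lines (just a sketch, limited to the two models that fit in my memory) saves some typing:
# Run the same prompt against the 7B and 13B models at two temperatures
for model in 7B 13B; do
  for temp in 0.1 0.9; do
    ./main -m ./models/$model/ggml-model-q4_0.bin -n 128 \
      -p "How many days are there in a year in one sentence:" --temp $temp
  done
done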
Temperature 0.1
| Model | Memory | Load time (ms) | ms per token | Output |
|---|---|---|---|---|
| 7B | 4 GB | 2660.61 | 136.63 | The number of days between the 1st January and December 31 is always going to be an odd integer. |
| 13B | 8 GB | 6089.64 | 255.42 | There is 365.24219870 days per year on average, but it can be more or less than that depending upon the position of earth around sun and moon. What does the word "per se" mean? |
| 30B | 19.5 GB | - | - | Not run (no Mac with enough memory available) |
| 65B | 38.5 GB | - | - | Not run (no Mac with enough memory available) |
Temperature 0.9
| Model | Memory | Load time (ms) | ms per token | Output |
|---|---|---|---|---|
| 7B | 4 GB | 2877 | 116 | There is 365.24 day per years, with leap year of century being February's has twenty-nine (and that can also be the reason why all four season begins and ends at different time). Except for UTC without DST while Daylight Saving Time means Spring/Summer starts in March or April(of which that is always spring, when else it will make sense?), if we are talking about real world, July's starting can not be used as justification but the Mid-September (2018) begin of summer since Fall still lasted until Winter |
| 13B | 8 GB | 6089.64 | 255.42 | How much is this coat? 3. ______ 4 feet long and can reach speeds of up to half the speed of sound.(in 1 word) a million b c d one hour e seven thousand two hundred twenty nine fice decades Question from The holly tree provided berries for many centuries, but now it is rarer. |
| 30B | 19.5 GB | - | - | Not run (no Mac with enough memory available) |
| 65B | 38.5 GB | - | - | Not run (no Mac with enough memory available) |
Summary
The above shows "some" results, and they are not as good as one might hope on the 7B and 13B models: the replies are too verbose and often incorrect. There is a small, noticeable improvement with the 13B model at a temperature of 0.1, which consistently answers with 365 days, albeit with some spelling errors and extra completions.
I'm still trying to find a larger cluster, which should allow me to run the larger models. So stay tuned!