
Running LLaMA on a MacBook

Run your own GPT-style model with Facebook's LLaMA on a MacBook Pro M1

In a previous post I explained how you can get started with the LLaMA models from Facebook. But how do we actually run these models? In this article I will give a brief introduction to running them on a MacBook!

Note: I am a huge fan of Macs for this, as their unified memory lets the GPU address a large amount of memory (e.g., up to 128 GB on an M1 Ultra).
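
If you are not sure how much unified memory your own Mac has before picking a model size, you can check it from the terminal (macOS reports physical memory in bytes via sysctl):

# Print the total unified memory in GB
echo "$(($(sysctl -n hw.memsize) / 1024 / 1024 / 1024)) GB"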

Getting Started

As speed is of the essence here, we will look at code that has been optimized to run these models. Luckily for us, Georgi Gerganov did just that!

Long live open-source! Contribute to those that make an impact.
GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++ (https://github.com/ggerganov/llama.cpp)

So go ahead and clone this repository:

git clone https://github.com/ggerganov/llama.cpp.git  

In this repository we have a models/ folder where we put the respective models that we downloaded earlier:

models/
  tokenizer_checklist.chk 
  tokenizer.model
  7B/
  13B/
  30B/
  65B/
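
If you followed along with the previous post, you can now copy the downloaded weights into this folder. The source path below is just an illustration; adjust it to wherever you stored the download:

# Illustrative paths: point these at wherever you downloaded the LLaMA weights
cp ~/Downloads/LLaMA/tokenizer.model ~/Downloads/LLaMA/tokenizer_checklist.chk models/
cp -r ~/Downloads/LLaMA/7B models/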

Then compile the code so it is ready for use and install the Python dependencies:

# Compile the code
cd llama.cpp
make

# Install Python dependencies
pip install torch numpy sentencepiece
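
If the build succeeded, the main and quantize binaries used below should now be sitting in the repository root:

# Quick sanity check that the build produced the binaries we need
ls -l main quantize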

Then optimize the model we want to use. Two optimization steps are applied:

  • Conversion to GGML FP16
  • Quantization to 4 bits

# Convert to GGML FP16
python convert-pth-to-ggml.py models/7B/ 1

# Quantize to 4 bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
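
If you downloaded more than one model size, a small loop along these lines (a sketch, assuming the directory layout shown earlier) converts and quantizes each of them in turn:

# Sketch: convert and quantize every model size you have downloaded
for size in 7B 13B; do
  python convert-pth-to-ggml.py models/$size/ 1
  ./quantize ./models/$size/ggml-model-f16.bin ./models/$size/ggml-model-q4_0.bin 2
done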

Everything is now ready to go, so let's try it!

Running an Example

To run an example, we can simply run (for the 7B model):

./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 512 -p "Question"
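
The flags are standard llama.cpp options: -m points at the model file, -t sets the number of CPU threads, -n caps the number of tokens to generate and -p is the prompt to complete. Running the 13B model only needs a different model path, for example:

# Same command, but pointing at the quantized 13B model
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 512 -p "Question"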


Comparing Models

The models have now been quantized to lower the memory requirements (4 bits per weight instead of 16, roughly a 4x reduction: the 7B model shrinks from about 13 GB in FP16 to the 4 GB shown below). Let's look at the memory required for each model and a sample of the output we get for several inputs.

Example: "How many days are there in a year?"

Prompt (the --temp flag is set to 0.1 or 0.9 for the runs below)

./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -p "How many days are there in a year in one sentence:" --temp 0.9

Temperature 0.1

7B  (Memory: 4 GB, load time: 2660.61 ms, 136.63 ms per token)
    Output: The number of days between the 1st January and December 31 is always going to be an odd integer.

13B (Memory: 8 GB, load time: 6089.64 ms, 255.42 ms per token)
    Output: There is 365.24219870 days per year on average, but it can be more or less than that depending upon the position of earth around sun and moon. What does the word “per se” mean?

30B (Memory: 19.5 GB) - not run, requires a Mac with more memory
65B (Memory: 38.5 GB) - not run, requires a Mac with more memory

Temperature 0.9

7B  (Memory: 4 GB, load time: 2877 ms, 116 ms per token)
    Output: There is 365.24 day per years, with leap year of century being February's has twenty-nine (and that can also be the reason why all four season begins and ends at different time). Except for UTC without DST while Daylight Saving Time means Spring/Summer starts in March or April(of which that is always spring, when else it will make sense?), if we are talking about real world, July's starting can not be used as justification but the Mid-September (2018) begin of summer since Fall still lasted until Winter

13B (Memory: 8 GB, load time: 6089.64 ms, 255.42 ms per token)
    Output: How much is this coat? 3. ______ 4 feet long and can reach speeds of up to half the speed of sound.(in 1 word) a million b c d one hour e seven thousand two hundred twenty nine fice decades Question from The holly tree provided berries for many centuries, but now it is rarer.

30B (Memory: 19.5 GB) - not run, requires a Mac with more memory
65B (Memory: 38.5 GB) - not run, requires a Mac with more memory
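
For reference, the comparison above can be reproduced with a small loop (a sketch, assuming both the 7B and 13B models were converted and quantized as described earlier):

# Sketch: run the same prompt against each quantized model at both temperatures
for size in 7B 13B; do
  for temp in 0.1 0.9; do
    ./main -m ./models/$size/ggml-model-q4_0.bin -n 128 \
      -p "How many days are there in a year in one sentence:" --temp $temp
  done
done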

Summary

The above shows "some" results. They are not as good as they should be on the 7B and 13B models: the replies are too verbose and often incorrect. There is a small but noticeable improvement on the 13B model at a temperature of 0.1, which consistently answers 365 days, although with some spelling errors and extra completions.

I'm still trying to find a larger cluster, which should allow me to run the larger models. So stay tuned!