Smiling Llama

Aug 01, 2023

Llama 2: run Meta’s Open Source Large Language Model for free on IPUs

Written By:

Tim Santos and Arsalan Uddin


Llama-2 is the latest open-source Large Language Model (LLM) from Meta. It has been described as a game-changer for adoption and commercialisation of LLMs because of its comparable performance with much larger models and its permissive open-source license that allows its use and distribution in commercial applications. 

Now, Graphcore is making the model available to run – for free – using Paperspace Gradient Notebooks.

What is Llama-2?

Llama 2 is a large language model that offers significant improvements over its predecessor, which was released in February 2023. While the original Llama was only licensed for research use, Llama-2 is free to use, even for commercial applications, under a new open-source licence. 

The latest version of Llama is trained on 40% more data and offers, according to Meta’s accompanying paper, performance on par with proprietary models such as ChatGPT and PaLM, and better than some of the most widely hailed open source LLMs.  

The Llama-2 model repository offers a variety of model sizes, namely 7B, 13B, and 70B parameters, which makes deployment of applications viable without having to spend a fortune on infrastructure. There are also accompanying variants, fine-tuned for chat-type interaction, using human feedback.

How can I use Llama-2 for free on IPUs?

At Graphcore we believe in the importance of open-source models in realising the transformative power of AI. They fast-track innovation and breakthroughs, allowing users to build cutting-edge products and services with the minimal burden of starting from scratch or relying on costly proprietary models.

We are committed to bringing the latest, most exciting models to Graphcore IPU users as soon as possible, so we’re delighted to make the 7B and 13B parameter versions of Llama-2 available on IPUs. 

The accompanying notebook guides you through creating and configuring an inference pipeline on Graphcore IPUs using a Paperspace Gradient notebook which pre-packages libraries and prerequisite files so you can get started with Llama-2 easily.  

New users can try out Llama-2 on a free IPU-Pod4, with Paperspace’s six-hour free trial. For a higher performance implementation, you can scale up to an IPU-Pod16.

 

1. Request access to the model weights

To download and use the pre-trained Llama-2 base model and fine-tuned Llama-2-chat checkpoints, you will need to authenticate with the Hugging Face Hub and create a read access token on the Hugging Face website. You will need this access token when prompted after executing the following cell:


from huggingface_hub import notebook_login
notebook_login()


Llama-2 is open-sourced under the Meta license, which you’ll need to accept to get access to the model weights and tokenizer. The models on the Hugging Face Hub are also gated, so you will need to request access through the model cards (see llama-2-7b-chat, llama-2-13b-chat).

2. Select Model Weights

Meta released 7B, 13B, and 70B parameter versions of the model. Llama-2-7B and Llama-2-13B fit in our Paperspace free tier environment, using a Graphcore IPU-Pod4 system. Users can also scale up to a paid IPU-Pod16 system for faster inference.

Screengrab 1
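In the notebook, this step amounts to picking a Hugging Face checkpoint name for the pipeline to load. As a rough sketch (the variable name below is illustrative, not the notebook’s exact code), selecting the 7B chat checkpoint could look like this:

# Illustrative checkpoint selection; the variable name is an assumption.
checkpoint_name = "meta-llama/Llama-2-7b-chat-hf"     # fits on a free IPU-Pod4
# checkpoint_name = "meta-llama/Llama-2-13b-chat-hf"  # scale up to an IPU-Pod16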

3. Create Inference Pipeline


Our next goal is to set up the inference pipeline. We'll specify the maximum sequence length and the micro batch size. Before a model can be deployed on IPUs, it needs to be translated into an executable format through a compilation process.

This step takes place when the pipeline is created. It is crucial that all input dimensions are defined prior to this compilation. We have supplied the necessary configurations allowing you to run everything seamlessly. This compilation step could take 15-30 minutes depending on the size of the model, but we also provide a pre-compiled executable in the cache to bring this step down to around a minute. 

Selecting a longer sequence length or larger batch size will use more IPU memory. This means that increasing one may require you to decrease the other. As your next task, try enhancing the system's efficiency by adjusting these hyperparameters.

Remember, if you make changes, the pipeline will require recompilation which could take between 10-30 minutes depending on the size of the model. 

If you would like to try another model with a different number of parameters, select a different checkpoint in the previous cell and then re-run this step to load the correct model to the IPUs.
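Putting this step together, pipeline creation looks roughly like the sketch below. The LlamaPipeline class name and the exact argument names are assumptions for illustration; the notebook supplies the real configuration and API.

# Hypothetical sketch of pipeline creation; class and argument names are
# assumptions, not the notebook's exact API.
sequence_length = 1024   # longer sequences use more IPU memory
micro_batch_size = 1     # larger batches also use more IPU memory

llama_pipeline = LlamaPipeline(
    checkpoint_name,                     # e.g. "meta-llama/Llama-2-7b-chat-hf"
    sequence_length=sequence_length,
    micro_batch_size=micro_batch_size,
)
# Construction triggers IPU compilation: 15-30 minutes from scratch,
# or around a minute when the pre-compiled executable cache is available.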

4. Run Inference

Call the llama_pipeline object you have just created to generate text from a prompt. 

Screengrab 2

There are some optional parameters to the pipeline call you can use to control the generation behaviour:

temperature – Indicates whether you want more or less creative output. A value of 1.0 corresponds to the model's default behaviour. Temperatures greater than 1.0 flatten the next token distribution making more unusual next tokens more likely. The temperature must be zero or positive. 

k – Restricts sampling to the k most probable tokens. This is known as "top k" sampling. Set to 0 to disable top k sampling and sample from all possible tokens. The value for k must be between a minimum of 0 and a maximum of config.model.embedding.vocab_size, which is 32000. The default is 5. 

output_length – Sets a maximum output length in tokens. Generation normally stops when the model generates its end_key text, but can be made to stop earlier by specifying this option. A value of None disables the limit and is the default. 

Screengrab 3

You can start with a different user input and play around with the optional parameters. For instance, let’s use the prompt “How do I get help if I am stuck on a deserted island?”, set temperature=0.4, and set top-K sampling to 10.
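As a sketch of that call, assuming the pipeline accepts the keyword arguments described above:

# Hedged sketch: argument names follow the descriptions above.
answer = llama_pipeline(
    "How do I get help if I am stuck on a deserted island?",
    temperature=0.4,  # lower temperature, less random output
    k=10,             # sample only from the 10 most probable next tokens
)
print(answer)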

Prompting Llama 2

One of the advantages of using your own open source LLM is the ability to control the system prompt which is inaccessible in models served behind APIs and chat applications. This lets you pre-define the behaviour of your generative text agent, inject personality, and provide a more streamlined experience for an end-customer or a client application. 

To see the full system prompt and format, you can call the last_instruction_prompt attribute on the pipeline. 

Let us look at the default prompt format as described in this Llama 2 prompting guide.

Screengrab 4
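To inspect what is actually sent to the model, and as a reminder of the template described in Meta’s prompting guide, the default Llama-2-chat format wraps the system and user messages roughly as follows (shown here as an illustrative sketch):

# Print the full prompt used for the most recent generation.
print(llama_pipeline.last_instruction_prompt)

# The default chat template has the shape:
#   <s>[INST] <<SYS>>
#   {system_prompt}
#   <</SYS>>
#
#   {user_message} [/INST]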

Running Llama-2-chat on non-Paperspace IPU environments

To run the demo using IPU hardware other than in Paperspace, you can get the code from this repository.

Refer to the Getting Started guide for your system for details on how to enable the Poplar SDK. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

Is Llama-2 right for me?

Llama-2 is a very powerful model for building your own generative text and chat applications. It offers very competitive performance and a permissive license for research and commercialisation. Llama-2's strong performance, combined with its relatively small memory footprint, makes it a viable and cost-effective model for wide adoption and deployment in production.  

It is easy to access on IPUs on Paperspace and it can be expanded and deployed to handle a variety of applications. 

You can contact Graphcore to discuss building and scaling applications using Llama in the cloud, or anything else.