Many open-source Large Language Models (LLMs) have been released recently. These powerful models have great potential for a wide range of applications. However, one major challenge is the limited hardware available for testing them. While a platform like Google Colab Pro lets you test models up to about 7B parameters, what are our options when we want to experiment with even larger models, such as 13B?
In this blog post we will see how to run the Llama 13B and OpenChat 13B models on a single GPU. Here we are using the Google Colab Pro GPU, a T4, with 25 GB of system RAM. Let's walk through it step by step.
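Before diving into the steps, it is worth confirming what hardware your runtime actually provides. The snippet below is an optional sanity check; it only assumes PyTorch and a CUDA runtime, both of which Colab Pro provides.
import torch

# Optional sanity check: bitsandbytes 4-bit loading requires a CUDA GPU.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; 4-bit loading with bitsandbytes will not work.")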
Step 1:
Install the requirements. You need to install accelerate, peft, and transformers from source, and make sure you have the latest version of the bitsandbytes library (0.39.0). We also install sentencepiece, which the tokenizers of these models need.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install sentencepiece
Step 2:
We use quantization in our approach, via the bitsandbytes integration in the Transformers library. This technique lets us quantize with different 4-bit options, such as NF4 (normalized float 4, which is the default) or pure FP4. With bitsandbytes 4-bit quantization, the weights are stored in 4 bits, while the computation can still be carried out in 16 or 32 bits; any combination of compute dtypes can be chosen, including float16, bfloat16, and float32.
To speed up matrix multiplication and training, we recommend using a 16-bit compute dtype (the default is torch.float32). The BitsAndBytesConfig class in Transformers gives you the flexibility to adjust these settings to your requirements.
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
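The config above only sets the compute dtype. If you want to be explicit about the other 4-bit options mentioned earlier, a fuller configuration could look like the sketch below; the NF4 and double-quantization choices are illustrative, not required.
# A more explicit variant of the configuration above. The specific choices
# (NF4 quantization, double quantization, bfloat16 compute) are examples.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # "nf4" (default) or "fp4"
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16   # computation runs in 16-bit while weights stay in 4-bit
)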
Step 3:
After adding the configuration, in this step we load the tokenizer and the model. Here we are using the OpenChat model, but you can use any 13B model available on the Hugging Face Hub.
If you want to use the Llama 13B model instead, simply change model_id to "openlm-research/open_llama_13b" and follow the same steps again (see the sketch after the loading code below).
model_id = "openchat/openchat_8192"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
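As mentioned above, switching to Open LLaMA 13B only requires a different model id. Here is a minimal sketch; use_fast=False and device_map="auto" are optional extras that tend to be safer choices for LLaMA-style checkpoints, not something the model strictly requires.
# Illustrative swap to the Open LLaMA 13B checkpoint; everything else stays the same.
llama_model_id = "openlm-research/open_llama_13b"
llama_tokenizer = AutoTokenizer.from_pretrained(llama_model_id, use_fast=False)
llama_model = AutoModelForCausalLM.from_pretrained(
    llama_model_id,
    quantization_config=quantization_config,
    device_map="auto"
)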
Step 4:
After loading the model, it's time to test it. You can provide any prompt of your choice and increase the max_new_tokens parameter to the number of tokens you want the model to generate.
text = "Q: What is the largest animal?\nA:"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model_bf16.generate(**inputs, max_new_tokens=35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output: the decoded text, i.e. the prompt followed by the model's generated answer.

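If you want longer or less deterministic completions, generate also accepts the usual sampling options; the values below are only example settings.
# Example with sampling enabled; temperature and top_p values are illustrative.
outputs = model_bf16.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))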
With this quantization technique, you can run any 13B model on a single GPU or in Google Colab Pro.
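As a final check, you can confirm how much GPU memory the 4-bit model actually occupies; get_memory_footprint is a standard Transformers helper, and the exact number will vary from model to model.
# Rough check of how much memory the quantized model occupies on the GPU.
print(f"Model memory footprint: {model_bf16.get_memory_footprint() / 1024**3:.2f} GB")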