Machine Learning

How to deploy chat completion model over EC2?

April 23, 2024

Looking for chat completion model over EC2 ?

The Chat Completion model revolutionizes conversational experiences by proficiently generating responses derived from given contexts and inquiries. This innovative system harnesses the power of the Mistral-7B-Instruct-v0.2 model, renowned for its sophisticated natural language processing capabilities. The model can be accessed via Hugging Face at – https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.

Operating on a dedicated GPU server g4dn.2xlarge, self-deployed on the robust infrastructure of Amazon Web Services (AWS), the Chat Completion model ensures optimal performance and scalability. This choice of hardware configuration allows for efficient parallel processing, enabling swift response generation and seamless integration with existing chat platforms.

At its essence, the Chat Completion model’s main objective is to furnish users with accurate and contextually relevant responses. With a strict adherence to a word limit of 500, the model ensures that its outputs are concise yet informative, facilitating fluid and engaging conversational exchanges.

By streamlining chat interactions across diverse platforms and applications, the Chat Completion model serves as a catalyst for enhancing communication efficiency and user satisfaction. Whether deployed in customer support systems, virtual assistants, or social media messaging platforms, its ability to deliver timely and pertinent responses enriches the overall conversational experience for both users and businesses alike.

Step by Step Procedure to Install and Configure the chat completion model on EC2

Prerequisites

Ubuntu OS

Hardware requirements:

1 x NVIDIA T4 GPU with 16 GB GPU memory
8 vCPU
32 GB RAM
Disk space – 150GB

Chat Completion Model Server Installation

1. Set Up Your AWS Infrastructure Using EC2

Navigate to the EC2 service on Amazon Web Services (AWS) and click on launch instance.

2. Select the Right AMI Image

Choose a name for your instance and choose the Ubuntu AMI. Under Ubuntu, pick the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230926 image, which comes with Ubuntu, Nvidia drivers, Cuda, and PyTorch already included.

3. Configure Your Instance Type

Ensure optimal performance by selecting the g4dn.2xlarge instance type.

4. Deploy your EC2

Deploy your EC2 instance by clicking on the “Launch Instance” button.

For logging into the Putty server, the following details are required:

Server IP: [insert server IP here]
Auto-login username: [insert auto-login username here]
Key for login to server: [insert key for login to server here]

6. Check NVIDIA Configuration

Verify proper GPU resource allocation by inspecting the NVIDIA configuration with the command nvidia-smi.

7. Manually Install requirements

To prepare your environment for the chat completion model, manually install the required packages using the following command –

“pip install git+https://github.com/huggingface/transformers torch bitsandbytes accelerate”

8. Prepare Python File

Set up a Python file containing the necessary code for interacting with the Mistral 7B , including library imports and and defining messages for the conversation.

				
					from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig 
import torch 
device = “cuda” # the device to load the model onto

quantization_config = BitsAndBytesConfig(load_in_4bit=True, 
   bnb_4bit_quant_type=“nf4”, 
   bnb_4bit_compute_dtype=torch.float16, 
           ) 
model = AutoModelForCausalLM.from_pretrained(“mistralai/Mistral-7B-Instruct-v0.1”,quantization_config=quantization_config, device_map=“auto”) 
tokenizer = AutoTokenizer.from_pretrained(“mistralai/Mistral-7B-Instruct-v0.1”) 
messages = [ 
{“role”: “user”, “content”: “What is your favourite condiment?”} 

] 
encodeds = tokenizer.apply_chat_template(messages, return_tensors=“pt”) 
model_inputs = encodeds.to(device) 
model.to(device) 
generated_ids = model.generate(model_inputs, max_new_tokens=500,   do_sample=True,pad_token_id=tokenizer.eos_token_id,eos_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

9. Run the Python File

Execute the code with the command python3 your_file_name.py to generate response.

In conclusion, the chat completion model can be seamlessly accessed through a REST API by configuring an endpoint for inputting text within the request body. Once the API is set up, containerization via a Dockerfile simplifies deployment, ensuring smooth access to the model. By creating an image and launching a container on the server, Docker eliminates the need for individual installations, enhancing performance, scalability, and ease of deployment. This streamlined approach not only optimizes communication but also fosters more efficient and effective interactions within chat-based applications.

Table of Contents