Automate Marketing Initiatives with Salesforce Marketing Cloud Learn More
Software Development Team

How to deploy chat completion model over EC2?

Looking for chat completion model over EC2 ?

The Chat Completion model revolutionizes conversational experiences by proficiently generating responses derived from given contexts and inquiries. This innovative system harnesses the power of the Mistral-7B-Instruct-v0.2 model, renowned for its sophisticated natural language processing capabilities. The model can be accessed via Hugging Face at –

Operating on a dedicated GPU server g4dn.2xlarge, self-deployed on the robust infrastructure of Amazon Web Services (AWS), the Chat Completion model ensures optimal performance and scalability. This choice of hardware configuration allows for efficient parallel processing, enabling swift response generation and seamless integration with existing chat platforms.

At its essence, the Chat Completion model’s main objective is to furnish users with accurate and contextually relevant responses. With a strict adherence to a word limit of 500, the model ensures that its outputs are concise yet informative, facilitating fluid and engaging conversational exchanges. 

By streamlining chat interactions across diverse platforms and applications, the Chat Completion model serves as a catalyst for enhancing communication efficiency and user satisfaction. Whether deployed in customer support systems, virtual assistants, or social media messaging platforms, its ability to deliver timely and pertinent responses enriches the overall conversational experience for both users and businesses alike. 

Step by Step Procedure to Install and Configure the chat completion model on EC2


Ubuntu OS

Hardware requirements: 

  • 1 x NVIDIA T4 GPU with 16 GB GPU memory 
  • 8 vCPU
  • 32 GB RAM
  • Disk space – 150GB 

Chat Completion Model Server Installation

1. Set Up Your AWS Infrastructure Using EC2

Navigate to the EC2 service on Amazon Web Services (AWS) and  click on launch instance.

2. Select the Right AMI Image

Choose a name for your instance and choose the Ubuntu AMI. Under Ubuntu, pick the Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230926 image, which comes with Ubuntu, Nvidia drivers, Cuda, and PyTorch already included.

3. Configure Your Instance Type

Ensure optimal performance by selecting the g4dn.2xlarge instance type. 

4. Deploy your EC2

Deploy your EC2 instance by clicking on the “Launch Instance” button.

5. Login into Putty server

For logging into the Putty server, the following details are required: 

  • Server IP: [insert server IP here]
  • Auto-login username: [insert auto-login username here]
  • Key for login to server: [insert key for login to server here] 

6. Check NVIDIA Configuration

Verify proper GPU resource allocation by inspecting the NVIDIA configuration with the command nvidia-smi.


         image (2) (1) (1)


7. Manually Install requirements

To prepare your environment for the chat completion model, manually install the required packages using the following command – 

“pip install git+ torch bitsandbytes accelerate” 

8. Prepare Python File

Set up a Python file containing the necessary code for interacting with the Mistral 7B , including library imports and and defining messages for the conversation. 

					from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig 
import torch 
device = “cuda” # the device to load the model onto

quantization_config = BitsAndBytesConfig(load_in_4bit=True, 
model = AutoModelForCausalLM.from_pretrained(“mistralai/Mistral-7B-Instruct-v0.1”,quantization_config=quantization_config, device_map=“auto”) 
tokenizer = AutoTokenizer.from_pretrained(“mistralai/Mistral-7B-Instruct-v0.1”) 
messages = [ 
{“role”: “user”, “content”: “What is your favourite condiment?”} 

encodeds = tokenizer.apply_chat_template(messages, return_tensors=“pt”) 
model_inputs = 
generated_ids = model.generate(model_inputs, max_new_tokens=500,   do_sample=True,pad_token_id=tokenizer.eos_token_id,eos_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)

9. Run the Python File

Execute the code with the command python3 to generate response. 


In conclusion, the chat completion model can be seamlessly accessed through a REST API by configuring an endpoint for inputting text within the request body. Once the API is set up, containerization via a Dockerfile simplifies deployment, ensuring smooth access to the model. By creating an image and launching a container on the server, Docker eliminates the need for individual installations, enhancing performance, scalability, and ease of deployment. This streamlined approach not only optimizes communication but also fosters more efficient and effective interactions within chat-based applications.

Table of Contents

Have Any Project

or Work Together?


          Keep an eye on your inbox for the PDF, it's on its way!

          If you don't see it in your inbox, don't forget to give your junk folder a quick peek. Just in case.

              You have successfully subscribed to the newsletter

              There was an error while trying to send your request. Please try again.

              Zehntech will use the information you provide on this form to be in touch with you and to provide updates and marketing.