Deploying Llama 3 to GCP
Introduction
This tutorial guides you through deploying Llama 3 to Google Cloud Platform (GCP) Vertex AI with Magemaker and querying it through the interactive dropdown menu. Ensure you have followed the installation steps before proceeding.
You may need to request a quota increase for specific machine types and GPUs in the region where you plan to deploy the model. Check your GCP quotas before proceeding.
Step 1: Setting Up Magemaker for GCP
Run the following command to configure Magemaker for GCP Vertex AI deployment:
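Assuming Magemaker's standard CLI, where the `--cloud` flag selects the target provider:

```bash
# Set up Magemaker for deployments to GCP Vertex AI
magemaker --cloud gcp
```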
This initializes Magemaker with the necessary configurations for deploying models to Vertex AI.
Step 2: YAML-based Deployment
For reproducible deployments, use a YAML configuration file:
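One way to launch such a deployment, assuming Magemaker's `--deploy` flag and with a placeholder config path:

```bash
# Deploy from a YAML config (path is illustrative)
magemaker --deploy .magemaker_config/llama3-gcp.yaml
```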
Example YAML for GCP deployment:
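A sketch of what the config might look like; the exact schema comes from Magemaker, so treat the field names and values below as illustrative and check them against your installed version:

```yaml
deployment: !Deployment
  destination: gcp
  endpoint_name: llama3-endpoint   # illustrative endpoint name
  instance_type: n1-standard-8     # machine type discussed below
  accelerator_type: NVIDIA_T4      # attached GPU
  accelerator_count: 1

models:
- !Model
  id: meta-llama/Meta-Llama-3-8B   # gated model; requires accepted terms + HF token
  source: huggingface
  task: text-generation
```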
For gated models such as Meta's Llama, you must accept the model's terms of use on Hugging Face and add your Hugging Face token to the environment; otherwise the deployment will not go through.
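One common way to provide the token is via an environment variable; the variable name below is the one read by the `huggingface_hub` library, so confirm which name Magemaker expects:

```bash
# Expose your Hugging Face access token (value is a placeholder)
export HUGGING_FACE_HUB_TOKEN=hf_your_token_here
```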
Selecting an Appropriate Instance
For Llama 3, a machine type such as `n1-standard-8` with an attached NVIDIA T4 GPU (`NVIDIA_T4`) is a suitable configuration for most use cases. Adjust the machine type and GPU based on your workload requirements.
If you encounter quota issues, submit a quota increase request in the GCP console under “IAM & Admin > Quotas” for the specific GPU type in your deployment region.
Step 3: Querying the Deployed Model
Once the deployment is complete, note down the endpoint ID.
You can use the interactive dropdown menu to quickly query the model.
Querying Models
From the dropdown, select **Query a Model Endpoint** to see the list of model endpoints. Press space to select the endpoint you want to query, then enter your query in the text box and press enter to get the response.
Or you can use the following code:
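Below is a minimal sketch that queries the endpoint directly with the `google-cloud-aiplatform` Python SDK rather than through Magemaker; the project, region, and endpoint values are placeholders, and the request payload shape is an assumption to adjust for your model's serving container:

```python
from google.cloud import aiplatform

# Placeholder values -- replace with your project, region, and the
# endpoint ID you noted after deployment.
PROJECT_ID = "your-project-id"
REGION = "us-central1"
ENDPOINT_ID = "1234567890"

aiplatform.init(project=PROJECT_ID, location=REGION)

# Attach to the deployed Vertex AI endpoint by its full resource name
endpoint = aiplatform.Endpoint(
    f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}"
)

# The payload shape is an assumption based on Hugging Face
# text-generation containers on Vertex AI; adjust it to match the
# serving signature of your deployed model.
response = endpoint.predict(
    instances=[{"inputs": "What is the capital of France?"}]
)
print(response.predictions)
```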
Conclusion
You have successfully deployed and queried Llama 3 on GCP Vertex AI using Magemaker’s interactive dropdown menu. For any questions or feedback, feel free to contact us at support@slashml.com.