Introduction

This tutorial guides you through deploying Llama 3 to Google Cloud Platform (GCP) Vertex AI using Magemaker and querying it using the interactive dropdown menu. Ensure you have followed the installation steps before proceeding.

You may need to request a quota increase for specific machine types and GPUs in the region where you plan to deploy the model. Check your GCP quotas before proceeding.

Step 1: Setting Up Magemaker for GCP

Run the following command to configure Magemaker for GCP Vertex AI deployment:

magemaker --cloud gcp

This initializes Magemaker with the necessary configurations for deploying models to Vertex AI.
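Magemaker relies on your local Google Cloud credentials. Before deploying, you can run a quick sanity check (a minimal sketch, assuming the google-auth package is installed) to confirm that Application Default Credentials resolve to the right project; the query code in Step 3 uses the same mechanism:

import google.auth

# Resolves Application Default Credentials; raises DefaultCredentialsError
# if no credentials are configured on this machine.
credentials, project = google.auth.default()
print(f"Authenticated against project: {project}")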

Step 2: YAML-based Deployment

For reproducible deployments, use YAML configuration:

magemaker --deploy .magemaker_config/your-model.yaml

Example YAML for GCP deployment:

deployment: !Deployment
  destination: gcp
  endpoint_name: llama3-endpoint
  accelerator_count: 1
  instance_type: n1-standard-8
  accelerator_type: NVIDIA_T4
  num_gpus: 1
  quantization: null

models:
  - !Model
    id: meta-llama/Meta-Llama-3-8B-Instruct
    location: null
    predict: null
    source: huggingface
    task: text-generation
    version: null

For gated models such as Llama from Meta, you must accept the model's terms of use on Hugging Face and add a Hugging Face token to your environment before the deployment can go through. A sample .env file is shown below.
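In this minimal .env sketch, PROJECT_ID and GCLOUD_REGION are the variable names read by the query code in Step 3; the Hugging Face variable name is an assumption, so check the installation guide for the exact name Magemaker expects:

# .env (illustrative values)
PROJECT_ID=your-gcp-project-id
GCLOUD_REGION=us-central1
# Assumed variable name for the Hugging Face token; verify against the installation guide.
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxx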

Selecting an Appropriate Instance

For Llama 3, a machine type such as n1-standard-8 with an attached NVIDIA T4 GPU (NVIDIA_T4) is a suitable configuration for most use cases. Adjust the instance type and GPU based on your workload requirements.
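For example, if a single T4 is too small for your workload (longer contexts, higher throughput), a hypothetical variant of the deployment block above targeting an NVIDIA L4 could look like the following; g2-standard-8 and NVIDIA_L4 are standard GCP machine and accelerator names, but confirm that Magemaker and your region support them:

deployment: !Deployment
  destination: gcp
  endpoint_name: llama3-endpoint-l4
  accelerator_count: 1
  instance_type: g2-standard-8
  accelerator_type: NVIDIA_L4
  num_gpus: 1
  quantization: null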

If you encounter quota issues, submit a quota increase request in the GCP console under “IAM & Admin > Quotas” for the specific GPU type in your deployment region.

Step 3: Querying the Deployed Model

Once the deployment is complete, note down the endpoint ID; you will need it to query the model.
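If you did not note it down, you can also list your endpoints programmatically; here is a minimal sketch using the google-cloud-aiplatform SDK (the project and region values are placeholders):

from google.cloud import aiplatform

# Point the SDK at your project and deployment region.
aiplatform.init(project="your-gcp-project-id", location="us-central1")

# Print the display name and numeric ID of every Vertex AI endpoint.
for endpoint in aiplatform.Endpoint.list():
    print(endpoint.display_name, endpoint.name)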

You can use the interactive dropdown menu to quickly query the model.

Querying Models

From the dropdown, select Query a Model Endpoint to see the list of model endpoints. Press space to select the endpoint you want to query. Enter your query in the text box and press enter to get the response.

Alternatively, you can query the endpoint programmatically over REST:

from typing import Optional

import google.auth
import google.auth.transport.requests
import requests
from dotenv import dotenv_values


def query_vertexai_endpoint_rest(
    endpoint_id: str,
    input_text: str,
    token_path: Optional[str] = None,
):
    """Query a deployed Vertex AI endpoint over REST and return the parsed JSON response."""

    # TODO: this will have to come from config files
    env = dotenv_values('.env')
    project_id = env.get('PROJECT_ID')
    location = env.get('GCLOUD_REGION')

    # Get credentials
    if token_path:
        credentials, project = google.auth.load_credentials_from_file(token_path)
    else:
        credentials, project = google.auth.default()
    
    # Refresh token
    auth_req = google.auth.transport.requests.Request()
    credentials.refresh(auth_req)
    
    # Prepare headers and URL
    headers = {
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json"
    }
    
    url = f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/endpoints/{endpoint_id}:predict"
    
    # Prepare payload
    payload = {
        "instances": [
            {
                "inputs": input_text,
                # TODO: this also needs to come from configs
                "parameters": {
                    "max_new_tokens": 100,
                    "temperature": 0.7,
                    "top_p": 0.95
                }
            }
        ]
    }
    
    # Make request
    response = requests.post(url, headers=headers, json=payload)
    print('Raw Response Content:', response.content.decode())

    return response.json()

endpoint_id = "your-endpoint-id-here"

input_text = 'What are you?'
resp = query_vertexai_endpoint_rest(endpoint_id=endpoint_id, input_text=input_text)
print(resp)
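The exact response shape depends on the serving container. For Hugging Face text-generation containers the output typically sits under a predictions key; the parsing below is a hedged sketch, so inspect the raw response printed by the function to confirm the structure for your deployment:

# Hypothetical parsing, assuming the container returns either
# {"predictions": ["...text..."]} or {"predictions": [{"generated_text": "..."}]}.
predictions = resp.get("predictions", [])
if predictions:
    first = predictions[0]
    text = first.get("generated_text") if isinstance(first, dict) else first
    print(text)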

Conclusion

You have successfully deployed Llama 3 to GCP Vertex AI with Magemaker and queried it through both the interactive dropdown menu and the REST API. For any questions or feedback, feel free to contact us at support@slashml.com.