How to fine-tune an LLM using Cloud Run Jobs

About this codelab

Last updated Jun 22, 2025
Written by a Googler

1. Introduction

Overview

In this codelab, you will use Cloud Run jobs to fine-tune a Gemma 3 model, then serve the result on Cloud Run using vLLM.

What you'll do

Train a model to respond to a specific phrase with a specific result using the KomeijiForce/Text2Emoji dataset, established as part of EmojiLM: Modeling the New Emoji Language.

After training, the model responds to a sentence prefixed with "Translate to emoji: ", with a series of emoji corresponding to that sentence.
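
Each training example pairs a prefixed sentence with its emoji translation. The sketch below mirrors the `format_to_chat` function from `finetune.py` later in this codelab and shows the two-turn conversation it produces. The sample row is made up for illustration; real rows come from the dataset's `text` and `emoji` columns.

```python
def format_to_chat(example: dict) -> dict:
    """Turn one dataset row into a user/assistant conversation."""
    return {
        "conversations": [
            {"role": "user", "content": f"Translate to emoji: {example['text']}"},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }


# Hypothetical row, invented for this example (not taken from the dataset).
sample_row = {"text": "I love pizza", "emoji": "🍕😍"}

formatted = format_to_chat(sample_row)
for turn in formatted["conversations"]:
    print(turn["role"], "->", turn["content"])
```

The user turn carries the prefix and the sentence, and the assistant turn carries only the emoji, which is the behavior the fine-tuned model is expected to reproduce.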

What you'll learn

  • How to fine-tune a model using Cloud Run jobs with a GPU
  • How to serve a model using Cloud Run with vLLM
  • How to use Direct VPC configuration for a GPU Job for faster upload and serving of the model

2. Before you begin

Enable APIs

Before you can start using this codelab, enable the following APIs by running:

gcloud services enable run.googleapis.com \
    compute.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

GPU Quota

Review the GPU Quota documentation to confirm how to request quota.

If you encounter any "You do not have quota for using GPUs" errors, confirm your quota on g.co/cloudrun/gpu-quota.

Note: If you are using a new project, it may take a few minutes between enabling the API and having the quotas appear in the quota page.

Hugging Face

This codelab uses a model hosted on Hugging Face. To download this model, create a Hugging Face user access token with "Read" permission. You will reference this token later as YOUR_HF_TOKEN.

To use the gemma-3-1b-it model, you must agree to the usage terms.

3. Setup and Requirements

Set up the following resources:

  • IAM service account and associated IAM permissions,
  • Secret Manager secret to store your Hugging Face token,
  • Cloud Storage bucket to store your fine-tuned model, and
  • Artifact Registry repository to store the image you'll build to fine-tune your model.
  1. Set environment variables for this codelab. We pre-populated a number of variables for you. Specify your project ID, region, and Hugging Face token.
    export PROJECT_ID=<YOUR_PROJECT_ID>
    export REGION=<YOUR_REGION>
    export HF_TOKEN=<YOUR_HF_TOKEN>

    export NEW_MODEL=gemma-emoji
    export AR_REPO=codelab-finetuning-jobs
    export IMAGE_NAME=finetune-to-gcs
    export JOB_NAME=finetuning-to-gcs-job
    export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
    export SECRET_ID=HF_TOKEN
    export SERVICE_ACCOUNT="finetune-job-sa"
    export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
  2. Create the service account by running this command:
    gcloud iam service-accounts create $SERVICE_ACCOUNT \
      --display-name="Service account for fine-tuning codelab"
  3. Use Secret Manager to store your Hugging Face access token:
    gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

    printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-
  4. Grant your service account the role of Secret Manager Secret Accessor:
    gcloud secrets add-iam-policy-binding $SECRET_ID \
      --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
      --role='roles/secretmanager.secretAccessor'
  5. Create a bucket that will host your fine-tuned model:
    gcloud storage buckets create -l $REGION gs://$BUCKET_NAME
  6. Grant your service account access to the bucket:
    gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
      --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
      --role=roles/storage.objectAdmin
  7. Create an Artifact Registry repository to store the container image:
    gcloud artifacts repositories create $AR_REPO \
      --repository-format=docker \
      --location=$REGION \
      --description="codelab for finetuning using CR jobs" \
      --project=$PROJECT_ID
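
Several of the values above are derived from your project ID and region. As a rough illustration, the sketch below recomputes the bucket name, the service account address, and the image path used later by `gcloud builds submit`. The class name `CodelabResources` and the sample values `my-project` and `europe-west1` are placeholders invented for this example, not values the codelab requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodelabResources:
    """Hypothetical helper that mirrors the exported shell variables."""

    project_id: str
    region: str
    ar_repo: str = "codelab-finetuning-jobs"
    image_name: str = "finetune-to-gcs"
    service_account: str = "finetune-job-sa"

    @property
    def bucket_name(self) -> str:
        # Mirrors BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
        return f"{self.project_id}-codelab-finetuning-jobs"

    @property
    def service_account_address(self) -> str:
        # Mirrors SERVICE_ACCOUNT_ADDRESS
        return f"{self.service_account}@{self.project_id}.iam.gserviceaccount.com"

    @property
    def image_uri(self) -> str:
        # Mirrors $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME
        return f"{self.region}-docker.pkg.dev/{self.project_id}/{self.ar_repo}/{self.image_name}"


# Placeholder project and region, chosen only for illustration.
resources = CodelabResources(project_id="my-project", region="europe-west1")
print(resources.bucket_name)
print(resources.service_account_address)
print(resources.image_uri)
```

Because the bucket name embeds the project ID, it is likely to be unique, which matters since Cloud Storage bucket names share a global namespace.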

4. Create the Cloud Run job image

In the next step, you'll create the code that does the following:

  • Imports the Gemma model from Hugging Face
  • Fine-tunes the model with the dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
  • Uploads the fine-tuned model called new_model to your Cloud Storage bucket
  1. Create a directory for your fine tuning job code.
    mkdir codelab-finetuning-job
    cd codelab-finetuning-job
  2. Create a file called finetune.py
    # Copyright 2025 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #      http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    import os

    import torch
    from datasets import load_dataset
    from peft import LoraConfig, PeftModel
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        TrainingArguments,
    )
    from trl import SFTTrainer

    # Cloud Storage bucket to upload the model
    bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

    # The model that you want to train from the Hugging Face hub
    model_name = os.getenv("MODEL_NAME", "google/gemma-3-1b-it")

    # The instruction dataset to use
    dataset_name = "KomeijiForce/Text2Emoji"

    # Fine-tuned model name
    new_model = os.getenv("NEW_MODEL", "gemma-emoji")

    ############################ Setup ############################################

    # Load the entire model on the GPU 0
    device_map = {"": torch.cuda.current_device()}

    # Limit dataset to a random selection
    dataset = load_dataset(dataset_name, split="train").shuffle(seed=42).select(range(1000))

    # Setup input formats: trains the model to respond to "Translate to emoji:" with emoji output.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def format_to_chat(example):
        return {
            "conversations": [
                {"role": "user", "content": f"Translate to emoji: {example['text']}"},
                {"role": "assistant", "content": example["emoji"]},
            ]
        }

    formatted_dataset = dataset.map(
        format_to_chat,
        batched=False,                        # Process row by row
        remove_columns=dataset.column_names,  # Optional: Keep only the new column
    )

    def apply_chat_template(examples):
        texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
        return {"text": texts}

    final_dataset = formatted_dataset.map(apply_chat_template, batched=True)

    ############################# Config #########################################

    # Load tokenizer and model with QLoRA configuration
    bnb_4bit_compute_dtype = "float16"  # Compute dtype for 4-bit base models
    compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # Activate 4-bit precision base model loading
        bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,  # Nested (double) quantization for 4-bit base models; disabled here
    )

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map=device_map,
        torch_dtype=torch.float16,
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1

    ############################## Train ##########################################

    # Load LoRA configuration
    peft_config = LoraConfig(
        lora_alpha=16,     # Alpha parameter for LoRA scaling
        lora_dropout=0.1,  # Dropout probability for LoRA layers
        r=8,               # LoRA attention dimension
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"],
    )

    # Set training parameters
    training_arguments = TrainingArguments(
        output_dir="./results",
        num_train_epochs=1,
        per_device_train_batch_size=1,  # Batch size per GPU for training
        gradient_accumulation_steps=2,  # Number of update steps to accumulate the gradients for
        optim="paged_adamw_32bit",
        save_steps=0,
        logging_steps=5,
        learning_rate=2e-4,    # Initial learning rate (AdamW optimizer)
        weight_decay=0.001,    # Weight decay to apply to all layers except bias/LayerNorm weights
        fp16=True,             # Train in fp16
        bf16=False,            # bf16 training is disabled
        max_grad_norm=0.3,     # Maximum gradient norm (gradient clipping)
        warmup_ratio=0.03,     # Ratio of steps for a linear warmup (from 0 to learning rate)
        group_by_length=True,  # Group sequences of similar length into batches; saves memory and speeds up training
        lr_scheduler_type="cosine",
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=final_dataset,
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,  # Maximum sequence length to use
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False,       # Pack multiple short examples in the same input sequence to increase efficiency
    )

    trainer.train()
    trainer.model.save_pretrained(new_model)

    ################################# Save ########################################

    # Reload model in FP16 and merge it with LoRA weights
    base_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
    model = PeftModel.from_pretrained(base_model, new_model)
    model = model.merge_and_unload()

    # Save the merged model and tokenizer to the path where the Cloud Storage volume is mounted
    file_path_to_save_the_model = "/finetune/new_model"
    model.save_pretrained(file_path_to_save_the_model)
    tokenizer.save_pretrained(file_path_to_save_the_model)

  3. Create a requirements.txt file:
    accelerate==0.34.2
    bitsandbytes==0.45.5
    datasets==2.19.1
    transformers==4.51.3
    peft==0.11.1
    trl==0.8.6
    torch==2.3.0
  4. Create a Dockerfile:
    FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

    RUN apt-get update && \
        apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
        rm -rf /var/lib/apt/lists/*

    COPY requirements.txt /requirements.txt

    RUN pip3 install -r requirements.txt --no-cache-dir

    COPY finetune.py /finetune.py

    ENV PYTHONUNBUFFERED=1

    CMD python3 /finetune.py --device cuda
  5. Build the container in your Artifact Registry repository:
    gcloud builds submit \
      --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
      --region $REGION
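
To get a feel for how small the LoRA adapter configured in `finetune.py` is, the sketch below counts the parameters that LoRA adds to the `q_proj` and `v_proj` layers with `r=8`. LoRA adds two low-rank matrices per adapted linear layer, so the count per layer is `r * (in_features + out_features)`. The model dimensions used here are assumptions about the gemma-3-1b-it architecture and are not stated in this codelab; check them against the model's `config.json` before relying on the result.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by LoRA to one linear layer: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


# ASSUMED model dimensions (not from the codelab); verify against config.json.
HIDDEN_SIZE = 1152     # assumed input width of q_proj and v_proj
Q_PROJ_OUT = 4 * 256   # assumed: 4 attention heads x head dimension 256
V_PROJ_OUT = 1 * 256   # assumed: 1 key/value head x head dimension 256
NUM_LAYERS = 26        # assumed number of transformer layers

RANK = 8               # r=8, taken from LoraConfig in finetune.py

per_layer = (
    lora_params(HIDDEN_SIZE, Q_PROJ_OUT, RANK)
    + lora_params(HIDDEN_SIZE, V_PROJ_OUT, RANK)
)
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapter is well under one percent of a one-billion-parameter base model, which is why a single GPU is enough for this job.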

5. Deploy and execute the job

In this step, you'll create the job with direct VPC egress for faster uploads to Google Cloud Storage.

  1. Create the Cloud Run Job:
    gcloud beta run jobs create $JOB_NAME \
      --region $REGION \
      --image $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
      --set-env-vars BUCKET_NAME=$BUCKET_NAME \
      --set-secrets HF_TOKEN=$SECRET_ID:latest \
      --cpu 8.0 \
      --memory 32Gi \
      --gpu 1 \
      --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
      --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
      --service-account $SERVICE_ACCOUNT_ADDRESS
  2. Execute the job:
    gcloud beta run jobs execute $JOB_NAME --region $REGION --async

The job will take around 10 minutes to complete. You can check on the status using the link provided in the output of the last command.
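
For a rough sense of how much work the job does, the number of optimizer steps can be estimated from the settings in `finetune.py`. This is a simplified estimate that assumes a single GPU and ignores details of how the trainer handles rounding and the final partial batch.

```python
import math

# Values taken from finetune.py
NUM_EXAMPLES = 1000          # dataset is limited with .select(range(1000))
PER_DEVICE_BATCH_SIZE = 1    # per_device_train_batch_size
GRAD_ACCUM_STEPS = 2         # gradient_accumulation_steps
NUM_EPOCHS = 1               # num_train_epochs
WARMUP_RATIO = 0.03          # warmup_ratio

NUM_GPUS = 1                 # the job requests a single GPU

# Examples consumed per optimizer update
effective_batch = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS

# Optimizer updates over the whole run
total_steps = (NUM_EXAMPLES // effective_batch) * NUM_EPOCHS

# Steps spent ramping the learning rate up from 0 to 2e-4
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
```

With `logging_steps=5`, the job logs should show a loss entry roughly every five of these optimizer steps.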

6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you will deploy a Cloud Run service. This configuration uses direct VPC to access the Cloud Storage bucket over a private network for faster downloads.

  • Deploy your Cloud Run Service:
    gcloud run deploy serve-gemma-emoji \
      --image us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250601_0916_RC01 \
      --region $REGION \
      --port 8000 \
      --set-env-vars MODEL_ID=new_model,HF_HUB_OFFLINE=1 \
      --cpu 8.0 \
      --memory 32Gi \
      --gpu 1 \
      --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
      --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
      --service-account $SERVICE_ACCOUNT_ADDRESS \
      --max-instances 1 \
      --command python3 \
      --args="-m,vllm.entrypoints.api_server,--model=/finetune/new_model,--tensor-parallel-size=1" \
      --no-gpu-zonal-redundancy \
      --no-allow-unauthenticated
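
The `--args` flag takes a comma-separated list, and each element becomes one argument to the program given by `--command`. The sketch below shows the resulting argument vector for the deploy command above. It is a simplification that assumes no argument contains a comma, and `container_argv` is a helper name invented for this example.

```python
def container_argv(command: str, args: str) -> list[str]:
    """Combine the --command value with the comma-separated --args value."""
    return [command, *args.split(",")]


argv = container_argv(
    "python3",
    "-m,vllm.entrypoints.api_server,--model=/finetune/new_model,--tensor-parallel-size=1",
)
for position, argument in enumerate(argv):
    print(position, argument)
```

Note that `--model` points at `/finetune/new_model`, the same path where the Cloud Storage volume is mounted, so the server loads the weights that the fine-tuning job saved to the bucket.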

7. Test your fine-tuned model

In this step, you will prompt your model to test the fine tuning using curl.

  1. Get the service URL for your Cloud Run service:
    SERVICE_URL=$(gcloud run services describe serve-gemma-emoji \
        --region $REGION --format 'value(status.url)')
  2. Create your prompt for your model.
    USER_PROMPT="Translate to emoji: I ate a banana for breakfast, later I'm thinking of having soup!"
  3. Call your service using curl to prompt your model, filtering the results with jq:
    curl -s -X POST ${SERVICE_URL}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: bearer $(gcloud auth print-identity-token)" \
    -d @- <<EOF | jq ".choices[0].message.content"
    {   "model": "${NEW_MODEL}",
        "messages": [{
            "role": "user",
            "content": [ { "type": "text", "text": "${USER_PROMPT}"}]
        }]
    }
    EOF

You should see a response similar to the following:

🍌🤔😋🥣
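
If you prefer to call the service from Python rather than curl, the sketch below builds the same request with the standard library. It only constructs the request and parses a response body; it does not send anything on its own. The function names are invented for this example, and the identity token is assumed to come from `gcloud auth print-identity-token`, as in the curl command.

```python
import json
import urllib.request


def build_chat_request(
    service_url: str, identity_token: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command in this step."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Equivalent of the jq filter .choices[0].message.content."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send the request, pass the result of `build_chat_request` to `urllib.request.urlopen` and hand the bytes returned by `read()` to `extract_reply`.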

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run Jobs GPU documentation.

What we've covered

  • How to fine-tune a model using Cloud Run jobs with a GPU
  • How to serve a model using Cloud Run with vLLM
  • How to use Direct VPC configuration for a GPU Job for faster upload and serving of the model

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allocation in the free tier), delete the Cloud Run service you created in Step 6.

To delete the Cloud Run service, go to the Cloud Run Cloud Console at https://console.cloud.google.com/run and delete the serve-gemma-emoji service.

To delete the entire project, go to Manage Resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.