如何使用 Cloud Run 作业在 GPU 上运行批量推理

关于此 Codelab

上次更新时间：3月 20, 2025

Google 员工编写

1. 简介

概览

在本例中，您将使用文本转换为 SQL 数据集对 gemma-2b 模型进行微调，以便 LLM 在收到自然语言问题时以 SQL 查询的形式进行回答。然后，您将使用 vLLM 将经过微调的模型部署到 Cloud Run 上。

学习内容

如何使用 Cloud Run 作业 GPU 进行微调
如何为 GPU 作业使用直接 VPC 配置，以更快地上传和分发模型

2. 准备工作

如需使用 GPU 功能，您必须为受支持的区域申请增加配额。所需配额为 nvidia_l4_gpu_allocation_no_zonal_redundancy，该配额位于 Cloud Run Admin API 下。以下是用于申请配额的直接链接。

3. 设置和要求

设置将在此 Codelab 中全程使用的环境变量。

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>
HF_TOKEN=<YOUR_HF_TOKEN>

AR_REPO=codelab-finetuning-jobs
IMAGE_NAME=finetune-to-gcs
JOB_NAME=finetuning-to-gcs-job
BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
SECRET_ID=HF_TOKEN
SERVICE_ACCOUNT="finetune-job-sa"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

运行以下命令来创建服务账号：

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Cloud Run job to access HF_TOKEN Secret ID"

使用 Secret Manager 存储 HuggingFace 访问令牌。

如需详细了解如何创建和使用 Secret，请参阅 Secret Manager 文档。

gcloud secrets create $SECRET_ID \
    --replication-policy="automatic"

printf $HF_TOKEN | gcloud secrets versions add $SECRET_ID --data-file=-

您将看到如下所示的输出

you'll see output similar to

Created secret [HF_TOKEN].
Created version [1] of the secret [HF_TOKEN].

向您的默认计算服务账号授予 Secret Manager Secret Accessor 角色

gcloud secrets add-iam-policy-binding $SECRET_ID \
   --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
   --role='roles/secretmanager.secretAccessor'

创建一个存储经过微调的模型的存储分区

gsutil mb -l $REGION gs://$BUCKET_NAME

然后，向 SA 授予对存储分区的访问权限。

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

为作业创建 Artifact Registry 代码库

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

为经过微调的模型创建 Cloud Storage 存储分区

gsutil mb -l $REGION gs://$BUCKET_NAME

最后，为您的 Cloud Run 作业创建一个 Artifact Registry 仓库。

gcloud artifacts repositories create $AR_REPO \
 --repository-format=docker \
 --location=$REGION \
 --description="codelab for finetuning using cloud run jobs"

4. 创建 Cloud Run 作业映像

在下一步中，您将创建以下代码：

从 huggingface 导入 gemma-2b
使用来自 huggingface 的数据集对 gemma-2b 进行微调，并使用文本转换为 SQL 数据集。该作业使用单个 L4 GPU 进行微调。
将名为 new_model 的精调模型上传到用户的 GCS 存储分区

为微调作业代码创建一个目录。

mkdir codelab-finetuning-job
cd codelab-finetuning-job

创建一个名为 finetune.py 的文件

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,

)
from peft import LoraConfig, PeftModel

from trl import SFTTrainer
from pathlib import Path

# GCS bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-2b")

# The instruction dataset to use
dataset_name = "b-mc2/sql-create-context"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-2b-sql")

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = int(os.getenv("LORA_R", "4"))

# Alpha parameter for LoRA scaling
lora_alpha = int(os.getenv("LORA_ALPHA", "8"))

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))

# Batch size per GPU for evaluation
per_device_eval_batch_size = int(os.getenv("EVAL_BATCH_SIZE", "2"))

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = int(os.getenv("LOGGING_STEPS", "50"))

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {'':torch.cuda.current_device()}

# Set limit to a positive number
limit = int(os.getenv("DATASET_LIMIT", "5000"))

dataset = load_dataset(dataset_name, split="train")
if limit != -1:
    dataset = dataset.shuffle(seed=42).select(range(limit))


def transform(data):
    question = data['question']
    context = data['context']
    answer = data['answer']
    template = "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return {'text': template.format(question=question, context=context, answer=answer)}


transformed = dataset.map(transform)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16")
        print("=" * 80)

# Load base model
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=transformed,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()

trainer.model.save_pretrained(new_model)

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Push to HF
# model.push_to_hub(new_model, check_pr=True)
# tokenizer.push_to_hub(new_model, check_pr=True)

# push to GCS

file_path_to_save_the_model = '/finetune/new_model'
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)

创建 requirements.txt 文件。

accelerate==0.30.1
bitsandbytes==0.43.1
datasets==2.19.1
transformers==4.41.0
peft==0.11.1
trl==0.8.6
torch==2.3.0

创建 Dockerfile

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install -r requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED 1

CMD python3 /finetune.py --device cuda

在 Artifact Registry 代码库中构建容器

gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME

5. 部署和执行作业

在此步骤中，您将创建使用直接 VPC 出站流量的作业 YAML 配置，以便更快地将数据上传到 Google Cloud Storage。

请注意，此文件包含您将在后续步骤中更新的变量。

首先，创建一个名为 finetune-job.yaml 的文件

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: finetuning-to-gcs-job
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: ALPHA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      parallelism: 1
      taskCount: 1
      template:
        spec:
          serviceAccountName: YOUR_SERVICE_ACCOUNT_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com
          containers:
          - name: finetune-to-gcs
            image: YOUR_REGION-docker.pkg.dev/YOUR_PROJECT_ID/YOUR_AR_REPO/YOUR_IMAGE_NAME
            env:
            - name: MODEL_NAME
              value: "google/gemma-2b"
            - name: NEW_MODEL
              value: "gemma-2b-sql-finetuned"
            - name: LORA_R
              value: "8"
            - name: LORA_ALPHA
              value: "16"
            - name: TRAIN_BATCH_SIZE
              value: "1"
            - name: EVAL_BATCH_SIZE
              value: "2"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: DATASET_LIMIT
              value: "1000"
            - name: MAX_SEQ_LENGTH
              value: "512"
            - name: LOGGING_STEPS
              value: "5"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: 'latest'
                  name: HF_TOKEN
            resources:
              limits:
                cpu: 8000m
                nvidia.com/gpu: '1'
                memory: 32Gi
            volumeMounts:
            - mountPath: /finetune/new_model
              name: finetuned_model
          volumes:
          - name: finetuned_model
            csi:
              driver: gcsfuse.run.googleapis.com
              readOnly: false
              volumeAttributes:
                bucketName: YOUR_RPOJECT_ID-codelab-finetuning-jobs
          maxRetries: 3
          timeoutSeconds: '3600'
          nodeSelector:
            run.googleapis.com/accelerator: nvidia-l4

现在，运行以下命令，将占位符替换为映像的环境变量：

sed -i "s/YOUR_SERVICE_ACCOUNT_NAME/$SERVICE_ACCOUNT/; s/YOUR_PROJECT_ID/$PROJECT_ID/;  s/YOUR_PROJECT_ID/$PROJECT_ID/; s/YOUR_REGION/$REGION/; s/YOUR_AR_REPO/$AR_REPO/; s/YOUR_IMAGE_NAME/$IMAGE_NAME/; s/YOUR_PROJECT_ID/$PROJECT_ID/" finetune-job.yaml

接下来，创建 Cloud Run 作业

gcloud alpha run jobs replace finetune-job.yaml

并执行作业。此过程大约需要 10 分钟。

gcloud alpha run jobs execute $JOB_NAME --region $REGION

6. 使用 Cloud Run 服务通过 vLLM 提供经过微调的模型

为将提供经过微调的模型的 Cloud Run 服务代码创建一个文件夹

cd ..
mkdir codelab-finetuning-service
cd codelab-finetuning-service

创建 service.yaml 文件

此配置使用直接 VPC 通过专用网络访问 GCS 存储分区，以加快下载速度。

请注意，此文件包含您将在后续步骤中更新的变量。

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: serve-gemma2b-sql
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      containers:
      - name: serve-finetuned
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        ports:
        - name: http1
          containerPort: 8000
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: fuse
          mountPath: /finetune/new_model
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=/finetune/new_model
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: 'new_model'
        - name: HF_HUB_OFFLINE
          value: '1'
      volumes:
      - name: fuse
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_BUCKET_NAME
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

使用您的存储分区名称更新 service.yaml 文件。

sed -i "s/YOUR_BUCKET_NAME/$BUCKET_NAME/" finetune-job.yaml

现在，部署您的 Cloud Run 服务

gcloud alpha run services replace service.yaml

7. 测试经过微调的模型

首先，获取 Cloud Run 服务的服务网址。

SERVICE_URL=$(gcloud run services describe serve-gemma2b-sql --platform managed --region $REGION --format 'value(status.url)')

为模型创建提示。

USER_PROMPT="Question: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)"

现在，使用 curl 命令调用您的服务

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}",
    "temperature": 0.1,
    "top_p": 1.0,
    "max_tokens": 56
}
EOF

您应该会看到如下所示的响应：

{"predictions":["Prompt:\nQuestion: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)\nOutput:\n CREATE TABLE people_to_candidates (candidate_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people (person_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people_to_candidates (person_id VARCHAR, candidate_id"]}

报告错误