Cloud Run 작업을 사용하여 GPU에서 일괄 추론을 실행하는 방법

이 Codelab 정보

최종 업데이트: 3월 20, 2025

작성자: Google 직원

이 페이지는 Cloud Translation API를 통해 번역되었습니다.

1. 소개

개요

이 예에서는 자연어로 질문을 받았을 때 LLM이 SQL 쿼리로 답변하도록 텍스트-SQL 데이터 세트를 사용하여 gemma-2b 모델을 미세 조정합니다. 그런 다음 미세 조정된 모델을 가져와 vLLM을 사용하여 Cloud Run에 제공합니다.

학습할 내용

Cloud Run Jobs GPU를 사용하여 미세 조정하는 방법
GPU 작업에 직접 VPC 구성을 사용하여 모델을 더 빠르게 업로드하고 제공하는 방법

2. 시작하기 전에

GPU 기능을 사용하려면 지원되는 리전의 할당량 상향을 요청해야 합니다. 필요한 할당량은 Cloud Run Admin API에 있는 nvidia_l4_gpu_allocation_no_zonal_redundancy입니다. 할당량 요청 바로가기 링크입니다.

3. 설정 및 요구사항

이 Codelab 전체에서 사용할 환경 변수를 설정합니다.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>
HF_TOKEN=<YOUR_HF_TOKEN>

AR_REPO=codelab-finetuning-jobs
IMAGE_NAME=finetune-to-gcs
JOB_NAME=finetuning-to-gcs-job
BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
SECRET_ID=HF_TOKEN
SERVICE_ACCOUNT="finetune-job-sa"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

다음 명령어를 실행하여 서비스 계정을 만듭니다.

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Cloud Run job to access HF_TOKEN Secret ID"

Secret Manager를 사용하여 HuggingFace 액세스 토큰을 저장합니다.

Secret Manager 문서에서 보안 비밀 만들기 및 사용에 대해 자세히 알아보세요.

gcloud secrets create $SECRET_ID \
    --replication-policy="automatic"

printf $HF_TOKEN | gcloud secrets versions add $SECRET_ID --data-file=-

다음과 비슷한 출력이 표시됩니다.

you'll see output similar to

Created secret [HF_TOKEN].
Created version [1] of the secret [HF_TOKEN].

기본 컴퓨팅 서비스 계정에 Secret Manager 보안 비밀 접근자 역할을 부여합니다.

gcloud secrets add-iam-policy-binding $SECRET_ID \
   --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
   --role='roles/secretmanager.secretAccessor'

미세 조정된 모델을 호스팅할 버킷 만들기

gsutil mb -l $REGION gs://$BUCKET_NAME

그런 다음 SA에 버킷에 대한 액세스 권한을 부여합니다.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

작업의 아티팩트 저장소 만들기

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

미세 조정된 모델을 위한 Cloud Storage 버킷 만들기

gsutil mb -l $REGION gs://$BUCKET_NAME

마지막으로 Cloud Run 작업의 Artifact Registry 저장소를 만듭니다.

gcloud artifacts repositories create $AR_REPO \
 --repository-format=docker \
 --location=$REGION \
 --description="codelab for finetuning using cloud run jobs"

4. Cloud Run 작업 이미지 만들기

다음 단계에서는 다음을 실행하는 코드를 만듭니다.

huggingface에서 gemma-2b를 가져옵니다.
huggingface의 데이터 세트를 사용하여 텍스트-SQL 데이터 세트로 gemma-2b를 미세 조정합니다. 작업은 미세 조정에 단일 L4 GPU를 사용합니다.
new_model이라는 미세 조정된 모델을 사용자의 GCS 버킷에 업로드합니다.

미세 조정 작업 코드의 디렉터리를 만듭니다.

mkdir codelab-finetuning-job
cd codelab-finetuning-job

finetune.py라는 파일을 만듭니다.

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,

)
from peft import LoraConfig, PeftModel

from trl import SFTTrainer
from pathlib import Path

# GCS bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-2b")

# The instruction dataset to use
dataset_name = "b-mc2/sql-create-context"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-2b-sql")

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = int(os.getenv("LORA_R", "4"))

# Alpha parameter for LoRA scaling
lora_alpha = int(os.getenv("LORA_ALPHA", "8"))

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))

# Batch size per GPU for evaluation
per_device_eval_batch_size = int(os.getenv("EVAL_BATCH_SIZE", "2"))

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = int(os.getenv("LOGGING_STEPS", "50"))

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {'':torch.cuda.current_device()}

# Set limit to a positive number
limit = int(os.getenv("DATASET_LIMIT", "5000"))

dataset = load_dataset(dataset_name, split="train")
if limit != -1:
    dataset = dataset.shuffle(seed=42).select(range(limit))


def transform(data):
    question = data['question']
    context = data['context']
    answer = data['answer']
    template = "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return {'text': template.format(question=question, context=context, answer=answer)}


transformed = dataset.map(transform)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16")
        print("=" * 80)

# Load base model
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=transformed,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()

trainer.model.save_pretrained(new_model)

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Push to HF
# model.push_to_hub(new_model, check_pr=True)
# tokenizer.push_to_hub(new_model, check_pr=True)

# push to GCS

file_path_to_save_the_model = '/finetune/new_model'
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)

requirements.txt 파일을 만듭니다.

accelerate==0.30.1
bitsandbytes==0.43.1
datasets==2.19.1
transformers==4.41.0
peft==0.11.1
trl==0.8.6
torch==2.3.0

Dockerfile를 만드는 방법

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install -r requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED 1

CMD python3 /finetune.py --device cuda

Artifact Registry 저장소에서 컨테이너 빌드

gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME

5. 작업 배포 및 실행

이 단계에서는 Google Cloud Storage에 더 빠르게 업로드할 수 있도록 직접 VPC 이그레스가 포함된 작업 YAML 구성을 만듭니다.

이 파일에는 다음 단계에서 업데이트할 변수가 포함되어 있습니다.

먼저 finetune-job.yaml라는 파일을 만듭니다.

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: finetuning-to-gcs-job
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: ALPHA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      parallelism: 1
      taskCount: 1
      template:
        spec:
          serviceAccountName: YOUR_SERVICE_ACCOUNT_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com
          containers:
          - name: finetune-to-gcs
            image: YOUR_REGION-docker.pkg.dev/YOUR_PROJECT_ID/YOUR_AR_REPO/YOUR_IMAGE_NAME
            env:
            - name: MODEL_NAME
              value: "google/gemma-2b"
            - name: NEW_MODEL
              value: "gemma-2b-sql-finetuned"
            - name: LORA_R
              value: "8"
            - name: LORA_ALPHA
              value: "16"
            - name: TRAIN_BATCH_SIZE
              value: "1"
            - name: EVAL_BATCH_SIZE
              value: "2"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: DATASET_LIMIT
              value: "1000"
            - name: MAX_SEQ_LENGTH
              value: "512"
            - name: LOGGING_STEPS
              value: "5"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: 'latest'
                  name: HF_TOKEN
            resources:
              limits:
                cpu: 8000m
                nvidia.com/gpu: '1'
                memory: 32Gi
            volumeMounts:
            - mountPath: /finetune/new_model
              name: finetuned_model
          volumes:
          - name: finetuned_model
            csi:
              driver: gcsfuse.run.googleapis.com
              readOnly: false
              volumeAttributes:
                bucketName: YOUR_RPOJECT_ID-codelab-finetuning-jobs
          maxRetries: 3
          timeoutSeconds: '3600'
          nodeSelector:
            run.googleapis.com/accelerator: nvidia-l4

이제 다음 명령어를 실행하여 자리표시자를 이미지의 환경 변수로 바꿉니다.

sed -i "s/YOUR_SERVICE_ACCOUNT_NAME/$SERVICE_ACCOUNT/; s/YOUR_PROJECT_ID/$PROJECT_ID/;  s/YOUR_PROJECT_ID/$PROJECT_ID/; s/YOUR_REGION/$REGION/; s/YOUR_AR_REPO/$AR_REPO/; s/YOUR_IMAGE_NAME/$IMAGE_NAME/; s/YOUR_PROJECT_ID/$PROJECT_ID/" finetune-job.yaml

다음으로 Cloud Run 작업을 만듭니다.

gcloud alpha run jobs replace finetune-job.yaml

작업을 실행합니다. 이 과정은 약 10분 정도 걸립니다.

gcloud alpha run jobs execute $JOB_NAME --region $REGION

6. Cloud Run 서비스를 사용하여 vLLM으로 미세 조정된 모델 제공

미세 조정된 모델을 제공할 Cloud Run 서비스 코드의 폴더 만들기

cd ..
mkdir codelab-finetuning-service
cd codelab-finetuning-service

service.yaml 파일 만들기

이 구성은 더 빠른 다운로드를 위해 직접 VPC를 사용하여 비공개 네트워크를 통해 GCS 버킷에 액세스합니다.

이 파일에는 다음 단계에서 업데이트할 변수가 포함되어 있습니다.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: serve-gemma2b-sql
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      containers:
      - name: serve-finetuned
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        ports:
        - name: http1
          containerPort: 8000
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: fuse
          mountPath: /finetune/new_model
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=/finetune/new_model
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: 'new_model'
        - name: HF_HUB_OFFLINE
          value: '1'
      volumes:
      - name: fuse
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_BUCKET_NAME
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

service.yaml 파일을 버킷 이름으로 업데이트합니다.

sed -i "s/YOUR_BUCKET_NAME/$BUCKET_NAME/" finetune-job.yaml

이제 Cloud Run 서비스를 배포합니다.

gcloud alpha run services replace service.yaml

7. 미세 조정된 모델 테스트

먼저 Cloud Run 서비스의 서비스 URL을 가져옵니다.

SERVICE_URL=$(gcloud run services describe serve-gemma2b-sql --platform managed --region $REGION --format 'value(status.url)')

모델의 프롬프트를 만듭니다.

USER_PROMPT="Question: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)"

이제 서비스를 curl합니다.

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}",
    "temperature": 0.1,
    "top_p": 1.0,
    "max_tokens": 56
}
EOF

다음과 비슷한 응답이 표시됩니다.

{"predictions":["Prompt:\nQuestion: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)\nOutput:\n CREATE TABLE people_to_candidates (candidate_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people (person_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people_to_candidates (person_id VARCHAR, candidate_id"]}

오류 신고