How to fine-tune an LLM using Cloud Run jobs

About this codelab

Last updated: June 3, 2025

Written by: a Google employee

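1. Introduction

Overview

In this codelab, you fine-tune a Gemma model using Cloud Run jobs, then serve the result on Cloud Run using vLLM.

This codelab uses a text-to-SQL dataset so that the LLM answers with a SQL query when asked a question in natural language.

What you'll learn

How to fine-tune a model using GPUs with Cloud Run jobs
How to serve a model using Cloud Run with vLLM
How to use a Direct VPC configuration with a GPU job to upload and serve the model faster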

2. Before you begin

Enable APIs

Before you start this codelab, enable the following APIs by running:

gcloud services enable run.googleapis.com \
    compute.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

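GPU quota

Request a quota increase for a supported region. The quota is nvidia_l4_gpu_allocation_no_zonal_redundancy, under the Cloud Run Admin API.

Note: If you are using a new project, it can take a few minutes after you enable the API for the quota to appear on the quotas page.

Hugging Face

This codelab uses a model hosted on Hugging Face. To get this model, request a Hugging Face user access token with "Read" permission. This token is referred to later as YOUR_HF_TOKEN.

To use the model, you must also agree to its terms of use (https://huggingface.co/google/gemma-2b).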

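3. Setup and requirements

Set up the following resources:

An IAM service account and its associated IAM permissions
A Secret Manager secret to store your Hugging Face token
A Cloud Storage bucket to store the fine-tuned model
An Artifact Registry repository to store the image you build to fine-tune the model

Set the environment variables for this codelab. Several values are pre-filled; you specify your project ID, region, and Hugging Face token.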

export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>
export HF_TOKEN=<YOUR_HF_TOKEN>

export AR_REPO=codelab-finetuning-jobs
export IMAGE_NAME=finetune-to-gcs
export JOB_NAME=finetuning-to-gcs-job
export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
export SECRET_ID=HF_TOKEN
export SERVICE_ACCOUNT="finetune-job-sa"
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
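
The derived names above follow fixed patterns. As a minimal sketch, the following Python mirrors how the shell variables compose, which can help when checking values before you run the later commands. The helper name `derive_names`, the `IMAGE_URI` key, and the sample project ID and region are hypothetical and not part of the codelab; the image path pattern is the one used later in the build step.

```python
def derive_names(project_id: str, region: str) -> dict:
    """Mirror the shell variable composition used in this codelab."""
    # Fixed names copied from the export statements above.
    ar_repo = "codelab-finetuning-jobs"
    image_name = "finetune-to-gcs"
    service_account = "finetune-job-sa"
    return {
        "BUCKET_NAME": f"{project_id}-codelab-finetuning-jobs",
        "SERVICE_ACCOUNT_ADDRESS": (
            f"{service_account}@{project_id}.iam.gserviceaccount.com"
        ),
        # Hypothetical key: the codelab builds this path inline rather than
        # storing it in a variable.
        "IMAGE_URI": f"{region}-docker.pkg.dev/{project_id}/{ar_repo}/{image_name}",
    }


if __name__ == "__main__":
    # "my-project" and "us-central1" are placeholder values.
    for key, value in derive_names("my-project", "us-central1").items():
        print(f"{key}={value}")
```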

Create the service account by running the following command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service account for fine-tuning codelab"

Use Secret Manager to store your Hugging Face access token:

gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-

Grant your service account the Secret Manager Secret Accessor role:

gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role='roles/secretmanager.secretAccessor'

Create a bucket to host your fine-tuned model:
```
gcloud storage buckets create -l $REGION gs://$BUCKET_NAME
```

Grant your service account access to the bucket:

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/storage.objectAdmin

Create an Artifact Registry repository to store the container image:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

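4. Create the Cloud Run job image

In the next step, you create code that does the following:

Imports the Gemma model from Hugging Face.
Fine-tunes the model with a dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
Uploads the fine-tuned model, called new_model, to your Cloud Storage bucket.

Create a directory for the fine-tuning job code.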
```
mkdir codelab-finetuning-job
cd codelab-finetuning-job
```

Create a file named finetune.py.

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Cloud Storage bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-2b")

# The instruction dataset to use
dataset_name = "b-mc2/sql-create-context"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-2b-sql")

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = int(os.getenv("LORA_R", "4"))

# Alpha parameter for LoRA scaling
lora_alpha = int(os.getenv("LORA_ALPHA", "8"))

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))

# Batch size per GPU for evaluation
per_device_eval_batch_size = int(os.getenv("EVAL_BATCH_SIZE", "2"))

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = int(os.getenv("LOGGING_STEPS", "50"))

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the current CUDA device
device_map = {'':torch.cuda.current_device()}

# Number of training examples to sample; set to -1 to use the full dataset
limit = int(os.getenv("DATASET_LIMIT", "5000"))

dataset = load_dataset(dataset_name, split="train")
if limit != -1:
    dataset = dataset.shuffle(seed=42).select(range(limit))


def transform(data):
    question = data['question']
    context = data['context']
    answer = data['answer']
    template = "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return {'text': template.format(question=question, context=context, answer=answer)}


transformed = dataset.map(transform)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=transformed,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()

trainer.model.save_pretrained(new_model)

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Save to Cloud Storage: the job mounts the bucket at this path with Cloud Storage FUSE

file_path_to_save_the_model = '/finetune/new_model'
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)
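
To make the training format concrete, here is a small standalone sketch of the `transform` step from finetune.py applied to one row. Only the template string comes from the script; the sample row is made up for illustration and is not taken from the dataset.

```python
# Template string copied from finetune.py.
TEMPLATE = "Question: {question}\nContext: {context}\nAnswer: {answer}"


def transform(row: dict) -> dict:
    """Build the single 'text' field that SFTTrainer reads for each example."""
    return {
        "text": TEMPLATE.format(
            question=row["question"],
            context=row["context"],
            answer=row["answer"],
        )
    }


# Hypothetical row with the same three fields the script expects.
sample_row = {
    "question": "How many singers are there?",
    "context": "CREATE TABLE singer (singer_id VARCHAR)",
    "answer": "SELECT COUNT(*) FROM singer",
}

print(transform(sample_row)["text"])
```

Each training example places the SQL query after `Answer:`, so a prompt at inference time that follows the same layout may give the model a more familiar starting point.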

Create a requirements.txt file.

accelerate==0.34.2
bitsandbytes==0.45.5
datasets==2.19.1
transformers==4.51.3
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r /requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED=1

CMD ["python3", "/finetune.py"]

Build the container and push it to your Artifact Registry repository:

gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --region $REGION

5. Deploy and execute the job

In this step, you create the YAML configuration for the job with Direct VPC egress, for faster uploads to Google Cloud Storage.

This file contains variables that you update in a later step.

Create a file named finetune-job.yaml.tmpl.

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: $JOB_NAME
  labels:
    cloud.googleapis.com/location: $REGION
  annotations:
    run.googleapis.com/launch-stage: ALPHA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      parallelism: 1
      taskCount: 1
      template:
        spec:
          serviceAccountName: $SERVICE_ACCOUNT_ADDRESS
          containers:
          - name: $IMAGE_NAME
            image: $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME
            env:
            - name: MODEL_NAME
              value: "google/gemma-2b"
            - name: NEW_MODEL
              value: "gemma-2b-sql-finetuned"
            - name: BUCKET_NAME
              value: "$BUCKET_NAME"
            - name: LORA_R
              value: "8"
            - name: LORA_ALPHA
              value: "16"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: DATASET_LIMIT
              value: "1000"
            - name: LOGGING_STEPS
              value: "5"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: 'latest'
                  name: HF_TOKEN
            resources:
              limits:
                cpu: 8000m
                nvidia.com/gpu: '1'
                memory: 32Gi
            volumeMounts:
            - mountPath: /finetune/new_model
              name: finetuned_model
          volumes:
          - name: finetuned_model
            csi:
              driver: gcsfuse.run.googleapis.com
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
          maxRetries: 3
          timeoutSeconds: '3600'
          nodeSelector:
            run.googleapis.com/accelerator: nvidia-l4
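
The `env` entries in this YAML override the defaults that finetune.py reads with `os.getenv`. The sketch below works out what those overrides imply for one epoch: the effective batch size, the approximate number of optimizer steps, and the LoRA scaling factor `lora_alpha / r`. It assumes a single GPU, uses a hypothetical helper name `summarize_run`, and rounds steps down; the exact rounding depends on the Trainer version.

```python
# Defaults read by finetune.py via os.getenv (values copied from the script).
SCRIPT_DEFAULTS = {
    "DATASET_LIMIT": 5000,
    "TRAIN_BATCH_SIZE": 1,
    "GRADIENT_ACCUMULATION_STEPS": 1,
    "LORA_R": 4,
    "LORA_ALPHA": 8,
    "LOGGING_STEPS": 50,
}

# Overrides set in the env section of the job YAML.
JOB_OVERRIDES = {
    "DATASET_LIMIT": 1000,
    "GRADIENT_ACCUMULATION_STEPS": 2,
    "LORA_R": 8,
    "LORA_ALPHA": 16,
    "LOGGING_STEPS": 5,
}


def summarize_run(defaults: dict, overrides: dict, num_gpus: int = 1, epochs: int = 1) -> dict:
    """Estimate run-level quantities from the merged configuration."""
    cfg = {**defaults, **overrides}  # overrides win, as environment variables do
    effective_batch = (
        cfg["TRAIN_BATCH_SIZE"] * cfg["GRADIENT_ACCUMULATION_STEPS"] * num_gpus
    )
    # Approximation: one optimizer step per effective batch, rounded down.
    steps = (cfg["DATASET_LIMIT"] // effective_batch) * epochs
    return {
        "effective_batch_size": effective_batch,
        "optimizer_steps": steps,
        "log_events": steps // cfg["LOGGING_STEPS"],
        "lora_scaling": cfg["LORA_ALPHA"] / cfg["LORA_R"],
    }


print(summarize_run(SCRIPT_DEFAULTS, JOB_OVERRIDES))
```

The overrides double both `LORA_R` and `LORA_ALPHA`, so the scaling factor is unchanged while the adapter rank increases.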

Replace the variables in the YAML with your environment variables by running the following command:
```
envsubst < finetune-job.yaml.tmpl > finetune-job.yaml
```
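
Where `envsubst` is not installed, the substitution can be approximated in Python with `string.Template`, which uses the same `$NAME` placeholder syntax. This is a sketch rather than a drop-in replacement: the function names are hypothetical, and unlike `envsubst`, which replaces unset variables with an empty string, `Template.substitute` raises `KeyError`, which can surface a missing variable before the job is created.

```python
import os
from pathlib import Path
from string import Template


def render_template(text: str, variables) -> str:
    """Replace $NAME placeholders; raises KeyError if a variable is missing."""
    return Template(text).substitute(variables)


def render_file(src: str, dst: str, variables=None) -> None:
    """Render a template file, using the process environment by default."""
    values = os.environ if variables is None else variables
    Path(dst).write_text(render_template(Path(src).read_text(), values))


# Two lines in the style of the job template, with placeholder values.
snippet = (
    "name: $JOB_NAME\n"
    "image: $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME\n"
)
values = {
    "JOB_NAME": "finetuning-to-gcs-job",
    "REGION": "us-central1",      # hypothetical region
    "PROJECT_ID": "my-project",   # hypothetical project ID
    "AR_REPO": "codelab-finetuning-jobs",
    "IMAGE_NAME": "finetune-to-gcs",
}
print(render_template(snippet, values))
```

Calling `render_file("finetune-job.yaml.tmpl", "finetune-job.yaml")` would then play the role of the `envsubst` command above.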

Create the Cloud Run job:

gcloud alpha run jobs replace finetune-job.yaml

Execute the job:

gcloud alpha run jobs execute $JOB_NAME --region $REGION --async

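The job takes about 10 minutes to complete. You can check its status using the link provided in the output of the last command.

6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you deploy a Cloud Run service. This configuration uses Direct VPC to access the Cloud Storage bucket over the private network for faster downloads.

This file contains variables that you update in a later step.

Create a file named service.yaml.tmpl.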

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: serve-gemma-sql
  labels:
    cloud.googleapis.com/location: $REGION
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
      annotations:
        autoscaling.knative.dev/maxScale: '1'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/gpu-zonal-redundancy-disabled: 'true'
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      containers:
      - name: serve-finetuned
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250505_0916_RC00
        ports:
        - name: http1
          containerPort: 8000
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: fuse
          mountPath: /finetune/new_model
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=/finetune/new_model
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: 'new_model'
        - name: HF_HUB_OFFLINE
          value: '1'
      volumes:
      - name: fuse
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: $BUCKET_NAME
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

Generate service.yaml from the template, substituting your region and bucket name:
```
envsubst < service.yaml.tmpl > service.yaml
```

Deploy the Cloud Run service:

gcloud alpha run services replace service.yaml

7. Test your fine-tuned model

In this step, you prompt the model to test the fine-tuning.

Get the service URL of your Cloud Run service:

SERVICE_URL=$(gcloud run services describe serve-gemma-sql --platform managed --region $REGION --format 'value(status.url)')

Create a prompt for your model:

USER_PROMPT="Question: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)"

Use curl to call your service and prompt the model:

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}"
}
EOF

You should see a response similar to the following:

{"predictions":["Prompt:\nQuestion: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)\nOutput:\n CREATE TABLE people_to_candidates (candidate_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people (person_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people_to_candidates (person_id VARCHAR, candidate_id"]}
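
The same call can be made from Python. The sketch below uses only the standard library. The function names are hypothetical, the identity token is assumed to come from `gcloud auth print-identity-token`, and the response shape (a `predictions` list whose first string contains `Output:`) is inferred from the sample response above rather than from a documented schema.

```python
import json
import urllib.request


def build_request(service_url: str, prompt: str, identity_token: str) -> urllib.request.Request:
    """Build the POST request that the curl command sends to /generate."""
    # json.dumps escapes quotes and newlines in the prompt.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{service_url}/generate",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


def extract_output(response_text: str) -> str:
    """Return the text after 'Output:' in the first prediction, if present."""
    prediction = json.loads(response_text)["predictions"][0]
    _, separator, tail = prediction.partition("Output:\n")
    return tail.strip() if separator else prediction.strip()


def generate(service_url: str, prompt: str, identity_token: str, timeout: float = 120.0) -> str:
    """Send the prompt to the service and return the generated text."""
    request = build_request(service_url, prompt, identity_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_output(response.read().decode("utf-8"))
```

Building the body with `json.dumps` avoids a limitation of the shell heredoc, which produces invalid JSON if the prompt itself contains double quotes.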

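8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

How to fine-tune a model using GPUs with Cloud Run jobs
How to serve a model using Cloud Run with vLLM
How to use a Direct VPC configuration with a GPU job to upload and serve the model faster

9. Clean up

To avoid unintended charges, for example if the Cloud Run service is accidentally invoked more times than your monthly Cloud Run invocation allotment in the free tier, you can delete the Cloud Run service you created in step 6.

To delete the Cloud Run service, go to the Cloud Run page in the Cloud console at https://console.cloud.google.com/run and delete the serve-gemma-sql service.

To delete the entire project, go to Manage Resources, select the project you created in step 2, and choose Delete. After you delete the project, you need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.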
