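How to host an LLM in a sidecar for a Cloud Run function

1. Introduction

Overview

In this codelab, you'll learn how to host the gemma3:4b model in a sidecar for a Cloud Run function. Uploading a file to a Cloud Storage bucket triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to run inference using a Cloud Run function and an LLM hosted in a sidecar with a GPU
How to use the Direct VPC egress configuration for a Cloud Run GPU for faster upload and serving of the model
How to use Genkit to interact with your hosted Ollama model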

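2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota you need is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is listed under the Cloud Run Admin API. You can request the increase from the Quotas page in the Google Cloud console.

3. Setup and requirements

Set the environment variables that will be used throughout this codelab.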

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3
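
To make it easier to see what the derived variables expand to, here is a minimal TypeScript sketch that mirrors the shell definitions above. The helper deriveNames and the sample values "my-project" and "us-central1" are hypothetical and only for illustration; the codelab itself uses the shell variables.

```typescript
// Hypothetical helper: mirrors the shell variable definitions above.
interface CodelabNames {
  arRepo: string;
  functionName: string;
  bucketGemma: string;
  bucketDocs: string;
  serviceAccountAddress: string;
  imageSidecar: string;
}

function deriveNames(projectId: string, region: string): CodelabNames {
  const arRepo = "codelab-crf-sidecar-gpu";
  const serviceAccount = "crf-sidecar-gpu";
  return {
    arRepo,
    functionName: "crf-sidecar-gpu",
    bucketGemma: `${projectId}-codelab-crf-sidecar-gpu-gemma3`,
    bucketDocs: `${projectId}-codelab-crf-sidecar-gpu-docs`,
    serviceAccountAddress: `${serviceAccount}@${projectId}.iam.gserviceaccount.com`,
    imageSidecar: `${region}-docker.pkg.dev/${projectId}/${arRepo}/ollama-gemma3`,
  };
}

// Sample values only; substitute your own project ID and region.
const names = deriveNames("my-project", "us-central1");
// names.imageSidecar is
// "us-central1-docker.pkg.dev/my-project/codelab-crf-sidecar-gpu/ollama-gemma3"
```

Because both bucket names are prefixed with the project ID, they are unlikely to collide with buckets owned by other projects.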

Create a service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

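This codelab uses the same service account both as the identity of the Cloud Run function and as the service account the Eventarc trigger uses to invoke the function. You can create a separate service account for Eventarc if you prefer.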

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account permission to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket that will host the Gemma 3 model. This codelab uses a regional bucket. You can use a multi-regional bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then grant the service account access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

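Now create a regional bucket that will store the documents you want summarized. You can use a multi-regional bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).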

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the service account access to the docs bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar.

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from Ollama. To do this, install Ollama and then run the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

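Now, in a second terminal window, run the following command to pull the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the top-right menu bar.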

ollama run gemma3:4b

Once Ollama is running, feel free to ask the model a few questions, for example:

"why is the sky blue?"

When you are done chatting with the model, exit the chat by running:

/bye

Then, in the first terminal window, stop serving Ollama locally:

# On Linux / Cloud Shell, press Ctrl+C (or the equivalent for your shell)

The Ollama FAQ describes where models are stored on each operating system:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, you can find the downloaded Ollama models in /home/$USER/.ollama/models

Confirm that your model is stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your Cloud Storage bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME
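
As a quick sanity check after the copy, it can help to know which object to look for in the bucket. The sketch below computes the expected manifest object name for a model reference. It assumes Ollama's usual on-disk layout (a manifests folder keyed by registry.ollama.ai/library/MODEL/TAG next to a blobs folder) and that the recursive copy placed the local models directory under a models/ prefix; manifestObjectName is a hypothetical helper, not part of Ollama or the codelab.

```typescript
// Hypothetical helper. Assumes Ollama's default registry host
// (registry.ollama.ai) and the "library" namespace used for official models.
function manifestObjectName(modelRef: string): string {
  // "gemma3:4b" -> model "gemma3", tag "4b"; a missing tag falls back to "latest"
  const [model, tag = "latest"] = modelRef.split(":");
  return `models/manifests/registry.ollama.ai/library/${model}/${tag}`;
}

const expectedManifest = manifestObjectName("gemma3:4b");
// expectedManifest is "models/manifests/registry.ollama.ai/library/gemma3/4b"
```

If an object with that name is listed under gs://$BUCKET_GEMMA_NAME, the manifest was uploaded; the weights themselves live in the sibling models/blobs prefix.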

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside that folder, create a file called index.ts.

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();

// Configure Genkit to use the Ollama server running in the sidecar
const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});

// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for", cloudevent.subject, "\n---------------");

    const data = cloudevent.data;

    if (data) {
        if (data.bucket && data.name) {
            const bucketName = data.bucket;
            const fileName = data.name;
            const filePath = `${bucketName}/${fileName}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName);

                // Get a reference to the file
                const file = bucket.file(fileName);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences:\n\n${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {
                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket name or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
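
The two decisions the handler makes before calling the model (is the event usable, and what prompt to send) can be pulled into small pure functions that are easy to unit test. The following is a minimal sketch under stated assumptions: objectPath, buildSummaryPrompt, and the MAX_PROMPT_CHARS limit are hypothetical and not part of the codelab; the truncation is only one possible way to keep very large files from overflowing the model's context window.

```typescript
// Minimal shape of the fields the function reads from the event payload.
interface ObjectRef {
  bucket?: string;
  name?: string;
}

// Returns "bucket/name" when both fields are present, otherwise undefined.
function objectPath(data: ObjectRef | undefined): string | undefined {
  if (!data || !data.bucket || !data.name) {
    return undefined;
  }
  return `${data.bucket}/${data.name}`;
}

// Hypothetical limit, not from the codelab; tune it to the model's context window.
const MAX_PROMPT_CHARS = 20000;

// Builds the summarization prompt, truncating the document to maxChars characters.
function buildSummaryPrompt(fileContent: string, maxChars: number = MAX_PROMPT_CHARS): string {
  const body = fileContent.length > maxChars ? fileContent.slice(0, maxChars) : fileContent;
  return `Summarize the following document in just a few sentences:\n\n${body}`;
}
```

In the handler, objectPath(cloudevent.data) could replace the nested checks, and the string returned by buildSummaryPrompt could be passed as the prompt.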

Now, in the cr-function directory (at the same level as the src folder), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

In the same cr-function directory, create a tsconfig.json with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

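6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.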

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama in a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the sidecar directory:

cd ../ollama-gemma3

Create a file called Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
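
Because the sidecar listens on port 11434, any container in the same instance can reach Ollama over localhost. The codelab does this through Genkit, but as a rough illustration of what travels over that port, here is a sketch that builds a request for Ollama's /api/generate endpoint. The helper buildGenerateRequest and its return type are hypothetical; the endpoint path and the model, prompt, and stream fields come from Ollama's HTTP API.

```typescript
// Plain description of an HTTP request; hypothetical type for this sketch.
interface OllamaGenerateRequest {
  url: string;
  method: "POST";
  headers: { [key: string]: string };
  body: string;
}

function buildGenerateRequest(
  prompt: string,
  model: string = "gemma3:4b",
  serverAddress: string = "http://127.0.0.1:11434",
): OllamaGenerateRequest {
  return {
    url: `${serverAddress}/api/generate`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false asks Ollama for one JSON reply instead of a stream of chunks
    body: JSON.stringify({ model, prompt, stream: false }),
  };
}

const sampleRequest = buildGenerateRequest("why is the sky blue?");
// sampleRequest.url is "http://127.0.0.1:11434/api/generate"
```

On Node.js 18 or later, the result could be sent with the built-in fetch, for example fetch(sampleRequest.url, sampleRequest); the generated text is returned in the response field of the JSON reply.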

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

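8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update its YAML file to include the sidecar.

Retrieve the YAML for the Cloud Run function you just deployed by running: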

gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml

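Now add the sidecar to the Cloud Run function by updating the YAML as follows:

Insert the following YAML snippet directly above the line runtimeClassName: run.googleapis.com/linux-base-image-update. The "- image" entry must align with the "- image" entry of the ingress container.

      - image: YOUR_IMAGE_SIDECAR:latest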
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1

Run the following command to replace the placeholders in the YAML with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
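
For readers less familiar with sed, the substitution above can be sketched in TypeScript as a plain placeholder replacement. fillPlaceholders is a hypothetical helper and the sample values are made up. Note one difference: this sketch replaces every occurrence of a placeholder, whereas the sed expressions above (without the g flag) replace only the first occurrence on each line; for this YAML the result is the same.

```typescript
// Hypothetical helper: replaces each placeholder key with its value.
function fillPlaceholders(yaml: string, values: { [placeholder: string]: string }): string {
  let result = yaml;
  for (const placeholder of Object.keys(values)) {
    // split/join replaces all occurrences and needs no regex escaping
    result = result.split(placeholder).join(values[placeholder]);
  }
  return result;
}

const yamlSnippet = [
  "      - image: YOUR_IMAGE_SIDECAR:latest",
  "              bucketName: YOUR_BUCKET_GEMMA_NAME",
].join("\n");

// Sample values only; in the codelab these come from $IMAGE_SIDECAR and $BUCKET_GEMMA_NAME.
const filledSnippet = fillPlaceholders(yamlSnippet, {
  YOUR_IMAGE_SIDECAR: "us-central1-docker.pkg.dev/my-project/codelab-crf-sidecar-gpu/ollama-gemma3",
  YOUR_BUCKET_GEMMA_NAME: "my-project-codelab-crf-sidecar-gpu-gemma3",
});
```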

The finished YAML file should look similar to this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################
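
The startupProbe values in the YAML determine roughly how long each container is given to start. As a back-of-the-envelope sketch (an approximation that ignores exact probe timing and is not an official formula), the budget can be estimated as the initial delay plus the number of allowed failures times the probe period. approxStartupBudgetSeconds is a hypothetical helper.

```typescript
interface StartupProbe {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Rough estimate: wait for the initial delay, then allow failureThreshold
// probe attempts spaced periodSeconds apart.
function approxStartupBudgetSeconds(probe: StartupProbe): number {
  return probe.initialDelaySeconds + probe.failureThreshold * probe.periodSeconds;
}

// Sidecar probe from the YAML: 60 + 2 * 60 = 180 seconds
const sidecarBudget = approxStartupBudgetSeconds({
  initialDelaySeconds: 60,
  periodSeconds: 60,
  failureThreshold: 2,
});

// Ingress container probe from the YAML (no initial delay): 0 + 1 * 240 = 240 seconds
const ingressBudget = approxStartupBudgetSeconds({
  initialDelaySeconds: 0,
  periodSeconds: 240,
  failureThreshold: 1,
});
```

If the sidecar needs longer than this to load the model, raising failureThreshold or periodSeconds is one option to consider.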

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

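Lastly, create the Eventarc trigger for the function. This command also adds the trigger to the function.

Note: if you created a multi-regional bucket, you'll need to change the --location parameter.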

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
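
To illustrate what the two event filters select, here is a small sketch that checks whether a Cloud Storage CloudEvent would match this trigger and extracts the object name from its subject. It assumes the usual attribute format for Cloud Storage events (a source ending in /buckets/BUCKET and a subject of the form objects/NAME); matchesTrigger and objectNameFromSubject are hypothetical helpers, and the real filtering is done by Eventarc, not by your code.

```typescript
// Subset of CloudEvent attributes relevant to the trigger filters.
interface StorageEventAttributes {
  type: string;
  source: string;   // e.g. //storage.googleapis.com/projects/_/buckets/my-docs
  subject?: string; // e.g. objects/dogs.txt
}

const FINALIZED_TYPE = "google.cloud.storage.object.v1.finalized";

// True when the event is an object-finalized event from the given bucket.
function matchesTrigger(attrs: StorageEventAttributes, bucket: string): boolean {
  return attrs.type === FINALIZED_TYPE && attrs.source.endsWith(`/buckets/${bucket}`);
}

// "objects/dogs.txt" -> "dogs.txt"; anything else -> undefined
function objectNameFromSubject(subject: string | undefined): string | undefined {
  const prefix = "objects/";
  if (subject === undefined || !subject.startsWith(prefix)) {
    return undefined;
  }
  return subject.slice(prefix.length);
}
```

An upload of dogs.txt to the docs bucket would match, while a delete event or an upload to the Gemma 3 bucket would not.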

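9. Test your function

Upload a plain text file for summarization. Not sure what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs! Then upload that plain text file to your $BUCKET_DOCS_NAME bucket, and the gemma3:4b model will write a summary to the function logs.

In the logs, you'll see output similar to the following: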

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

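10. Troubleshooting

Here are some errors you may run into, often caused by typos:

If you get the error PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar uses port 11434. Also make sure you are using the correct sidecar image, in case you have multiple Ollama images in your Artifact Registry repository. The Cloud Run function serves on port 8080, so if the Ollama image you use as the sidecar also serves on port 8080, you'll hit this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1 or to a value less than or equal to your GPU quota.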
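
The first and third errors can be caught before deploying with a couple of simple checks. The sketch below is hypothetical (findPortConflicts and maxScaleAllowed are not part of gcloud or the codelab) and only mirrors the rules stated in the list above: containers in the same instance must not share a port, and max instances must not exceed the GPU quota.

```typescript
interface ContainerSpec {
  name: string;
  port: number;
}

// Returns one message per pair of containers that listen on the same port.
function findPortConflicts(containers: ContainerSpec[]): string[] {
  const seen = new Map<number, string>();
  const conflicts: string[] = [];
  for (const container of containers) {
    const other = seen.get(container.port);
    if (other !== undefined) {
      conflicts.push(`${other} and ${container.name} both use port ${container.port}`);
    } else {
      seen.set(container.port, container.name);
    }
  }
  return conflicts;
}

// True when maxScale is a positive integer within the GPU quota.
function maxScaleAllowed(maxScale: number, gpuQuota: number): boolean {
  return Number.isInteger(maxScale) && maxScale >= 1 && maxScale <= gpuQuota;
}

// The layout used in this codelab: function on 8080, Ollama sidecar on 11434.
const codelabConflicts = findPortConflicts([
  { name: "function", port: 8080 },
  { name: "gemma-sidecar", port: 11434 },
]);
// codelabConflicts is an empty array
```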
