Bigtable and Dataflow: Database Monitoring Art (HBase Java Client)

About this codelab

Last updated: Oct 8, 2020

Written by: thebilly

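1. Introduction

In this codelab, you'll use Cloud Bigtable's monitoring tools to create various works of art by writing and reading data with Cloud Dataflow and the Java HBase client.

What you'll learn

How to load large amounts of data into Bigtable using Cloud Dataflow
How to monitor Bigtable instances and tables as your data is ingested
How to query Bigtable using a Dataflow job
How to explore the Key Visualizer tool, which can be used to find hotspots caused by your schema design
How to create art with Key Visualizer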

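2. Create your Bigtable database

Cloud Bigtable is Google's NoSQL Big Data database service. It is the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail. It is well suited to running large analytical workloads and building low-latency applications. For a more thorough introduction, see the "Introduction to Cloud Bigtable" codelab.

Create a project

First, create a new project. Then use the built-in Cloud Shell, which you can open by clicking "Activate Cloud Shell".

Set the following environment variables so that the codelab commands are easier to copy and paste: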

BIGTABLE_PROJECT=$GOOGLE_CLOUD_PROJECT
INSTANCE_ID="keyviz-art-instance"
CLUSTER_ID="keyviz-art-cluster"
TABLE_ID="art"
CLUSTER_NUM_NODES=1
CLUSTER_ZONE="us-central1-c" # You can choose a zone closer to you

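Cloud Shell comes with the tools you'll use in this codelab already installed: the gcloud command-line tool, the cbt command-line interface, and Maven.

Enable the Cloud Bigtable APIs by running this command: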

gcloud services enable bigtable.googleapis.com bigtableadmin.googleapis.com

Create an instance by running the following command:

gcloud bigtable instances create $INSTANCE_ID \
    --cluster=$CLUSTER_ID \
    --cluster-zone=$CLUSTER_ZONE \
    --cluster-num-nodes=$CLUSTER_NUM_NODES \
    --display-name=$INSTANCE_ID

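After you create the instance, populate the cbt configuration file, and then create a table and a column family by running the following commands: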

echo project = $GOOGLE_CLOUD_PROJECT > ~/.cbtrc
echo instance = $INSTANCE_ID >> ~/.cbtrc

cbt createtable $TABLE_ID
cbt createfamily $TABLE_ID cf

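3. Learn: Writing to Bigtable with Dataflow

Writing basics

When you write to Cloud Bigtable, you must provide a CloudBigtableTableConfiguration configuration object. This object specifies the project ID and instance ID for your table, as well as the name of the table itself: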

CloudBigtableTableConfiguration bigtableTableConfig =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId(PROJECT_ID)
        .withInstanceId(INSTANCE_ID)
        .withTableId(TABLE_ID)
        .build();

Your pipeline can then pass HBase Mutation objects, which can be either Put or Delete operations.

p.apply(Create.of("hello", "world"))
    .apply(
        ParDo.of(
            new DoFn<String, Mutation>() {
              @ProcessElement
              public void processElement(@Element String rowkey, OutputReceiver<Mutation> out) {
                long timestamp = System.currentTimeMillis();
                Put row = new Put(Bytes.toBytes(rowkey));

                row.addColumn(...);
                out.output(row);
              }
            }))
    .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));

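The LoadData Dataflow job

The next section shows how to run the LoadData job; the important parts of the pipeline are called out here first.

To generate data, you create a pipeline that uses the GenerateSequence class (which behaves much like a for loop) to write rows that each contain a few megabytes of random data. The row key is the sequence number, zero-padded and then reversed. For example, with a padding width of 10, 250 is padded to 0000000250 and reversed to 0520000000.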

LoadData.java

String numberFormat = "%0" + maxLength + "d";

p.apply(GenerateSequence.from(0).to(max))
    .apply(
        ParDo.of(
            new DoFn<Long, Mutation>() {
              @ProcessElement
              public void processElement(@Element Long rowkey, OutputReceiver<Mutation> out) {
                String paddedRowkey = String.format(numberFormat, rowkey);

                // Reverse the rowkey for more efficient writing
                String reversedRowkey = new StringBuilder(paddedRowkey).reverse().toString();
                Put row = new Put(Bytes.toBytes(reversedRowkey));

                // Generate random bytes
                byte[] b = new byte[(int) rowSize];
                new Random().nextBytes(b);

                long timestamp = System.currentTimeMillis();
                row.addColumn(Bytes.toBytes(COLUMN_FAMILY), Bytes.toBytes("C"), timestamp, b);
                out.output(row);
              }
            }))
    .apply(CloudBigtableIO.writeToTable(bigtableTableConfig));
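To see why the padded sequence number is reversed, the following standalone sketch prints the keys for a few consecutive numbers. It uses only the Java standard library. The class name RowKeyDemo, its method names, and the padding width of 10 are assumptions made for illustration and are not part of the sample repository.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mirrors the pad-then-reverse step shown in LoadData.java. */
class RowKeyDemo {

  /** Zero-pads the sequence number to the given width, then reverses the digits. */
  static String reversedKey(long sequenceNumber, int width) {
    String padded = String.format("%0" + width + "d", sequenceNumber);
    return new StringBuilder(padded).reverse().toString();
  }

  /** Builds the reversed keys for the sequence numbers in [from, to). */
  static List<String> reversedKeys(long from, long to, int width) {
    List<String> keys = new ArrayList<>();
    for (long i = from; i < to; i++) {
      keys.add(reversedKey(i, width));
    }
    return keys;
  }

  public static void main(String[] args) {
    // Consecutive sequence numbers differ in their last digit, so after
    // reversal they differ in their first character and sort far apart.
    for (String key : reversedKeys(248, 253, 10)) {
      System.out.println(key);
    }
  }
}
```

With a width of 10, 250 maps to 0520000000 and 251 maps to 1520000000, so writes that are adjacent in the sequence land in different parts of the lexicographically sorted key space.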

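4. Generate data into Bigtable and monitor the inflow

The following commands run a Dataflow job that generates 40 GB of data in your table, which is more than enough for Key Visualizer to activate:

Enable the Cloud Dataflow API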

gcloud services enable dataflow.googleapis.com

Get the code from GitHub and change into the directory

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
cd java-docs-samples/bigtable/beam/keyviz-art

Generate the data (the job takes around 15 minutes)

mvn compile exec:java -Dexec.mainClass=keyviz.LoadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT"

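Monitor the import

You can monitor the job in the Cloud Dataflow UI. You can also view the load on your Cloud Bigtable instance in its monitoring UI.

In the Dataflow UI, you can see the job graph and various job metrics, including elements processed, current vCPUs, and throughput.

Bigtable has standard monitoring tools that report read and write operations, storage used, error rate, and more at the instance, cluster, and table level. Beyond that, Bigtable also has Key Visualizer, which breaks down usage by row key and becomes available once the table holds at least 30 GB of data.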
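As a rough sanity check on these numbers, the sketch below estimates how many rows a load job has to write to reach a target size. It is a back-of-the-envelope calculation only: the class name LoadSizing, the decimal unit constants, and the 5 MB row size are assumptions for illustration (the text above says only that each row holds a few megabytes), and the real LoadData job may compute its row count differently.

```java
/** Illustrative only: back-of-the-envelope sizing for a load job. */
class LoadSizing {
  static final long BYTES_PER_MB = 1_000_000L;
  static final long BYTES_PER_GB = 1_000_000_000L;

  /** Rows needed to write totalGb of data at mbPerRow per row, rounded up. */
  static long rowsNeeded(long totalGb, long mbPerRow) {
    long totalBytes = totalGb * BYTES_PER_GB;
    long rowBytes = mbPerRow * BYTES_PER_MB;
    return (totalBytes + rowBytes - 1) / rowBytes;
  }

  /** True when the written volume reaches the Key Visualizer threshold. */
  static boolean reachesThreshold(long rows, long mbPerRow, long thresholdGb) {
    return rows * mbPerRow * BYTES_PER_MB >= thresholdGb * BYTES_PER_GB;
  }

  public static void main(String[] args) {
    long mbPerRow = 5; // assumed row size, not taken from the codelab
    long rows = rowsNeeded(40, mbPerRow);
    System.out.println("rows for 40 GB: " + rows);
    System.out.println("reaches 30 GB: " + reachesThreshold(rows, mbPerRow, 30));
  }
}
```

Under these assumptions, a 40 GB load is 8,000 rows, and the 30 GB threshold is crossed after 6,000 of them.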

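5. Learn: Reading from Bigtable with Dataflow

Reading basics

When you read from Cloud Bigtable, you must provide a CloudBigtableScanConfiguration configuration object. It is similar to CloudBigtableTableConfiguration, but it also lets you specify which rows to scan and read.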

Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setFilter(new FirstKeyOnlyFilter());

CloudBigtableScanConfiguration config =
    new CloudBigtableScanConfiguration.Builder()
        .withProjectId(options.getBigtableProjectId())
        .withInstanceId(options.getBigtableInstanceId())
        .withTableId(options.getBigtableTableId())
        .withScan(scan)
        .build();

Then use that configuration to start your pipeline:

p.apply(Read.from(CloudBigtableIO.read(config)))
    .apply(...

However, if you want to perform a read in the middle of your pipeline rather than at its start, you can pass a CloudBigtableTableConfiguration to a DoFn that extends AbstractCloudBigtableTableDoFn.

p.apply(GenerateSequence.from(0).to(10))
    .apply(ParDo.of(new ReadFromTableFn(bigtableTableConfig, options)));

Then call super() with your configuration and use getConnection() to obtain a distributed connection.

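public static class ReadFromTableFn extends AbstractCloudBigtableTableDoFn<Long, Void> {
    public ReadFromTableFn(CloudBigtableConfiguration config, ReadDataOptions readDataOptions) {
      super(config);
    }

    @ProcessElement
    public void processElement(PipelineOptions po) throws IOException {
        // Recover the job's own options so the table ID is available here.
        ReadDataOptions options = po.as(ReadDataOptions.class);
        // Only the first cell of each row is needed for this read.
        Scan scan = new Scan().setFilter(new FirstKeyOnlyFilter());
        Table table = getConnection().getTable(TableName.valueOf(options.getBigtableTableId()));
        ResultScanner imageData = table.getScanner(scan);
    }
}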

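The ReadData Dataflow job

For this codelab, you need to read from the table every second. To do that, you start the pipeline with a generated sequence, and each element triggers a set of range reads that depends on the elapsed time and on an input CSV file.

There is a bit of math involved in deciding which row ranges to scan at a given time. To learn more, click the filename to view the source code.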

ReadData.java

p.apply(GenerateSequence.from(0).withRate(1, new Duration(1000)))
    .apply(ParDo.of(new ReadFromTableFn(bigtableTableConfig, options)));

ReadData.java

  public static class ReadFromTableFn extends AbstractCloudBigtableTableDoFn<Long, Void> {

    List<List<Float>> imageData = new ArrayList<>();
    String[] keys;

    public ReadFromTableFn(CloudBigtableConfiguration config, ReadDataOptions readDataOptions) {
      super(config);
      keys = new String[Math.toIntExact(getNumRows(readDataOptions))];
      downloadImageData(readDataOptions.getFilePath());
      generateRowkeys(getNumRows(readDataOptions));
    }

    @ProcessElement
    public void processElement(PipelineOptions po) {
      // Determine which column will be drawn based on runtime of job.
      long timestampDiff = System.currentTimeMillis() - START_TIME;
      long minutes = (timestampDiff / 1000) / 60;
      int timeOffsetIndex = Math.toIntExact(minutes / KEY_VIZ_WINDOW_MINUTES);

      ReadDataOptions options = po.as(ReadDataOptions.class);
      long count = 0;

      List<RowRange> ranges = getRangesForTimeIndex(timeOffsetIndex, getNumRows(options));
      if (ranges.size() == 0) {
        return;
      }

      try {
        // Scan with a filter that will only return the first key from each row. This filter is used
        // to more efficiently perform row count operations.
        Filter rangeFilters = new MultiRowRangeFilter(ranges);
        FilterList firstKeyFilterWithRanges = new FilterList(
            rangeFilters,
            new FirstKeyOnlyFilter(),
            new KeyOnlyFilter());
        Scan scan =
            new Scan()
                .addFamily(Bytes.toBytes(COLUMN_FAMILY))
                .setFilter(firstKeyFilterWithRanges);

        Table table = getConnection().getTable(TableName.valueOf(options.getBigtableTableId()));
        ResultScanner imageData = table.getScanner(scan);
      } catch (Exception e) {
        System.out.println("Error reading.");
        e.printStackTrace();
      }
    }

    /**
     * Download the image data as a grid of weights and store them in a 2D array.
     */
    private void downloadImageData(String artUrl) {
    ...
    }

    /**
     * Generates an array with the rowkeys that were loaded into the specified Bigtable. This is
     * used to create the correct intervals for scanning equal sections of rowkeys. Since Bigtable
     * sorts keys lexicographically if we just used standard intervals, each section would have
     * different sizes.
     */
    private void generateRowkeys(long maxInput) {
    ...
    }

    /**
     * Get the ranges to scan for the given time index.
     */
    private List<RowRange> getRangesForTimeIndex(@Element Integer timeOffsetIndex, long maxInput) {
    ...
    }
  }
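The bodies of generateRowkeys and getRangesForTimeIndex are elided above, so the following is only a simplified, standalone sketch of the two ideas their comments describe: mapping elapsed time to an image column, and splitting lexicographically sorted row keys into sections with equal row counts. The class name ScanRangeSketch, the method names, and the 15-minute window are assumptions for illustration, and the sample's actual KEY_VIZ_WINDOW_MINUTES value and range logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a simplified take on the helpers elided above. */
class ScanRangeSketch {

  /** Assumed window length in minutes; the sample's constant may differ. */
  static final long WINDOW_MINUTES = 15;

  /** Maps elapsed job time to the index of the image column being drawn. */
  static int timeOffsetIndex(long elapsedMillis) {
    long minutes = (elapsedMillis / 1000) / 60;
    return Math.toIntExact(minutes / WINDOW_MINUTES);
  }

  /** Returns the reversed, zero-padded keys for 0..numRows-1 in lexicographic order. */
  static String[] sortedRowKeys(int numRows, int width) {
    String[] keys = new String[numRows];
    for (int i = 0; i < numRows; i++) {
      String padded = String.format("%0" + width + "d", i);
      keys[i] = new StringBuilder(padded).reverse().toString();
    }
    Arrays.sort(keys);
    return keys;
  }

  /**
   * Splits the sorted keys into slices of near-equal row count and returns
   * each slice as a {startKey, endKey} pair, both ends inclusive.
   */
  static List<String[]> equalSections(String[] sortedKeys, int sections) {
    List<String[]> ranges = new ArrayList<>();
    for (int s = 0; s < sections; s++) {
      int start = (int) ((long) s * sortedKeys.length / sections);
      int end = (int) ((long) (s + 1) * sortedKeys.length / sections) - 1;
      if (end >= start) {
        ranges.add(new String[] {sortedKeys[start], sortedKeys[end]});
      }
    }
    return ranges;
  }

  public static void main(String[] args) {
    String[] keys = sortedRowKeys(100, 3);
    for (String[] range : equalSections(keys, 4)) {
      System.out.println(range[0] + " .. " + range[1]);
    }
    System.out.println("column after 40 minutes: " + timeOffsetIndex(40L * 60 * 1000));
  }
}
```

Indexing into the sorted key array, rather than dividing the numeric range evenly, is what keeps each section the same number of rows even though the reversed keys are not evenly spaced numerically.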

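6. Creating your masterpiece

Now that you understand how to load data into Bigtable and read from it with Dataflow, you can run the final command, which generates an image of the Mona Lisa over 8 hours.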

mvn compile exec:java -Dexec.mainClass=keyviz.ReadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT"

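There is a bucket of existing images that you can use. Alternatively, you can create an input file from any image of your own with this tool and then upload it to a public GCS bucket.

File names follow the pattern gs://keyviz-art/[painting]_[hours]h.txt. Example: gs://keyviz-art/american_gothic_4h.txt

Painting options: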

american_gothic
mona_lisa
pearl_earring
persistence_of_memory
starry_night
sunday_afternoon
the_scream

Hour options: 1, 4, 8, 12, 24, 48, 72, 96, 120, 144
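The naming pattern and the options listed above can be combined mechanically. The following sketch builds a path and rejects combinations that are not in the lists. The class name ArtFilePath and its method are hypothetical helpers for illustration and are not part of the sample code.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds an input file path from the options listed above. */
class ArtFilePath {
  static final List<String> PAINTINGS = Arrays.asList(
      "american_gothic", "mona_lisa", "pearl_earring", "persistence_of_memory",
      "starry_night", "sunday_afternoon", "the_scream");
  static final List<Integer> HOURS = Arrays.asList(1, 4, 8, 12, 24, 48, 72, 96, 120, 144);

  /** Returns gs://keyviz-art/[painting]_[hours]h.txt for a listed painting and duration. */
  static String filePath(String painting, int hours) {
    if (!PAINTINGS.contains(painting)) {
      throw new IllegalArgumentException("Unknown painting: " + painting);
    }
    if (!HOURS.contains(hours)) {
      throw new IllegalArgumentException("Unsupported duration in hours: " + hours);
    }
    return "gs://keyviz-art/" + painting + "_" + hours + "h.txt";
  }

  public static void main(String[] args) {
    System.out.println(filePath("american_gothic", 4));
  }
}
```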

To make your own GCS bucket or file public, grant the "Storage Object Viewer" role to allUsers.

Once you have picked an image, change the --filePath parameter in this command:

mvn compile exec:java -Dexec.mainClass=keyviz.ReadData \
"-Dexec.args=--bigtableProjectId=$BIGTABLE_PROJECT \
--bigtableInstanceId=$INSTANCE_ID --runner=dataflow \
--bigtableTableId=$TABLE_ID --project=$GOOGLE_CLOUD_PROJECT \
--filePath=gs://keyviz-art/american_gothic_4h.txt"

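7. Check back later

The full image may take a few hours to appear, but after about 30 minutes you should start to see activity in Key Visualizer. There are several parameters you can play with: zoom, brightness, and metric. To zoom, use the scroll wheel on your mouse or drag a rectangle on the Key Visualizer grid.

Brightness changes the scaling of the image, which helps when you want a closer look at a very hot area.

You can also change which metric is displayed. The options include Ops, Read bytes client, and Write bytes client, to name a few. "Read bytes client" seems to produce smooth images, while "Ops" produces images with more lines, which can look really cool on some images.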

8. Finish up

Clean up to avoid charges

To avoid incurring charges to your Google Cloud Platform account for the resources used in this codelab, delete your instance.

gcloud bigtable instances delete $INSTANCE_ID

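What we've covered

Writing to Bigtable with Dataflow
Reading from Bigtable with Dataflow (both at the start of a pipeline and in the middle of one)
Using the Dataflow monitoring tools
Using the Bigtable monitoring tools, including Key Visualizer

Next steps

Read more about how the Key Visualizer art was created.
Learn more about Cloud Bigtable in the documentation.
Try out other Google Cloud Platform features for yourself. Have a look at our tutorials.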
