Migration from Cassandra to Bigtable with a Dual-Write Proxy


About this codelab

Last updated Apr 15, 2025
Written by Louis Cheynel

1. Introduction

Bigtable is a fully managed, high-performance NoSQL database service designed for large analytical and operational workloads. Migrating from existing databases like Apache Cassandra to Bigtable often requires careful planning to minimize downtime and application impact.

This codelab demonstrates a migration strategy from Cassandra to Bigtable using a combination of proxy tools:

  1. Cassandra-Bigtable Proxy: Allows Cassandra clients and tools (like cqlsh or drivers) to interact with Bigtable using the Cassandra Query Language (CQL) protocol by translating queries.
  2. Datastax Zero Downtime Migration (ZDM) Proxy: An open-source proxy that sits between your application and your database services (origin Cassandra and target Bigtable via the Cassandra-Bigtable Proxy). It orchestrates dual writes and manages traffic routing, enabling migration with minimal application changes and downtime.
  3. Cassandra Data Migrator (CDM): An open-source tool used for bulk migrating historical data from the source Cassandra cluster to the target Bigtable instance.

What you'll learn

  • How to set up a basic Cassandra cluster on Compute Engine.
  • How to create a Bigtable instance.
  • How to deploy and configure the Cassandra-Bigtable Proxy to map a Cassandra schema to Bigtable.
  • How to deploy and configure the Datastax ZDM Proxy for dual writes.
  • How to use the Cassandra Data Migrator tool to bulk-migrate existing data.
  • The overall workflow for a proxy-based Cassandra-to-Bigtable migration.

What you'll need

  • A Google Cloud project with billing enabled. New users are eligible for a free trial.
  • Basic familiarity with Google Cloud concepts like projects, Compute Engine, VPC networks, and firewall rules. Basic familiarity with Linux command-line tools.
  • Access to a machine with the gcloud CLI installed and configured, or use the Google Cloud Shell.

For this codelab, we will primarily use virtual machines (VMs) on Compute Engine within the same VPC network and region to simplify networking. Using internal IP addresses is recommended.

2. Set up your environment

1. Select or create a Google Cloud Project

Navigate to the Google Cloud Console and select an existing project or create a new one. Note your Project ID.
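If you are unsure of the ID, you can list the projects visible to your account from the gcloud CLI:

gcloud projects list --format="value(projectId)"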

2. Enable required APIs

Ensure the Compute Engine API and Bigtable API are enabled for your project.

gcloud services enable compute.googleapis.com bigtable.googleapis.com bigtableadmin.googleapis.com --project=<your-project-id>

Replace <your-project-id> with your actual project ID.

3. Choose a region and zone

Select a region and zone for your resources. We'll use us-central1 and us-central1-c as examples. Define these as environment variables for convenience:

export PROJECT_ID="<your-project-id>"
export REGION="us-central1"
export ZONE="us-central1-c"

gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
gcloud config set compute/zone $ZONE

4. Configure firewall rules

We need to allow communication between our VMs within the default VPC network on several ports:

  • Cassandra/Proxies CQL Port: 9042
  • ZDM Proxy Health Check Port: 14001
  • SSH: 22

Create a firewall rule to allow internal traffic on these ports. We'll use a tag cassandra-migration to easily apply this rule to relevant VMs.

gcloud compute firewall-rules create allow-migration-internal \
--network=default \
--action=ALLOW \
--rules=tcp:22,tcp:9042,tcp:14001 \
--source-ranges=10.128.0.0/9 \
--target-tags=cassandra-migration

Adjust --source-ranges if you are using a custom VPC or IP range.
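You can optionally confirm the rule was created:

gcloud compute firewall-rules describe allow-migration-internal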

3. Deploy Cassandra cluster (Origin)

For this codelab, we'll set up a simple single-node Cassandra cluster on Compute Engine. In a real-world scenario, you would connect to your existing cluster.

1. Create a GCE VM for Cassandra

gcloud compute instances create cassandra-origin \
--machine-type=e2-medium \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB

2. Install Cassandra

# Install Java (Cassandra dependency)
sudo apt-get update
sudo apt-get install -y openjdk-11-jre-headless

# Add Cassandra repository
echo "deb [https://debian.cassandra.apache.org](https://debian.cassandra.apache.org) 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
curl [https://downloads.apache.org/cassandra/KEYS](https://downloads.apache.org/cassandra/KEYS) | sudo apt-key add -

# Install Cassandra
sudo apt-get update
sudo apt-get install -y cassandra
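Cassandra can take a minute or two to start. Before continuing, you can confirm the service is running and the single node reports UN (Up/Normal):

# Check that the Cassandra service is running
sudo systemctl status cassandra --no-pager

# Check cluster status
nodetool status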

3. Create a keyspace and table

We'll use an employee table example and create a keyspace called "zdmbigtable".

# Cassandra installed from the package runs as a service and cqlsh is on the PATH
cqlsh localhost 9042   # starts the CQL shell

Inside cqlsh:

-- Create keyspace (adjust replication for production)
CREATE KEYSPACE zdmbigtable WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

-- Use the keyspace
USE zdmbigtable;

-- Create the employee table
CREATE TABLE employee (
    name text PRIMARY KEY,
    age bigint,
    code int,
    credited double,
    balance float,
    is_active boolean,
    birth_date timestamp
);

-- Exit cqlsh
EXIT;

Leave the SSH session open or note the IP address of this VM (hostname -I).

4. Set up Bigtable (Target)


Create a Bigtable instance. We'll use zdmbigtable as the instance ID.

gcloud bigtable instances create zdmbigtable \
--display-name="ZDM Bigtable Target" \
--cluster=bigtable-c1 \
--cluster-zone=$ZONE \
--cluster-num-nodes=1  # Use 1 node for dev/testing; scale as needed
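To confirm the instance was created, you can list the Bigtable instances in your project:

gcloud bigtable instances list --project=$PROJECT_ID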

The Bigtable table itself will be created later by the Cassandra-Bigtable Proxy setup script.

5. Set up Cassandra-Bigtable Proxy

1. Create Compute Engine VM for Cassandra-Bigtable Proxy

gcloud compute instances create bigtable-proxy-vm \ 
--machine-type=e2-medium \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB

SSH into the bigtable-proxy-vm:

gcloud compute ssh bigtable-proxy-vm

Inside the VM:

# Install Git and Go
sudo apt-get update
sudo apt-get install -y git golang-go

# Clone the proxy repository
# Replace with the actual repository URL if different
git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-ecosystem.git
# Change into the proxy directory inside the cloned repository
# (verify the exact path; it may differ depending on the repository layout)
cd cloud-bigtable-ecosystem/cassandra-to-bigtable-proxy/

# Set Go environment variables
export GOPATH=$HOME/go
export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin

2. Configure the proxy

nano config.yaml

Update the following variables. For more advanced configuration, see the example configuration provided on GitHub.

cassandraToBigtableConfigs:
  # Global default GCP Project ID
  projectId: <your-project-id>

listeners:
- name: cluster1
  port: 9042
  bigtable:
    # To use multiple instances, pass the instance names comma-separated.
    # Instance names should not contain special characters except underscore (_).
    instanceIds: zdmbigtable

    # Number of gRPC channels to be used for the Bigtable session.
    Session:
      grpcChannels: 4

otel:
  # Set enabled to true or false for OTEL metrics/traces/logs.
  enabled: False

  # Name of the collector service to be set up as a sidecar
  serviceName: cassandra-to-bigtable-otel-service

  healthcheck:
    # Enable the health check in this proxy application config only if the
    # "health_check" extension is added to the OTEL collector service configuration.
    #
    # Recommendation:
    # Enable the OTEL health check if you need to verify the collector's availability
    # at the start of the application. For development or testing environments, it can
    # be safely disabled to reduce complexity.

    # Enable/Disable Health Check for OTEL. Default 'False'.
    enabled: False
    # Health check endpoint for the OTEL collector service
    endpoint: localhost:13133
  metrics:
    # Collector service endpoint
    endpoint: localhost:4317

  traces:
    # Collector service endpoint
    endpoint: localhost:4317
    # Sampling ratio should be between 0 and 1. For example, 0.05 means a 5% sampling ratio.
    samplingRatio: 1

loggerConfig:
  # Specifies the type of output. Value of `outputType` should be `file` for file output
  # or `stdout` for standard output. Default value is `stdout`.
  outputType: stdout
  # Set this only if the outputType is set to `file`.
  # The path and name of the log file where logs will be stored. For example, output.log. Required Key.
  # Default `/var/log/cassandra-to-spanner-proxy/output.log`.
  fileName: output/output.log
  # Set this only if the outputType is set to `file`.
  # The maximum size of the log file in megabytes before it is rotated. For example, 500 for 500 MB.
  maxSize: 10
  # Set this only if the outputType is set to `file`.
  # The maximum number of backup log files to keep. Once this limit is reached, the oldest log file will be deleted.
  maxBackups: 2
  # Set this only if the outputType is set to `file`.
  # The maximum age in days for a log file to be retained. Logs older than this will be deleted. Required Key.
  # Default 3 days.
  maxAge: 1

  # Set this only if the outputType is set to `file`.
  # Default value is 'False'. Change the value to 'True' if log files should be compressed.
  compress: True

Save and close the file (ctrl+X, then Y, then Enter in nano).

3. Start the Cassandra-Bigtable Proxy

Start the proxy server.

# At the root of the cassandra-to-bigtable-proxy directory
go run proxy.go

The proxy will start and listen on port 9042 for incoming CQL connections. Keep this terminal session running. Note the IP address of this VM (hostname -I).
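If you prefer not to keep an interactive session open, one option (a simple sketch, not part of the proxy's own tooling) is to run the proxy in the background and capture its output to a log file:

# Run the proxy in the background and write output to proxy.log
nohup go run proxy.go > proxy.log 2>&1 &

# Follow the log to confirm the proxy is listening on port 9042
tail -f proxy.log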

4. Create Table via CQL

Connect cqlsh to the Cassandra-Bigtable Proxy VM's IP address.
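For example, from the Cassandra VM or another VM on the network (the placeholder is the Cassandra-Bigtable Proxy VM's internal IP you noted earlier):

cqlsh <bigtable-proxy-vm-internal-ip> 9042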

In cqlsh, run the following command:

-- Create the employee table
CREATE TABLE zdmbigtable.employee (
    name text PRIMARY KEY,
    age bigint,
    code int,
    credited double,
    balance float,
    is_active boolean,
    birth_date timestamp
);

Verify in the Google Cloud Console that the employee table and metadata table exist in your Bigtable instance.
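If you have the cbt CLI available (it is installed in a later step on the CDM VM and is also available in Cloud Shell), you can list the tables in the instance directly:

cbt -project $PROJECT_ID -instance zdmbigtable ls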

6. Set up the ZDM Proxy

The ZDM Proxy requires at least two machines: one or more proxy nodes that handle the traffic, and a "Jumphost" used for deployment and orchestration via Ansible.

1. Create Compute Engine VMs for the ZDM Proxy

We need two VMs: zdm-jumphost and zdm-proxy-node-1.

# Jumphost VM 
gcloud compute instances create zdm-jumphost \
--machine-type=e2-medium \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB

# Proxy Node VM 
gcloud compute instances create zdm-proxy-node-1 \
--machine-type=e2-standard-8 \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=20GB

Note the IP addresses of both VMs.
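You can look up the internal IPs of all VMs created so far (they all carry the cassandra-migration tag):

gcloud compute instances list \
  --filter="tags.items=cassandra-migration" \
  --format="table(name,networkInterfaces[0].networkIP)"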

2. Prepare the jumphost

SSH into the zdm-jumphost:

gcloud compute ssh zdm-jumphost

# Install Git and Ansible
sudo apt-get update
sudo apt-get install -y git ansible

Inside the jumphost:

git clone https://github.com/datastax/zdm-proxy-automation.git

cd zdm-proxy-automation/ansible/

Edit the main configuration file vars/zdm_proxy_cluster_config.yml:

Update the origin_contact_points and target_contact_points with the internal IP addresses of your Cassandra VM and Cassandra-Bigtable Proxy VM, respectively. Leave authentication commented out as we didn't set it up.

##############################
#### ORIGIN CONFIGURATION ####
##############################
## Origin credentials (leave commented if no auth)
# origin_username: ...
# origin_password: ...

## Set the following two parameters only if Origin is a self-managed, non-Astra cluster
origin_contact_points: <Your-Cassandra-VM-Internal-IP> # Replace!
origin_port: 9042

##############################
#### TARGET CONFIGURATION ####
##############################
## Target credentials (leave commented if no auth)
# target_username: ...
# target_password: ...

## Set the following two parameters only if Target is a self-managed, non-Astra cluster
target_contact_points: <Your-Bigtable-Proxy-VM-Internal-IP> # Replace!
target_port: 9042

# --- Other ZDM Proxy settings can be configured below ---
# ... (keep defaults for this codelab)

Save and close this file.

3. Deploy the ZDM Proxy using Ansible

Run the Ansible playbook from the ansible directory on the jumphost:

ansible-playbook deploy_zdm_proxy.yml -i zdm_ansible_inventory

This command will install necessary software (like Docker) on the proxy node (zdm-proxy-node-1), pull the ZDM Proxy Docker image, and start the proxy container with the configuration you provided.
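To confirm the proxy container is up, you can SSH to the proxy node and check Docker (this assumes the playbook deployed the proxy as a Docker container, as described above):

gcloud compute ssh zdm-proxy-node-1 --command "sudo docker ps"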

4. Verify ZDM Proxy health

Check the readiness endpoint of the ZDM proxy running on zdm-proxy-node-1 (port 14001) from the jumphost:

# Replace <zdm-proxy-node-1-internal-ip> with the actual internal IP.
curl -G http://<zdm-proxy-node-1-internal-ip>:14001/health/readiness

You should see output similar to this, indicating both Origin (Cassandra) and Target (Cassandra-Bigtable Proxy) are UP:

{
  "OriginStatus": {
    "Addr": "<Your-Cassandra-VM-Internal-IP>:9042",
    "CurrentFailureCount": 0,
    "FailureCountThreshold": 1,
    "Status": "UP"
  },
  "TargetStatus": {
    "Addr": "<Your-Bigtable-Proxy-VM-Internal-IP>:9042",
    "CurrentFailureCount": 0,
    "FailureCountThreshold": 1,
    "Status": "UP"
  },
  "Status": "UP"
}

7. Configure application & start dual writes


At this stage in a real migration, you would reconfigure your application(s) to point to the ZDM Proxy node's IP address (e.g., <zdm-proxy-node-1-ip>:9042) instead of connecting directly to Cassandra.

Once the application connects to the ZDM Proxy: Reads are served from the Origin (Cassandra) by default. Writes are sent to both the Origin (Cassandra) and the Target (Bigtable, via the Cassandra-Bigtable Proxy). This enables your application to continue functioning normally while ensuring new data is written to both databases simultaneously. You can test the connection using cqlsh pointed at the ZDM Proxy from the jumphost or another VM in the network:

cqlsh <zdm-proxy-node-1-ip-address> 9042

Try inserting some data:

INSERT INTO zdmbigtable.employee (name, age, is_active) VALUES ('Alice', 30, true);
SELECT * FROM zdmbigtable.employee WHERE name = 'Alice';

This data should be written to both Cassandra and Bigtable. You can confirm this in Bigtable by going to the Google Cloud Console and opening the Bigtable Query Editor for your instance. Run a "SELECT * FROM employee" query, and the recently inserted data should be visible.
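You can also confirm that the same row landed in the origin Cassandra cluster by querying it directly, bypassing the proxies (the placeholder is your Cassandra VM's internal IP):

cqlsh <cassandra-origin-internal-ip> 9042 -e "SELECT * FROM zdmbigtable.employee WHERE name = 'Alice';"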

8. Migrate historical data using Cassandra Data Migrator

Now that dual writes are active for new data, use the Cassandra Data Migrator (CDM) tool to copy the existing historical data from Cassandra to Bigtable.

1. Create Compute Engine VM for CDM

This VM needs sufficient memory for Spark.

gcloud compute instances create cdm-migrator-vm \
--machine-type=e2-medium \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--tags=cassandra-migration \
--boot-disk-size=40GB

2. Install prerequisites (Java 11, Spark)

SSH into the cdm-migrator-vm:

gcloud compute ssh cdm-migrator-vm

Inside the VM:

# Install Java 11 
sudo apt-get update
sudo apt-get install -y openjdk-11-jdk
 
# Verify Java installation
java -version

# Download and extract Spark (using version 3.5.3)
# Check the Apache Spark archives for the correct URL if needed

wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.3-bin-hadoop3-scala2.13.tgz
 
export SPARK_HOME=$PWD/spark-3.5.3-bin-hadoop3-scala2.13
export PATH=$PATH:$SPARK_HOME/bin
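Verify that Spark is available on your PATH:

spark-submit --version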

3. Download Cassandra Data Migrator

Download the CDM tool jar file. Check the Cassandra Data Migrator GitHub Release page for the correct URL of the desired version.

# Example using version 5.2.2 - replace URL if needed
wget https://github.com/datastax/cassandra-data-migrator/releases/download/v5.2.2/cassandra-data-migrator-5.2.2.jar

4. Configure CDM

Create a properties file named cdm.properties

nano cdm.properties

Paste the following configuration, replacing the IP addresses and disabling TTL/Writetime features as they are not directly supported by Bigtable in the same way. Leave auth commented out.

# Origin Cassandra connection
spark.cdm.connect.origin.host <Your-Cassandra-VM-IP-Address> # Replace!
spark.cdm.connect.origin.port 9042
spark.cdm.connect.origin.username cassandra # Leave default, or set if auth is enabled
spark.cdm.connect.origin.password cassandra # Leave default, or set if auth is enabled

# Target Bigtable (via Cassandra-Bigtable Proxy) connection
spark.cdm.connect.target.host <Your-Bigtable-Proxy-VM-IP-Address> # Replace!
spark.cdm.connect.target.port 9042
spark.cdm.connect.target.username cassandra # Leave default, or set if auth is enabled
spark.cdm.connect.target.password cassandra # Leave default, or set if auth is enabled

# Keyspace and table to migrate
spark.cdm.schema.origin.keyspaceTable zdmbigtable.employee

# Disable TTL/Writetime features (important for Bigtable compatibility via the proxy)
spark.cdm.feature.origin.ttl.automatic false
spark.cdm.feature.origin.writetime.automatic false
spark.cdm.feature.target.ttl.automatic false
spark.cdm.feature.target.writetime.automatic false

Save and close the file.

5. Run the migration job

Execute the migration using spark-submit. This command tells Spark to run the CDM jar with your properties file, which specifies the keyspace and table to migrate. Adjust memory settings (--driver-memory, --executor-memory) based on your VM size and data volume.

Make sure you are in the directory containing the CDM jar and properties file. Replace cassandra-data-migrator-5.2.2.jar if you downloaded a different version.

./spark-3.5.3-bin-hadoop3-scala2.13/bin/spark-submit \
  --properties-file cdm.properties \
  --master "local[*]" \
  --driver-memory 4G \
  --executor-memory 4G \
  --class com.datastax.cdm.job.Migrate \
  cassandra-data-migrator-5.2.2.jar &> cdm_migration_$(date +%Y%m%d_%H%M).log

The migration will run in the background, and logs will be written to cdm_migration_... .log. Monitor the log file for progress and any errors:

tail -f cdm_migration_*.log
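When the job finishes, you can scan the log for failures before moving on (a simple grep sketch; the exact log wording depends on the CDM version):

grep -iE "error|exception|fail" cdm_migration_*.log | tail -n 20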

6. Verify data migration

Once the CDM job completes successfully, verify that the historical data exists in Bigtable. Because the Cassandra-Bigtable Proxy accepts CQL reads, you can query the data with cqlsh, either through the ZDM Proxy (which can be configured to route reads to the Target) or directly against the Cassandra-Bigtable Proxy. Connect via the ZDM Proxy:

cqlsh <zdm-proxy-node-1-ip-address> 9042

Inside cqlsh:

SELECT COUNT(*) FROM zdmbigtable.employee; -- Check row count matches origin
SELECT * FROM zdmbigtable.employee LIMIT 10; -- Check some sample data

Alternatively, use the cbt tool (if installed on the CDM VM or Cloud Shell) to look up data directly in Bigtable:

# First, install cbt if needed
# gcloud components update
# gcloud components install cbt

# Then lookup a specific row (replace 'some_employee_name' with an actual primary key)
cbt -project $PROJECT_ID -instance zdmbigtable lookup employee some_employee_name
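You can also compare row counts directly (cbt count scans the whole table, so use it only on small test datasets like this one):

cbt -project $PROJECT_ID -instance zdmbigtable count employee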

9. Cutover (conceptual)

After thoroughly verifying data consistency between Cassandra and Bigtable, you can proceed with the final cutover.

With the ZDM Proxy, the cutover involves reconfiguring it to primarily read from the target (Bigtable) instead of the Origin (Cassandra). This is typically done via ZDM Proxy's configuration, effectively shifting your application's read traffic to Bigtable.

Once you are confident that Bigtable is serving all traffic correctly, you can eventually:

  • Stop dual writes by reconfiguring the ZDM Proxy.
  • Decommission the original Cassandra cluster.
  • Remove the ZDM Proxy and have the application connect directly to the Cassandra-Bigtable Proxy or use the native Bigtable CQL Client for Java.

The specifics of ZDM Proxy reconfiguration for cutover are beyond this basic codelab but are detailed in the Datastax ZDM documentation.

10. Clean up

To avoid incurring charges, delete the resources created during this codelab.

1. Delete Compute Engine VMs

gcloud compute instances delete cassandra-origin zdm-jumphost zdm-proxy-node-1 bigtable-proxy-vm cdm-migrator-vm --zone=$ZONE --quiet

2. Delete Bigtable instance

gcloud bigtable instances delete zdmbigtable

3. Delete Firewall rules

gcloud compute firewall-rules delete allow-migration-internal

4. Delete Cassandra database (if installed locally or persisted)

If you installed Cassandra outside of a Compute Engine VM created here, follow appropriate steps to remove the data or uninstall Cassandra.

11. Congratulations!

You have successfully walked through the process of setting up a proxy-based migration path from Apache Cassandra to Bigtable!

You learned how to:

  • Deploy Cassandra and Bigtable.
  • Configure the Cassandra-Bigtable Proxy for CQL compatibility.
  • Deploy the Datastax ZDM Proxy to manage dual writes and traffic.
  • Use the Cassandra Data Migrator to move historical data.

This approach allows for migrations with minimal downtime and no code changes by leveraging the proxy layer.

Next steps

  • Explore Bigtable Documentation
  • Consult the Datastax ZDM Proxy documentation for advanced configurations and cutover procedures.
  • Review the Cassandra-Bigtable Proxy repository for more details.
  • Check the Cassandra Data Migrator repository for advanced usage.
  • Try other Google Cloud Codelabs