About this codelab
1. Introduction
Imagine being able to get your data ready for analysis faster and more efficiently without needing to be a coding expert. With BigQuery Data Preparation, that's a reality. This powerful feature simplifies data ingestion, transformation, and cleansing, putting data preparation into the hands of all data practitioners in your organization.
Ready to unlock the secrets hidden in your product data?
Prerequisites
- A basic understanding of the Google Cloud, Console
- A basic understanding of SQL
What you'll learn
- How BigQuery data preparation can clean and transform your raw data into actionable business intelligence, using a realistic example from the Fashion and Beauty industry.
- How to run and schedule the data preparation for your cleaned data
What you'll need
- A Google Cloud Account and Google Cloud Project
- A web browser such as Chrome
2. Basic setup and Requirements
Self-paced environment setup
- Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
- The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
- The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as
PROJECT_ID
). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project. - For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
- Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.
3. Before you begin
Enable API
To use Gemini in BigQuery, you must enable the Gemini for Google Cloud API. A service administrator or project owner with the serviceusage.services.enable
IAM permission typically performs this step.
- To enable the Gemini for Google Cloud API, go to the Gemini for Google Cloud page in the Google Cloud Marketplace. Go to Gemini for Google Cloud
- In the project selector, select a project.
- Click Enable. The page updates and shows a status of Enabled. Gemini in BigQuery is now available in the selected Google Cloud project to all users who have the required IAM permissions.
Set up roles and permissions to develop data preparations
- In IAM and Admin, select IAM
- Select your user and click on the pencil icon to ‘Edit principal'
To use BigQuery Data Preparation, you will need the following roles and permissions:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
4. Finding and Subscribing to the "bq data preparation demo" Listing in BigQuery Analytics Hub
We'll be using the bq data preparation demo
dataset for this tutorial. It's a linked dataset in BigQuery Analytics Hub that we will be reading from.
Data preparation never writes back into the source and we'll ask you to define a destination table to write into. The table we'll be working with for this exercise only has 1,000 rows to keep costs minimal but data preparation runs on BigQuery and will scale alongside.
Follow these steps to find and subscribe to the linked dataset:
- Access Analytics Hub: In the Google Cloud Console, navigate to BigQuery.
- In the BigQuery navigation menu, under "Governance," select "Analytics Hub."
- Search for the Listing: In the Analytics Hub UI, click Search Listings."
- Type
bq data preparation demo
into the search bar and press Enter.
- Subscribe to the Listing: Select the
bq data preparation demo
listing from the search results. - On the listing details page, click the Subscribe button.
- Review any confirmation dialogs and update the project/dataset if needed. The defaults should be correct.
- Access the Dataset in BigQuery: Once you've successfully subscribed, the datasets within the listing will be linked to your BigQuery project.
Return to BigQuery Studio.
5. Explore the data and launch data preparation
- Locate the Dataset and Table: In the Explorer panel, select your project and then find the dataset that was included in the
bq data preparation demo
listing. Select thestg_product
table. - Open in Data Preparation: Click the three vertical dots next to the table name and select
Open in Data Preparation
.
This will open the table in the Data Preparation interface, ready for you to begin transforming your data.
As you can see in the data preview below, we have some data challenges we'll tackle, these include:
- The price column contains both the amount and currency, making analysis difficult.
- The product column mixes the product name and category (separated by a pipe symbol |).
Right away, Gemini analyzes your data and suggests several transformations. In this example, we see a number of recommendations. In the next steps we will apply the ones we need.
6. Handling the Price Column
Let's tackle the Price column. As we've seen, it contains both the currency and the amount. Our goal is to separate these into two distinct columns:Currency and Amount.
Gemini has identified several recommendations for the Price column.
- Find a recommendation that says something similar to:
Description: "This expression removes the leading ‘USD ' from the specified field"
REGEXP_REPLACE(Price,` `r'^USD\s',` `r'')
- Select Preview
- Select Apply
Next for the Price column, let's convert data type from STRING toNUMERIC.
- Find a recommendation that says something similar to:
Description: "Converts column Price from type string to float64"
SAFE_CAST(Price AS float64)
- Select Apply.
You should now see three applied steps in your step list.
7. Handling the Product Column
The product column contains both the product name and category, separated by a pipe (|).
While we could use natural language again, let's explore another of Gemini's powerful features.
Clean up the product name
- Select the category portion of a product entry including the
|
character and delete it.
Gemini will intelligently recognize this pattern and suggest a transformation to apply to the entire column.
- Select "Edit".
Gemini's recommendation is spot on: it removes everything after the ‘|'character, effectively isolating the product name.
But this time we don't want to overwrite our original data.
- In the target column dropdown, select "Create new column".
- Set the name to ProductName.
- Preview the changes to make sure everything looks good.
- Apply the Transformation.
Extract the product category
Using natural language, we'll instruct Gemini to extract the word after the pipe (|) in the Product column. This extracted value will be overwritten into the existing column called Product.
- Click on
Add Step
to add a new transformation step.
- Select
Transformation
from the dropdown menu - In the natural language prompt field, enter "extract the word after the pipe (|) in the Product column." then hit return to generate the SQL.
- Leave the Target Column as "Product".
- Click Apply.
The transformation should give you the following results.
8. Joining to enrich the data
Often, you'll want to enrich your data with information from other sources. In our example, we'll join our product data with extended product attributes, stg_extended_product
, from a third-party table. This table includes details like brand and launch date.
- Click on
Add Step
- Select
Join
- Browse to the
stg_extended_product
table.
Gemini in BigQuery automatically picked the productid join key for us and qualified the left and right hand sides since the key name is identical.
Note: Ensure that the description field reads ‘Join by productid'. If it includes additional join keys, overwrite the description field to ‘Join by productid' and select the generate button in the description field to re-generate the join expression with the following condition L.
productid
= R.
productid
.
- Optionally, select Preview to preview the results.
- Click
Apply
.
Cleaning up the extended attributes
While the join was successful, the extended attributes data requires some cleaning. The LaunchDate
column has inconsistent date formats, and the Brand
column contains some missing values.
We'll begin by addressing the LaunchDate
column.
Before creating any transformations, check Gemini's recommendations.
- Click on the
LaunchDate
column name. You should see some recommendations generated similar to the ones in the image below.
- If you see a recommendation with the following SQL then Apply the recommendation and skip the next steps.
COALESCE(SAFE.PARSE_DATE('%Y-%m-%d',
LaunchDate),SAFE.PARSE_DATE('%Y/%m/%d', LaunchDate))
- If you don't see a recommendation matching the SQL above, click
Add Step
. - Select
Transformation
. - In the SQL field, enter the following:
COALESCE(SAFE.PARSE_DATE('%Y-%m-%d',
LaunchDate),SAFE.PARSE_DATE('%Y/%m/%d', LaunchDate))
- Set the
Target Columns
toLaunchDate
. - Click
Apply
.
The LaunchDate column now has a consistent date format.
9. Adding a destination table
Our dataset is now clean and ready for loading into a dimension table in our data warehouse.
- Click
ADD STEP
. - Select
Destination
. - Fill in the required parameters: Dataset:
bq_data_preparation_demo
Table:DimProduct
- Click
Save
.
We've now worked with the "Data" and "Schema" tabs. In addition to these, BigQuery Data Preparation provides a "Graph" view that visually displays the sequence of transformation steps in your pipeline.
10. Bonus A: Handling Manufacturer column and creating Error Table
We've also identified empty values in the Manufacturer
column. For these records, we want to implement a data quality check and move them to an error table for further review.
Create an error table
- Click the
More
button next to thestg_product data preparation
title. - Under the
Setting
section, selectError Table
. - Check the
Enable error table
box, configure the settings as follows:
- Dataset: Select
bq_data_preparation_demo
- Table: Enter
err_dataprep
- Under
Define duration for keeping errors
, select30 days (default)
- Click
Save
.
Set up validation on the Manufacturer column
- Select the Manufacturer column.
- Gemini will have likely identified a relevant transformation. Find the recommendation that is keeping only rows where the Manufacturer field is not empty. It will have SQL similar to:
Manufacturer IS NOT NULL
2.Click the "Edit" button on this recommendation to review it.
- Check the "Failed validation rows go to error table" option if it is not checked
- Click
Apply
.
At any point, you can review, modify, or delete the transformations you've applied by clicking on the "Applied steps" button.
Clean up the redundant ProductID_1 column
The ProductID_1 column, which duplicates the ProductID from our joined table, can now be deleted.
- Go to the
Schema
tab - Click the 3 dots next to the
ProductID_1
column. - Click on
Drop
.
We're now ready to run the data preparation job and validate our entire pipeline. Once we're satisfied with the results, we can schedule the job to run automatically.
- Before navigating away from data preparation view, Save yourpreparations. Next to the
stg_product data preparation
title, you should see aSave
button. Click the button to save.
11. Clean up environment
- Delete the
stg_product data preparation
- Delete the
bq data preparation demo
dataset
12. Congratulations
Congratulations for completing the codelab.
What we've covered
- Setting up data preparation
- Opening tables and navigating data preparation
- Splitting columns with numerical and unit descriptor data
- Standardizing date formats
- Running data preparations