TurboManage

David Chandler's Journal of Java Web and Mobile Development

  • David M. Chandler

    Google Cloud Platform Data Engineering Instructor with ROI Training now residing in Colorado with the wife of my youth (31 years). Besides tech, I enjoy aviation and landscape photography.

  • Subscribe

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,120 other subscribers
  • Sleepless Nights…

    June 2024
    S M T W T F S
     1
    2345678
    9101112131415
    16171819202122
    23242526272829
    30  
  • Blog Stats

    • 1,047,110 hits

Archive for the ‘Google Cloud Platform’ Category

How to compute total dataset size in BigQuery

Posted by David Chandler on October 4, 2023

BigQuery’s INFORMATION_SCHEMA views provide a ton of useful information about BQ usage. A simple query will let you see the total size of each table in a dataset and you can use the ROLLUP() operator to compute a grand total for the whole dataset:

SELECT
  table_id,
  -- Convert bytes to GB.
  sum(ROUND(size_bytes/pow(10,12),2)) as size_tb,
  sum(ROUND(row_count/pow(10,9),2)) as billion_rows
FROM
  -- Replace bigquery-public-data.wikipedia with your own dataset:
  `bigquery-public-data.wikipedia.__TABLES__`
WHERE table_id LIKE 'pageviews%'
GROUP BY ROLLUP(table_id)
ORDER BY table_id;

The query above shows the total size of all the pageviews tables in the wikipedia public dataset. At the time of this writing, these are the results. The first row is the grand total computed by the ROLLUP() function.

Querying all the pageviews tables with a wildcard operator would ingest 20TB and almost half a trillion rows! Be careful where you do that as it might cost you ~$100, but running the INFORMATION_SCHEMA query above is free as it only queries metadata.

Posted in BigQuery, Google Cloud Platform | Tagged: | Leave a Comment »

Estimating BigQuery slot usage for a query

Posted by David Chandler on October 3, 2023

One of the most common questions I’m asked when teaching Data Engineering on Google Cloud Platform is how to find out how many slots BigQuery used to run a query. While there is no direct measurement available on a per query basis, the INFORMATION_SCHEMA table documentation provides a way to estimate the number of slots used. You simply divide the total slot time, which is the time the query ran across all workers, by the elapsed time of the query. The query below will show the estimated number of slots for the 10 most recent queries. Alternatively, you can put a BigQuery job ID in the WHERE clause to see the data for a specific query. Obtain the job ID from the Job Information tab in the results or from a log entry in the Cloud Logs Explorer.

# Avg slots per query
# See https://cloud.google.com/bigquery/docs/information-schema-jobs#estimate_slot_usage_and_cost_for_queries
SELECT query, job_id, start_time,
-- Average slot utilization per job is calculated by dividing total_slot_ms by the millisecond duration of the job
SAFE_DIVIDE(job.total_slot_ms,(TIMESTAMP_DIFF(job.end_time, job.start_time, MILLISECOND))) AS job_avg_slots,
user_email,
ROUND(total_bytes_processed / POW(10,9), 2) AS gb_processed,
total_slot_ms, reservation_id
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT job
-- WHERE job_id = "your_job_id"
ORDER BY gb_processed DESC
LIMIT 10

Here are some sample results:

BigQuery results showing average slots per query

Note that the BigQuery Admin console as well as the Cloud Monitoring section of the cloud console will show the number of queries in flight and slot utilization across your entire organization. However, when you’re just dying to know how many slots an individual query probably used, the above query will do the trick.

Posted in BigQuery | Tagged: | Leave a Comment »

Chrome 83 new feature might “break” some Web apps temporarily

Posted by David Chandler on June 16, 2020

Chrome m83 includes a new feature to block third-party cookies by default in Incognito windows (https://venturebeat.com/2020/05/19/google-chrome-83/). This is no doubt a good thing for browser security; however, it’s a problem for some apps like bigquerygeoviz that may rely on services from other domains (in this case, bigquerygeoviz is hosted at appspot.com but is making API calls to google.com). Only workaround for now is to turn the setting off when you launch an incognito window, like this:

Many technical trainers like myself use incognito windows for demos and keeping numerous credentials separate, so we may have to get used to changing this setting for a while.

Posted in Google Cloud Platform | Leave a Comment »

Google Cloud extends free Qwiklabs and Pluralsight in May

Posted by David Chandler on May 1, 2020

Google Cloud has extended several free training offers for May, including another 30 free days of Qwiklabs. We use Qwiklabs in all the courses that I teach. They are an excellent way to get hands on with Google Cloud Platform, and Google has made the entire self-paced labs catalog (over 400 labs) available for free this May. If you have some time for learning new things this month, I strongly encourage you to take advantage of this offer!

Posted in Google Cloud Platform | Leave a Comment »

Creating a COVID-19 data pipeline and dashboard with Google Cloud Platform

Posted by David Chandler on April 29, 2020

I’ve created a pipeline and dashboard using Google BigQuery and related tools (GeoViz and DataStudio) to daily import the NYT dataset for each county. This pipeline and dashboard cost literally less than a penny per month to operate in Google Cloud. You can find complete instructions in the GoogleCloudPlatform github repo, or just link directly to my COVID-19 US dashboard to explore the data.

Posted in Google Cloud Platform | Tagged: , , | 2 Comments »