TurboManage

David Chandler's Journal of Java Web and Mobile Development

  • David M. Chandler

    Google Cloud Platform Data Engineering Instructor with ROI Training now residing in Colorado with the wife of my youth (31 years). Besides tech, I enjoy aviation and landscape photography.

  • Subscribe

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,120 other subscribers
  • Sleepless Nights…

    October 2023
    S M T W T F S
    1234567
    891011121314
    15161718192021
    22232425262728
    293031  
  • Blog Stats

    • 1,047,108 hits

How to compute total dataset size in BigQuery

Posted by David Chandler on October 4, 2023

BigQuery’s INFORMATION_SCHEMA views provide a ton of useful information about BQ usage. A simple query will let you see the total size of each table in a dataset and you can use the ROLLUP() operator to compute a grand total for the whole dataset:

SELECT
  table_id,
  -- Convert bytes to GB.
  sum(ROUND(size_bytes/pow(10,12),2)) as size_tb,
  sum(ROUND(row_count/pow(10,9),2)) as billion_rows
FROM
  -- Replace bigquery-public-data.wikipedia with your own dataset:
  `bigquery-public-data.wikipedia.__TABLES__`
WHERE table_id LIKE 'pageviews%'
GROUP BY ROLLUP(table_id)
ORDER BY table_id;

The query above shows the total size of all the pageviews tables in the wikipedia public dataset. At the time of this writing, these are the results. The first row is the grand total computed by the ROLLUP() function.

Querying all the pageviews tables with a wildcard operator would ingest 20TB and almost half a trillion rows! Be careful where you do that as it might cost you ~$100, but running the INFORMATION_SCHEMA query above is free as it only queries metadata.

Leave a comment