Query GitHub data using BigQuery

BigQuery is Google’s fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without needing a database administrator or any infrastructure to manage. BigQuery uses familiar SQL and a pay-only-for-what-you-use pricing model, letting you focus on analyzing data to find meaningful insights.

In this post we’ll see how to query the GitHub public dataset to get hands-on experience with it.

Sign in to the Google Cloud Platform console (console.cloud.google.com) and navigate to BigQuery. You can also open the BigQuery web UI directly by entering the following URL in your browser:

https://console.cloud.google.com/bigquery
  • Accept the terms of service.
  • Before you can use BigQuery, you must create a project. Follow the prompts to create your new project.

The project ID is a unique name across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.
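The project ID appears as the first part of any fully qualified table name when you query tables in your own project. A minimal sketch, where my_dataset, my_table, and some_column are hypothetical placeholders:

-- Hypothetical names: replace PROJECT_ID, my_dataset, my_table,
-- and some_column with your own project ID and objects.
SELECT some_column
FROM `PROJECT_ID.my_dataset.my_table`
LIMIT 10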

This codelab uses BigQuery resources within the BigQuery sandbox limits. A billing account is not required. If you later want to remove the sandbox limits, you can add a billing account by signing up for the Google Cloud Platform free trial.

Preview GitHub Data

Open the GitHub dataset in the BigQuery web UI.

Get a quick preview of how the data looks.

Do not use SELECT * to view rows of a table. Even with a LIMIT clause, BigQuery scans all columns of all rows in the table, which will use up your free query quota.
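If you just want to eyeball a few rows, select only the columns you need; BigQuery then scans just those columns. A minimal sketch (repo_name is assumed to be a column in sample_commits; check the table schema in the UI):

-- Preview a handful of rows by naming specific columns instead of
-- using SELECT *, so only those columns are scanned.
SELECT repo_name, subject
FROM `bigquery-public-data.github_repos.sample_commits`
LIMIT 10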

SELECT subject,
  COUNT(*) AS num_duplicates
FROM `bigquery-public-data.github_repos.sample_commits`
GROUP BY subject
ORDER BY num_duplicates DESC
LIMIT 100

Since the GitHub dataset is large, it helps to use a smaller sample dataset while you’re experimenting, to save on costs. Use the bytes-processed estimate shown below the query editor to gauge the cost before you run a query.
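For comparison, the same aggregation against the full commits table in the dataset would process far more data; check the estimate below the editor before running something like this:

-- Same query against the full commits table; the bytes-processed
-- estimate will be much larger than for sample_commits.
SELECT subject,
  COUNT(*) AS num_duplicates
FROM `bigquery-public-data.github_repos.commits`
GROUP BY subject
ORDER BY num_duplicates DESC
LIMIT 100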

Click the Run query button.

In a few seconds, the results appear at the bottom, along with how much data was processed and how long the query took:

Even though the sample_commits table is 2.49 GB, the query processed only 35.8 MB. BigQuery reads only the columns referenced in the query, so the amount of data processed can be far less than the total table size. With clustering and partitioning, it can be reduced even further.
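As a sketch of how clustering can help here, you could materialize a clustered copy of the sample table; my_dataset is a hypothetical dataset that must already exist in your own project:

-- Hypothetical: my_dataset is a placeholder for one of your datasets.
-- Clustering by repo_name lets BigQuery prune storage blocks for
-- queries that filter on that column.
CREATE TABLE my_dataset.commits_clustered
CLUSTER BY repo_name AS
SELECT repo_name, subject, message
FROM `bigquery-public-data.github_repos.sample_commits`;

Queries that filter on repo_name against the clustered table then scan only the matching blocks rather than the whole table.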
