How to use BigQuery in a Jupyter notebook?
When we analyze huge datasets (like blockchain data) through BigQuery, it is useful to run the data loading and conversion in the cloud and use our local environment mostly for the final transformations and visualization. For these cases, a Jupyter notebook is a fitting environment.
There are at least three main ways to access BigQuery data from a notebook:
- The simplest one: the pandas-gbq library (we already covered it in this previous post)
- Google's BigQuery notebook extension
- The BigQuery Python API
Here we will introduce the last two options.
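As a quick reminder, a minimal pandas-gbq query looks roughly like this (the project id below is a placeholder; see the earlier post for the full setup):

import pandas_gbq

df = pandas_gbq.read_gbq(
    "SELECT 1 AS x",  # any standard SQL query
    project_id="your-project-id",  # placeholder: use your own GCP project
)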
(This post is based on this official tutorial.)
Compared to the pandas-gbq library, we need to define the credentials explicitly, for example by pointing the GOOGLE_APPLICATION_CREDENTIALS environment variable at our service account key file.
import os

# The Google client libraries look for the service account key file
# in the GOOGLE_APPLICATION_CREDENTIALS environment variable
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")
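To verify that the key file is picked up, we can ask google-auth (installed alongside the client libraries) for the default credentials, a quick sanity check:

import google.auth

# Resolves credentials the same way the BigQuery client will,
# returning the project id stored in the key file
credentials, project_id = google.auth.default()
print(project_id)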
For the extension to work, we need to install the google-cloud-bigquery library.
conda install -c conda-forge google-cloud-bigquery
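If you are not using conda, the library is also available through pip:

pip install google-cloud-bigquery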
Then, we load the BigQuery extension:
%load_ext google.cloud.bigquery
By using the %%bigquery magic command, we can store the result of a query directly in a pandas dataframe. Here, we pull the total Ethereum token spending for each day of 2019.
%%bigquery token_transfers
SELECT
    SUM(SAFE_CAST(value AS FLOAT64)) AS total_spent,
    SAFE_CAST(EXTRACT(DATE FROM block_timestamp) AS DATETIME) AS date
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date DESC
token_transfers
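The %%bigquery magic also accepts options; for instance, query parameters can be passed with --params and referenced in the SQL with an @ prefix. A sketch of the same query, parameterized by year:

%%bigquery token_transfers --params {"year": 2019}
SELECT
    SUM(SAFE_CAST(value AS FLOAT64)) AS total_spent,
    SAFE_CAST(EXTRACT(DATE FROM block_timestamp) AS DATETIME) AS date
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = @year
GROUP BY date
ORDER BY date DESC

To visualize the daily totals, we build a layered Altair chart: a log-scale line, a highlighted point on hover, a vertical rule, and a text label.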
import altair as alt

alt.data_transformers.disable_max_rows()

# Selection that tracks the point nearest to the mouse
label = alt.selection_single(
    # encodings=['x'],
    on='mouseover',
    nearest=True,
    empty='none'
)

# Base line chart on a log scale
chart = alt.Chart().mark_line().encode(
    alt.X('date:T'), alt.Y('total_spent:Q', scale=alt.Scale(type='log'))
)

alt.layer(
    chart,
    # Highlight the hovered point
    chart.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),
    # Vertical rule at the hovered date
    alt.Chart().mark_rule(color='darkgray').encode(
        x='date:T'
    ).transform_filter(label),
    # Text label with a white-stroked copy underneath
    chart.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
        text=alt.Text('total_spent:Q', format=',.2e')
    ).transform_filter(label),
    chart.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.Text('total_spent:Q', format=',.2e')
    ).transform_filter(label),
    data=token_transfers
).properties(width=600, height=400, title='Daily token spending during 2019 (log scale)')
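(The duplicated text layer is a small trick: the white-stroked copy underneath acts as a halo, so the label stays readable on top of the line.)

The other option is to use the BigQuery Python API directly. We create a client, which picks up the credentials from the environment variable we set above, and define our query as a plain string.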
from google.cloud import bigquery
client = bigquery.Client()
query ="""
SELECT
from_address,
SUM(SAFE_CAST(value AS FLOAT64)) AS total_spent,
AVG(SAFE_CAST(value AS INT64)) AS average_spent,
COUNT(1) AS times_spent
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE
EXTRACT(YEAR FROM block_timestamp) = 2019 AND
EXTRACT(MONTH FROM block_timestamp) = 09 AND
EXTRACT(DAY FROM block_timestamp) = 24 AND
EXTRACT(HOUR FROM block_timestamp) = 12
GROUP BY from_address
ORDER BY times_spent DESC"""
For this example, we calculate the number of transfers initiated by each address over a single hour, together with their total and average values.
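The four EXTRACT calls pin the query to one hour; an equivalent timestamp range comparison would also work and, if the table is partitioned on block_timestamp, can reduce the amount of data scanned. A hypothetical alternative filter for the same hour:

-- Alternative WHERE clause for the same hour (UTC)
WHERE block_timestamp >= TIMESTAMP '2019-09-24 12:00:00'
  AND block_timestamp <  TIMESTAMP '2019-09-24 13:00:00'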
df = client.query(query).to_dataframe()
df
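Since the token_transfers table is large, it can be worth estimating how much data a query will scan before actually running it. A minimal dry-run sketch, reusing the query string from above:

# A dry run validates the query and reports the bytes it would scan,
# without executing it (and without incurring query costs)
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(query, job_config=job_config)
print(f"This query would process {dry_run_job.total_bytes_processed / 1e9:.2f} GB.")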
alt.Chart(df).mark_rect().encode(
    alt.X('times_spent:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('total_spent:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(title='Total spending value and spending frequency')
As expected, only a few addresses are responsible for the highest spending frequency and the highest spending values during that hour.