When we analyze huge datasets (such as blockchain data) through BigQuery, it is useful to run the data loading and conversion in the cloud and use our local environment mostly for the final transformations and visualization. For such cases, a Jupyter notebook is a fitting environment.

There are at least three main ways to access BigQuery data from a notebook:

  1. The simplest is the pandas-gbq library (we already covered it in this previous post; a quick reminder follows below)
  2. Google's BigQuery notebook extension
  3. The BigQuery Python API

Here we will introduce the last two options.
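As a quick reminder, the pandas-gbq route is essentially a one-liner. A minimal sketch (the project ID is a placeholder; substitute your own GCP project):

import pandas_gbq

# hypothetical project ID; replace it with your own GCP project
df = pandas_gbq.read_gbq(
    "SELECT 1 AS x",  # any valid BigQuery SQL
    project_id="my-project",
)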

(This post is based on this official tutorial.)

Setting credentials

Compared to the pandas-gbq library, here we need to define the credentials explicitly, by pointing the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account key file.

import os

# point the Google client libraries at the service account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")
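Alternatively, once the google-cloud-bigquery library is installed (see the next section), the credentials can be built explicitly and handed to the client. A sketch using the same key file:

from google.cloud import bigquery
from google.oauth2 import service_account

# build the credentials object directly instead of using the environment variable
credentials = service_account.Credentials.from_service_account_file(
    os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)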

BigQuery extension

For the extension to work, we need to install the google-cloud-bigquery library.

conda install -c conda-forge google-cloud-bigquery

Then, we load the BigQuery extension:

%load_ext google.cloud.bigquery

With the %%bigquery cell magic, the result of a query is immediately stored as a pandas DataFrame under the name given after the magic.

Here, we pull the total Ethereum token spending for each day of 2019.

%%bigquery token_transfers
SELECT
    SUM(SAFE_CAST(value AS FLOAT64)) AS total_spent,
    SAFE_CAST(EXTRACT(DATE FROM block_timestamp) AS DATETIME) AS date
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date DESC
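The magic stores the query result in the token_transfers variable, which we can inspect like any other DataFrame: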
token_transfers
      total_spent        date
0    2.331472e+79  2019-12-31
1    4.078401e+79  2019-12-30
2    1.773262e+79  2019-12-29
3    3.553959e+79  2019-12-28
4    2.751157e+79  2019-12-27
...           ...         ...
360  4.226642e+78  2019-01-05
361  4.122376e+80  2019-01-04
362  1.804472e+80  2019-01-03
363  5.833122e+78  2019-01-02
364  4.245749e+78  2019-01-01

365 rows × 2 columns
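For the visualization we use Altair: a line chart on a log scale, with a hover label that shows the exact daily total.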

import altair as alt

# lift Altair's default 5,000-row limit on embedded data
alt.data_transformers.disable_max_rows()


# selection that tracks the data point nearest to the mouse
label = alt.selection_single(
    on='mouseover',
    nearest=True,
    empty='none'
)


chart = alt.Chart().mark_line().encode(
    alt.X('date:T'), alt.Y('total_spent:Q', scale=alt.Scale(type='log'))
)


alt.layer(
    # the base line
    chart,

    # highlight the hovered point
    chart.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),

    # vertical rule at the hovered date
    alt.Chart().mark_rule(color='darkgray').encode(
        x='date:T'
    ).transform_filter(label),

    # white-stroked text, drawn first as a halo behind the label
    chart.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
        text=alt.Text('total_spent:Q', format=',.2e')
    ).transform_filter(label),

    # the actual label on top
    chart.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.Text('total_spent:Q', format=',.2e')
    ).transform_filter(label),

    data=token_transfers
).properties(width=600, height=400, title='Daily token spending during 2019 (log scale)')
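Layering two mark_text copies is a small trick: the white-stroked copy is drawn first and acts as a halo, so the label on top stays readable where it overlaps the line.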

BigQuery Python API
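The most flexible option is to use the google.cloud.bigquery module directly. We instantiate a Client, which picks up the credentials from the GOOGLE_APPLICATION_CREDENTIALS variable set earlier, and pass it raw SQL strings.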

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
    from_address,
    SUM(SAFE_CAST(value AS FLOAT64)) AS total_spent,
    AVG(SAFE_CAST(value AS INT64)) AS average_spent,
    COUNT(1) AS times_spent
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = 2019 AND
    EXTRACT(MONTH FROM block_timestamp) = 9 AND
    EXTRACT(DAY FROM block_timestamp) = 24 AND
    EXTRACT(HOUR FROM block_timestamp) = 12
GROUP BY from_address
ORDER BY times_spent DESC"""
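As an aside, the date filter could also be supplied through query parameters instead of hard-coded literals. A minimal sketch (assuming the client from above; not used in the rest of this post):

# hypothetical parameterized variant of the filter above
param_query = """
SELECT COUNT(1) AS transfer_count
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE EXTRACT(YEAR FROM block_timestamp) = @year
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("year", "INT64", 2019)]
)
counts = client.query(param_query, job_config=job_config).to_dataframe()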

For this example, we calculate the number of spendings per address, together with their total and average values, over a single hour (noon on 2019-09-24).

df = client.query(query).to_dataframe()
df
      from_address                                 total_spent   average_spent  times_spent
0     0x7ba732b1bb952155b720250b477ce154e19ad62f  1.686990e+12   1.963900e+09          859
1     0x262e4155e8c5519e668ec26f353d75dd9c18e78f  3.132121e+23            NaN          365
2     0xbd8da72e2f42f5c68b59ee02c2245599ccd702dc  2.592921e+24            NaN          294
3     0x0000000000000000000000000000000000000000  7.401857e+24   3.058231e+17          268
4     0xa71c8bae673f99ac6c0f32c56efc89a8ddb9a501  3.896125e+12   1.504295e+10          259
...                                          ...           ...            ...          ...
5279  0xcd338611d74243844f3190b621eb781db53d20b4  1.630439e+23            NaN            1
5280  0xa9d6b0ad82e46db1895a412ec96b00e18bf95b49  1.000000e+09   1.000000e+09            1
5281  0x4aee792a88edda29932254099b9d1e06d537883f  5.740766e+22            NaN            1
5282  0x71e29ec9e13a39062269fc5c8cba155bb850b23a  2.270000e+10   2.270000e+10            1
5283  0x140d6fac06496b21efd086e107d5eca1a16592b3  4.335616e+08   4.335616e+08            1

5284 rows × 4 columns
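Note the NaN values in average_spent: in this dataset the value column is stored as a string, and SAFE_CAST(value AS INT64) returns NULL whenever the number does not fit into an INT64, which is common for wei-denominated amounts. Casting to FLOAT64, as in total_spent, avoids this at the cost of precision.

To relate spending value to spending frequency, we bin both dimensions and count the addresses falling into each cell: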

alt.Chart(df).mark_rect().encode(
    alt.X('times_spent:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('total_spent:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(title='Total spending value and spending frequency')

As expected, only a few addresses are responsible for the highest spending frequency and the highest spending values during that hour.