In this post I go through the main steps of how to calculate a simple univariate linear regression model.

For the example I try to predict Ethereum daily average gas prices from daily average transaction values. I pull the data from the public Google data base with BigQuery.

Code and inspiration are based on Jason Brownlee's "Machine Learning Algorithms from Scratch" book

Libraries and data load

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")
from google.cloud import bigquery
client = bigquery.Client()
import altair as alt
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
query ="""
SELECT
    EXTRACT(DATE FROM block_timestamp) AS date,
    AVG(value) AS value,
    AVG(gas_price) AS gas_price,    
FROM `bigquery-public-data.ethereum_blockchain.transactions`
WHERE
    EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date
"""
values = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)
values.head()
date value gas_price
0 2019-01-01 3.719103e+18 1.431514e+10
1 2019-01-02 4.649915e+18 1.349952e+10
2 2019-01-03 4.188781e+18 1.269504e+10
3 2019-01-04 6.958368e+18 1.418197e+10
4 2019-01-05 8.167590e+18 2.410475e+10

Calculating linear regression

Linear regression makes predictions with the help of linear coefficient. For an univariate case, this is expressed as:

$ \hat{y} = b_0 + b_1 x $

Coefficients

We can estimate the $b_0$ and $b_1$ coeffients in the following ways:

$ b_1 = \frac { \sum_{i = 1}^{n} (x_i - \bar{x} ) (y_i - \bar{y} ) } { \sum_{i=1}^{n} (x_i - \bar{x})^2 } $

$ b_0 = \bar{y} - b_1 \bar{x} $

Where

  • $x$: the predictor variable
  • $y$: the variable to predict
  • $ \bar{x} $ and $\bar{y}$ are their respective means

Covariance and variance

We can also can get the $b_1$ coefficient from the variance and covariance:

  • Covariance: $ \text{Cov}(x, y) = \frac { \sum_{i = 1}^{n} (x_i - \bar{x} ) (y_i - \bar{y} ) } { n } $
  • Variance: $ \delta^2 = \frac { { \sum_{i=1}^{n} (x_i - \bar{x})^2 } } { n } $

From these, we can get

$ b_1 = \frac { \text{Cov} } { \delta^2 } $

Calculation steps

Accordingly, we need to calculate the following metrics:

  1. Variable means for both $x$ and $y$
  2. Their deviations from the mean
  3. Covariance of $x$ and $y$
  4. Variance of $x$
  5. The $b_1$ coeffcient
  6. The $b_0$ coefficient
  7. $\hat{y}$, that is, the predictions

Means

values.mean()
value        3.173648e+18
gas_price    1.616874e+10
dtype: float64

Deviations from the mean

def deviation(array):
    return array - array.mean()
deviation(values['value'])
0      5.454550e+17
1      1.476267e+18
2      1.015132e+18
3      3.784720e+18
4      4.993942e+18
           ...     
360   -1.004019e+18
361   -1.259703e+18
362   -1.050454e+18
363   -6.660476e+17
364    5.699485e+17
Name: value, Length: 365, dtype: float64

Covariance

def covariance(arr1, arr2):
    return (deviation(arr1) * deviation(arr2)).sum() / len(arr1)
covariance(values['value'], values['gas_price'])
2.8148882775011114e+27

Variance

def variance(arr):
    return (deviation(arr) ** 2).sum() / len(arr)
variance(values['value'])
1.3829873312251886e+36

The $b_1$ coefficient

def b1(arr1, arr2):
    return covariance(arr1, arr2) / variance(arr1)
b1(values['value'], values['gas_price'])
2.035368086132359e-09

The $b_0$ coefficient

def b0(arr1, arr2):
    return arr2.mean() - b1(arr1, arr2) * arr1.mean()
b_0 = b0(values['value'], values['gas_price'])
b_0
9709198058.151459

Predictions

values['predictions'] = b_0 + values['value'] * b_1
values['predictions']
0      1.727894e+10
1      1.917349e+10
2      1.823491e+10
3      2.387204e+10
4      2.633325e+10
           ...     
360    1.412519e+10
361    1.360478e+10
362    1.403068e+10
363    1.481309e+10
364    1.732880e+10
Name: predictions, Length: 365, dtype: float64
values.head()
date value gas_price predictions
0 2019-01-01 3.719103e+18 1.431514e+10 1.727894e+10
1 2019-01-02 4.649915e+18 1.349952e+10 1.917349e+10
2 2019-01-03 4.188781e+18 1.269504e+10 1.823491e+10
3 2019-01-04 6.958368e+18 1.418197e+10 2.387204e+10
4 2019-01-05 8.167590e+18 2.410475e+10 2.633325e+10

Plotting the results

First we transform the data into long format

to_plot = values.melt(id_vars=['date'], value_vars=['gas_price', 'predictions'], var_name='status')
to_plot.head()
date status value
0 2019-01-01 gas_price 1.431514e+10
1 2019-01-02 gas_price 1.349952e+10
2 2019-01-03 gas_price 1.269504e+10
3 2019-01-04 gas_price 1.418197e+10
4 2019-01-05 gas_price 2.410475e+10

Then, we plot the results.

chart = alt.Chart().mark_line().encode(
    alt.X('date:T', axis=alt.Axis(format=("%x"), labelAngle=270)),
    alt.Y('value:Q', axis=alt.Axis(format=",.2e"), scale=alt.Scale(type='log')), color=alt.Color('status:N')
).properties(width=600)

label = alt.selection_single(
    encodings=['x'],
    on='mouseover',
    nearest=True,
    empty='none'
)


alt.layer(
    chart,


    chart.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),
    
    alt.Chart().mark_rule(color='darkgray').encode(
        x='date:T'
    ).transform_filter(label),
    
    chart.mark_text(align='left', dx=5, dy=-5, stroke='black', strokeWidth=0.5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),

    chart.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),
    data=to_plot

).properties(title='Gas prices and their predictions (log scale)', width=600, height=400)

It is a bit hard to assess the performance of the results just by looking at them, but at least it seems to generate values within the same ballpark. We could generate metrics as RMSE for reference.

Obvious improvements

  • remove the outliers
  • include past values