Linear regression 'from scratch' to predict Ethereum gas prices from transaction values
In this post I go through the main steps of calculating a simple univariate linear regression model.
As an example, I try to predict Ethereum's daily average gas price from the daily average transaction value. I pull the data from the public Ethereum dataset hosted on Google BigQuery.
Code and inspiration are based on Jason Brownlee's "Machine Learning Algorithms from Scratch" book.
import os
# Point the BigQuery client at the service-account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=os.path.expanduser("~/.credentials/Notebook bigquery-c422e406404b.json")
from google.cloud import bigquery
client = bigquery.Client()
import altair as alt
alt.data_transformers.disable_max_rows()  # allow Altair to plot more than 5,000 rows
query ="""
SELECT
EXTRACT(DATE FROM block_timestamp) AS date,
AVG(value) AS value,
AVG(gas_price) AS gas_price
FROM `bigquery-public-data.ethereum_blockchain.transactions`
WHERE
EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date
"""
values = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)
values.head()
Linear regression makes predictions with the help of linear coefficients. For the univariate case, this is expressed as:
$ \hat{y} = b_0 + b_1 x $
Coefficients
We can estimate the $b_0$ and $b_1$ coefficients as follows:
$ b_1 = \frac { \sum_{i = 1}^{n} (x_i - \bar{x} ) (y_i - \bar{y} ) } { \sum_{i=1}^{n} (x_i - \bar{x})^2 } $
$ b_0 = \bar{y} - b_1 \bar{x} $
Where
- $x$: the predictor variable
- $y$: the variable to predict
- $\bar{x}$ and $\bar{y}$: their respective means
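For example, with a toy dataset $x = (1, 2, 3, 4, 5)$ and $y = (2, 4, 5, 4, 5)$ (made up purely for illustration), we have $\bar{x} = 3$ and $\bar{y} = 4$, so
$ b_1 = \frac{(-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1)}{4 + 1 + 0 + 1 + 4} = \frac{6}{10} = 0.6, \qquad b_0 = 4 - 0.6 \cdot 3 = 2.2 $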
Covariance and variance
We can also get the $b_1$ coefficient from the covariance and the variance:
- Covariance: $ \text{Cov}(x, y) = \frac { \sum_{i = 1}^{n} (x_i - \bar{x} ) (y_i - \bar{y} ) } { n } $
- Variance: $ \sigma_x^2 = \frac { { \sum_{i=1}^{n} (x_i - \bar{x})^2 } } { n } $
From these, we can get
$ b_1 = \frac { \text{Cov}(x, y) } { \sigma_x^2 } $
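As a quick sanity check (a minimal sketch, not part of the actual pipeline, using NumPy which the post otherwise doesn't need), the direct formula and the covariance/variance route give the same slope on the toy data above:
import numpy as np
# Toy data from the worked example above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
# Direct formula for the slope
b1_direct = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Covariance / variance route (population versions, i.e. dividing by n)
b1_cov_var = np.cov(x, y, bias=True)[0, 1] / x.var()
b1_direct, b1_cov_var  # both 0.6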
Calculation steps
Accordingly, we need to calculate the following metrics:
- Variable means for both $x$ and $y$
- Their deviations from the mean
- Covariance of $x$ and $y$
- Variance of $x$
- The $b_1$ coefficient
- The $b_0$ coefficient
- $\hat{y}$, that is, the predictions
values.mean()
def deviation(array):
    # deviation of each element from the mean
    return array - array.mean()
deviation(values['value'])
def covariance(arr1, arr2):
    # population covariance: mean of the product of the deviations
    return (deviation(arr1) * deviation(arr2)).sum() / len(arr1)
covariance(values['value'], values['gas_price'])
def variance(arr):
    # population variance: mean of the squared deviations
    return (deviation(arr) ** 2).sum() / len(arr)
variance(values['value'])
def b1(arr1, arr2):
    # slope: covariance of x and y divided by the variance of x
    return covariance(arr1, arr2) / variance(arr1)
b_1 = b1(values['value'], values['gas_price'])
b_1
def b0(arr1, arr2):
    # intercept: mean of y minus the slope times the mean of x
    return arr2.mean() - b1(arr1, arr2) * arr1.mean()
b_0 = b0(values['value'], values['gas_price'])
b_0
values['predictions'] = b_0 + values['value'] * b_1
values['predictions']
values.head()
First, we transform the data into long format for plotting.
to_plot = values.melt(id_vars=['date'], value_vars=['gas_price', 'predictions'], var_name='status')
to_plot.head()
Then, we plot the results.
chart = alt.Chart().mark_line().encode(
    alt.X('date:T', axis=alt.Axis(format="%x", labelAngle=270)),
    alt.Y('value:Q', axis=alt.Axis(format=",.2e"), scale=alt.Scale(type='log')),
    color=alt.Color('status:N')
).properties(width=600)

label = alt.selection_single(
    encodings=['x'],   # limit the selection to the x (date) value
    on='mouseover',    # trigger on hover
    nearest=True,      # select the nearest point
    empty='none'       # an empty selection matches nothing
)

alt.layer(
    chart,
    # points that become visible at the hovered date
    chart.mark_circle().encode(
        opacity=alt.condition(label, alt.value(1), alt.value(0))
    ).add_selection(label),
    # vertical rule at the hovered date
    alt.Chart().mark_rule(color='darkgray').encode(
        x='date:T'
    ).transform_filter(label),
    # text labels for the hovered values: an outlined layer plus a plain layer on top
    chart.mark_text(align='left', dx=5, dy=-5, stroke='black', strokeWidth=0.5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),
    chart.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.Text('value:Q', format=',.2e')
    ).transform_filter(label),
    data=to_plot
).properties(title='Gas prices and their predictions (log scale)', width=600, height=400)
It is a bit hard to assess the quality of the predictions just by looking at them, but at least the model seems to generate values in the same ballpark. We could compute a metric such as RMSE for reference.
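For example (a minimal sketch reusing the values DataFrame and its predictions column from above), the root mean squared error can be computed directly from the residuals:
# RMSE: square root of the mean squared residual
rmse = ((values['gas_price'] - values['predictions']) ** 2).mean() ** 0.5
rmse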
Obvious improvements
- Remove the outliers (a rough sketch follows below)
- Include past values as additional predictors
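As a rough sketch of both points (assuming outliers are simply defined as days outside the 1st to 99th percentile of transaction values, and that "past values" means a one-day lag; both are my own choices here, not part of the original analysis), the preprocessing could look like this:
# Hypothetical preprocessing sketch: trim extreme days and add a lagged predictor
low, high = values['value'].quantile([0.01, 0.99])
trimmed = values[values['value'].between(low, high)].copy()
trimmed['value_lag1'] = trimmed['value'].shift(1)  # previous day's average transaction value
trimmed = trimmed.dropna()  # drop the first row, which has no lagged value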