Random prediction
When building a prediction algorithm, it is useful practice to first try a baseline algorithm and see how it performs. Here we describe one such baseline: random prediction.
(Inspiration and examples are from Jason Brownlee's Machine Learning Algorithms from Scratch book.)
import pandas as pd
import altair as alt
import numpy as np
from vega_datasets import data
np.random.seed(42)
For the examples we will use the vega 'volcano' dataset. The width and height values are constant, so we work only with the values column.
volcano = data('volcano')
volcano.head()
The random prediction algorithm
- takes a training and a test set
- generates the prediction by selecting random elements from the training set
We split the dataset into training and test sets by taking the first 2/3 and the last 1/3 of the data, respectively.
cut = volcano.shape[0] * 2 // 3
train, test = volcano.iloc[:cut, :], volcano.iloc[cut:, :]
A small check that the split neither left out an entry nor produced an overlap.
assert train.index[-1] + 1 == test.index[0]
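The assertion above only compares the indices at the boundary. A slightly broader sanity check, sketched here on a toy frame (not the volcano data) with the same 2/3 cut, would also confirm that no rows are lost or shared:

```python
import pandas as pd

# Toy frame standing in for the volcano data; the split logic is the same 2/3 cut.
df = pd.DataFrame({'values': range(9)})
cut = df.shape[0] * 2 // 3
train_part, test_part = df.iloc[:cut], df.iloc[cut:]

# No rows lost and no rows shared between the two sets.
assert len(train_part) + len(test_part) == len(df)
assert train_part.index.intersection(test_part.index).empty
```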
We plot the training and the test sets, respectively.
alt.Chart(train.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Train data')
alt.Chart(test.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Test data')
First, we generate the random predictions with replacement. That is, the same training value can be drawn more than once.
predictions = np.random.choice(train['values'], size=test.shape[0], replace=True)
predictions
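As a quick illustration of what sampling with replacement means (using a throwaway generator and pool, not the volcano data): the sample can be larger than the pool, and any value may repeat.

```python
import numpy as np

rng = np.random.default_rng(0)  # throwaway seed, just for the demo
pool = np.array([10, 20, 30])

# With replace=True we can draw more samples than the pool holds,
# and any value may appear repeatedly.
sample = rng.choice(pool, size=5, replace=True)
assert sample.shape == (5,)
assert set(sample) <= {10, 20, 30}
```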
We calculate the error terms.
errors = test['values'] - predictions
errors
We calculate the root mean squared error of the predictions.
def calculate_rmse(observed, predicted):
    return np.sqrt(np.mean((observed - predicted) ** 2))
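A hand-checkable sanity test of the RMSE helper (restated here with np.mean, which is numerically equivalent to summing and dividing by the length):

```python
import numpy as np

def calculate_rmse(observed, predicted):
    return np.sqrt(np.mean((observed - predicted) ** 2))

# Errors are (0, 0, 3), so RMSE = sqrt((0 + 0 + 9) / 3) = sqrt(3).
observed = np.array([1.0, 2.0, 3.0])
predicted = np.array([1.0, 2.0, 6.0])
assert np.isclose(calculate_rmse(observed, predicted), np.sqrt(3))
```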
The RMSE is around 34.59, which corresponds to roughly 1.34 standard deviations.
rmse = calculate_rmse(test['values'], predictions)
rmse
For reference, if we simply predicted the mean, we would get an RMSE of 18.42.
calculate_rmse(test['values'], test['values'].mean())
Let's plot the results
to_compare = pd.concat(
    [
        test['values'].rename('observed').reset_index(drop=True),
        pd.Series(predictions).rename('predictions').reset_index(drop=True)
    ], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})
to_compare
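The pd.concat / stack / rename chain converts the two side-by-side columns into the long ("tidy") layout Altair expects, with one row per (index, status, value) triple. A miniature version on a two-row frame shows the shape change:

```python
import pandas as pd

# Miniature version of the wide-to-long reshape used above.
wide = pd.DataFrame({'observed': [1, 2], 'predictions': [3, 4]})
long = (wide.stack().reset_index()
            .rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'}))

# Each row now holds one (index, status, value) triple: the layout
# Altair needs to draw one colored line per status.
assert list(long.columns) == ['index', 'status', 'values']
assert len(long) == 4
```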
alt.Chart(to_compare).mark_line().encode(
    alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200, title='Predictions with replacement')
We put the steps into a function and rerun the prediction without replacement.
def predict_randomly(train, test, replace=True):
    # Pass the replace flag through to the sampler instead of hardcoding it.
    predictions = np.random.choice(train, size=test.shape[0], replace=replace)
    rmse = calculate_rmse(test, predictions)
    return predictions, rmse
predictions, rmse = predict_randomly(train['values'], test['values'], replace=False)
rmse
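Sampling without replacement only works here because the training set (2/3 of the data) is larger than the test set (1/3). If the requested sample size exceeded the pool, NumPy would raise a ValueError, as this small standalone sketch shows:

```python
import numpy as np

rng = np.random.default_rng(0)
pool = np.arange(3)

# Without replacement the sample size cannot exceed the pool size;
# NumPy refuses the request with a ValueError.
try:
    rng.choice(pool, size=5, replace=False)
    raised = False
except ValueError:
    raised = True
assert raised
```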
Finally, we also put the plotting steps into a function.
def plot_predictions(observed, predicted):
    to_compare = pd.concat(
        [observed.rename('observed').reset_index(drop=True),
         pd.Series(predicted).rename('predictions').reset_index(drop=True)], axis=1
    ).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})
    chart = alt.Chart(to_compare).mark_line().encode(
        alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
    ).properties(width=1200)
    chart.display()
We plot the predictions made without replacement.
plot_predictions(test['values'], predictions)