When we try to build a prediction algorithm, it is a useful practice to first try out a baseline algorithm and see how they perform. Here we will describe a case of random prediction.

(Inspiration and examples are from Jason Brownlee's Machine Learning Algorithms from Scratch book.)

import pandas as pd
import altair as alt
import numpy as np

from vega_datasets import data

np.random.seed(42)


For the examples we will use the vega 'volcano' dataset. The width and height values are constant, so we work only with the values column.

volcano = data('volcano')

width height values
0 87 61 103
1 87 61 104
2 87 61 104
3 87 61 105
4 87 61 105

The random prediction algorithm

1. takes a training and a test set
2. generates the prediction by selecting random elements from the training set

We split the dataset to training and test sets, by taking the first 2/3 and last 1/3 of the data respectively.

train, test = volcano.iloc[: volcano.shape * 2 // 3, :], volcano.iloc[volcano.shape * 2 // 3 :, :]


A small check that the split did not leave out an entry or did not result in an overlap.

assert train.index[-1] + 1 == test.index


We plot the training and the test sets, respectively.

alt.Chart(train.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Train data')

alt.Chart(test.reset_index()).mark_line().encode(alt.Y('values'), alt.X('index:Q'), alt.Tooltip('values')).interactive().properties(title='Test data')


First, we generate the random predictions with replacement. That is, the same values can occur more than once.

predictions = np.random.choice(train['values'], size=test.shape, replace=True)
predictions

array([166, 179,  96, ..., 105, 157,  95])

We calculate the error terms

errors = test['values'] - predictions
errors

3538   -16
3539   -29
3540    54
3541    54
3542    56
..
5302    -8
5303     2
5304    -7
5305   -60
5306     2
Name: values, Length: 1769, dtype: int64

We calculate the root mean squared error of the predictions.

def calculate_rmse(observed, predicted):
return np.sqrt(sum((observed - predicted) ** 2)/len(observed))


The RMSE is around 34.59 which stands for about 1.34 STD.

rmse = calculate_rmse(test['values'], predictions)
rmse

34.93380641982711

For reference, if we would simply use the simple mean, we would get an RMSE of 18.42,

calculate_rmse(test['values'], test['values'].mean())

18.420942507684327

Let's plot the results

to_compare = pd.concat(
[
test['values'].rename('observed').reset_index(drop=True),
pd.Series(predictions).rename('predictions').reset_index(drop=True)
], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})

to_compare

index status values
0 0 observed 150
1 0 predictions 166
2 1 observed 150
3 1 predictions 179
4 2 observed 150
... ... ... ...
3533 1766 predictions 105
3534 1767 observed 97
3535 1767 predictions 157
3536 1768 observed 97
3537 1768 predictions 95

3538 rows × 3 columns

alt.Chart(to_compare).mark_line().encode(
alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200, title='Predictions with replacement')


We put the steps into a function and rerun the prediction without replacement.

def predict_randomly(train, test, replace=True):
predictions = np.random.choice(train, size=test.shape, replace=True)
errors = test - predictions
rmse = calculate_rmse(test, predictions)

return predictions, rmse

predictions, rmse = predict_randomly(train['values'], test['values'], replace=False)
rmse

35.90392934559341

Finally, we also put the plotting steps into a function

def plot_predictions(observed, predicted):

to_compare = pd.concat(
[observed.rename('observed').reset_index(drop=True), pd.Series(predicted).rename('predictions').reset_index(drop=True)], axis=1
).stack().reset_index().rename(columns={'level_1': 'status', 'level_0': 'index', 0: 'values'})

chart = alt.Chart(to_compare).mark_line().encode(
alt.X('index:Q'), alt.Y('values'), alt.Color('status'), alt.OpacityValue(0.7)
).properties(width=1200)

chart.display()


The predictions without replacement

plot_predictions(test['values'], predictions)