Data scale normalization
Normalization is a common technique in machine learning for rescaling features of different magnitudes to a common range, here between 0 and 1.
Here we demonstrate how this is done with pandas and altair.
Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)
import altair as alt
# alt.renderers.enable('default')
alt.renderers
from vega_datasets import data
We use the Gapminder health and income dataset.
health_income = data('gapminder-health-income')
health_income.head()
income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]
alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)
The process:
- Subtract the smallest value from every value;
- Compute the value range, that is, the difference between the largest and smallest values;
- Divide the reduced values by the range.
$ \text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}} $
The first step ensures that the smallest value will become 0. Dividing the reduced values by the range 'compresses' the values so the new maximum becomes 1.
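As a quick illustration, the three steps can be sketched on a handful of made-up numbers. The values below are hypothetical and are not taken from the Gapminder dataset; they are only chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical toy values, not from the Gapminder dataset
values = pd.Series([10.0, 20.0, 40.0, 50.0])

# Step 1: subtract the minimum, so the smallest value becomes 0
reduced = values - values.min()

# Step 2: the value range, here 50 - 10 = 40
value_range = values.max() - values.min()

# Step 3: divide by the range, so the largest value becomes 1
scaled = reduced / value_range

print(scaled.tolist())  # [0.0, 0.25, 0.75, 1.0]
```

The relative spacing of the values is preserved: 20 sits a quarter of the way between 10 and 50, and its scaled value is 0.25.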
quantitative_columns = ['income', 'health', 'population']
The original minimum and maximum values
health_income.loc[health_income[quantitative_columns].idxmin(), :]
minimums = health_income[quantitative_columns].min()
minimums
health_income.loc[health_income[quantitative_columns].idxmax(), :]
maximums = health_income[quantitative_columns].max()
maximums
Difference of values from the column minimum
health_income[quantitative_columns] - minimums
Value ranges: the difference between the maximum and the minimum
maximums - minimums
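The two expressions above rely on how pandas aligns labels: `min()` and `max()` return a Series indexed by column name, and subtracting such a Series from a DataFrame matches that index against the DataFrame's columns. A minimal sketch on a hypothetical toy frame (the `toy` name and its numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for the quantitative columns
toy = pd.DataFrame({
    "income": [100.0, 300.0, 500.0],
    "health": [50.0, 60.0, 80.0],
})

minimums = toy.min()  # Series indexed by column name
maximums = toy.max()

# DataFrame - Series aligns the Series index with the DataFrame columns,
# so every row has the matching column minimum subtracted.
reduced = toy - minimums

# Series - Series aligns on the shared column names.
ranges = maximums - minimums

print(reduced)
print(ranges)
```

Because the alignment is by label, the order of the columns in `toy` does not matter as long as the names match.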
Let's normalize the dataset
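def normalize_dataset(dataset, quantitative_columns):
    """Return a copy of dataset with the given columns min-max scaled to [0, 1]."""
    # Work on a copy so the original DataFrame is left unchanged
    dataset = dataset.copy()
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()
    # Subtract each column's minimum, then divide by that column's range
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)
    return dataset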
normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income
The new minimum and maximum values
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]
normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]
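Instead of inspecting the rows by eye, the same property can be checked programmatically. The sketch below restates `normalize_dataset` so that it runs on its own, and applies it to a hypothetical three-row frame rather than the Gapminder data; the country names and numbers are made up.

```python
import pandas as pd

def normalize_dataset(dataset, quantitative_columns):
    # Same logic as the notebook's function, repeated to keep this block self-contained
    dataset = dataset.copy()
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)
    return dataset

# Hypothetical toy data, not from the Gapminder dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "income": [100.0, 300.0, 500.0],
    "health": [50.0, 60.0, 80.0],
})
columns = ["income", "health"]

normalized = normalize_dataset(toy, columns)

# Every normalized column should now span exactly [0, 1]
assert (normalized[columns].min() == 0).all()
assert (normalized[columns].max() == 1).all()

# The input frame is untouched because the function works on a copy
assert toy["income"].max() == 500.0
```

Non-quantitative columns such as `country` pass through unchanged, since only the listed columns are reassigned.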
Plotting the normalized data gives the same picture as before, but with the income, health, and population scales all normalized to the [0, 1] range.
alt.Chart(normalized_health_income).mark_point().encode(
    alt.X('income:Q'),
    alt.Y('health:Q'),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)