Normalization (more precisely, min-max scaling) is a common technique in machine learning for rescaling features of very different magnitudes to a common range between 0 and 1.

Here we demonstrate how this is done with pandas and Altair.

Original inspiration: [Jason Brownlee: Machine Learning Algorithms from Scratch](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/)

import altair as alt

# alt.renderers.enable('default')
alt.renderers
RendererRegistry(active='default', registered=['colab', 'default', 'html', 'json', 'jupyterlab', 'kaggle', 'mimetype', 'notebook', 'nteract', 'png', 'svg', 'zeppelin'])
from vega_datasets import data

We use the Gapminder health and income dataset from `vega_datasets`.

health_income = data('gapminder-health-income')
health_income.head()
| | country | income | health | population |
|---|---|---|---|---|
| 0 | Afghanistan | 1925 | 57.63 | 32526562 |
| 1 | Albania | 10620 | 76.00 | 2896679 |
| 2 | Algeria | 13434 | 76.50 | 39666519 |
| 3 | Andorra | 46577 | 84.10 | 70473 |
| 4 | Angola | 7615 | 61.00 | 25021974 |

income_domain = [health_income['income'].min(), health_income['income'].max()]
health_domain = [health_income['health'].min(), health_income['health'].max()]

alt.Chart(health_income).mark_point().encode(
    alt.X('income:Q', scale=alt.Scale(domain=income_domain)),
    alt.Y('health:Q', scale=alt.Scale(domain=health_domain)),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)

The process:

  1. Subtract the smallest value from every value, giving the reduced values;
  2. Take the value range, that is, the difference between the largest and smallest values;
  3. Divide the reduced values by the range.

$ \text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}} $

The first step ensures that the smallest value will become 0. Dividing the reduced values by the range 'compresses' the values so the new maximum becomes 1.
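As a quick check of the formula, here is a minimal sketch in plain Python. The helper name `min_max_scale` is made up for illustration; the numbers are the life-expectancy (`health`) values that appear in this notebook's outputs: Afghanistan's value and the column minimum and maximum.

```python
def min_max_scale(value, minimum, maximum):
    """Rescale `value` from the interval [minimum, maximum] to [0, 1]."""
    return (value - minimum) / (maximum - minimum)


# 'health' figures taken from the dataset output shown in this notebook.
health_min, health_max = 48.5, 84.1   # smallest and largest 'health' values
afghanistan_health = 57.63

print(round(min_max_scale(afghanistan_health, health_min, health_max), 6))  # 0.256461
print(min_max_scale(health_min, health_min, health_max))                    # 0.0
print(min_max_scale(health_max, health_min, health_max))                    # 1.0
```

The smallest value maps to 0, the largest to 1, and everything else lands in between in proportion to its distance from the minimum.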

quantitative_columns = ['income', 'health', 'population']

The original minimum and maximum values

health_income.loc[health_income[quantitative_columns].idxmin(), :]
| | country | income | health | population |
|---|---|---|---|---|
| 32 | Central African Republic | 599 | 53.8 | 4900274 |
| 93 | Lesotho | 2598 | 48.5 | 2135022 |
| 105 | Marshall Islands | 3661 | 65.1 | 52993 |

minimums = health_income[quantitative_columns].min()
minimums
income          599.0
health           48.5
population    52993.0
dtype: float64
health_income.loc[health_income[quantitative_columns].idxmax(), :]
| | country | income | health | population |
|---|---|---|---|---|
| 134 | Qatar | 132877 | 82.0 | 2235355 |
| 3 | Andorra | 46577 | 84.1 | 70473 |
| 35 | China | 13334 | 76.9 | 1376048943 |

maximums = health_income[quantitative_columns].max()
maximums
income        1.328770e+05
health        8.410000e+01
population    1.376049e+09
dtype: float64

Difference of values from the column minimum

health_income[quantitative_columns] - minimums
| | income | health | population |
|---|---|---|---|
| 0 | 1326.0 | 9.13 | 32473569.0 |
| 1 | 10021.0 | 27.50 | 2843686.0 |
| 2 | 12835.0 | 28.00 | 39613526.0 |
| 3 | 45978.0 | 35.60 | 17480.0 |
| 4 | 7016.0 | 12.50 | 24968981.0 |
| ... | ... | ... | ... |
| 182 | 5024.0 | 28.00 | 93394608.0 |
| 183 | 3720.0 | 26.70 | 4615473.0 |
| 184 | 3288.0 | 19.10 | 26779222.0 |
| 185 | 3435.0 | 10.46 | 16158774.0 |
| 186 | 1202.0 | 11.51 | 15549758.0 |

187 rows × 3 columns

Value ranges: the difference between the maximum and the minimum

maximums - minimums
income        1.322780e+05
health        3.560000e+01
population    1.375996e+09
dtype: float64
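The two steps above work column by column because pandas aligns the index of a Series with the column labels of a DataFrame during arithmetic. The sketch below illustrates this on a toy frame built from the first three rows of the dataset; the names `sample`, `sample_minimums`, `sample_ranges`, and `scaled` are illustrative, and the statistics are those of the three-row sample, not of the full dataset.

```python
import pandas as pd

# Toy frame: 'income' and 'health' of the first three countries shown earlier.
sample = pd.DataFrame({
    'income': [1925, 10620, 13434],
    'health': [57.63, 76.00, 76.50],
})

# Column-wise statistics; each is a Series indexed by column name.
sample_minimums = sample.min()
sample_ranges = sample.max() - sample_minimums

# The Series labels ('income', 'health') are matched to the column labels,
# so every column is shifted and divided by its own minimum and range.
scaled = (sample - sample_minimums) / sample_ranges
print(scaled)
```

Within this sample, the first row becomes 0 in both columns and the last row becomes 1, since those rows hold the sample's minimum and maximum.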

Let's normalize the dataset

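def normalize_dataset(dataset, quantitative_columns):
    """Return a copy of `dataset` with the given columns min-max scaled to [0, 1]."""
    # Work on a copy so the caller's DataFrame is left untouched.
    dataset = dataset.copy()

    # Per-column extremes; both are Series indexed by column name.
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()

    # Shift each column so its minimum is 0, then divide by its range.
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)

    return dataset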
normalized_health_income = normalize_dataset(health_income, quantitative_columns)
normalized_health_income
| | country | income | health | population |
|---|---|---|---|---|
| 0 | Afghanistan | 0.010024 | 0.256461 | 0.023600 |
| 1 | Albania | 0.075757 | 0.772472 | 0.002067 |
| 2 | Algeria | 0.097030 | 0.786517 | 0.028789 |
| 3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
| 4 | Angola | 0.053040 | 0.351124 | 0.018146 |
| ... | ... | ... | ... | ... |
| 182 | Vietnam | 0.037981 | 0.786517 | 0.067874 |
| 183 | West Bank and Gaza | 0.028123 | 0.750000 | 0.003354 |
| 184 | Yemen | 0.024857 | 0.536517 | 0.019462 |
| 185 | Zambia | 0.025968 | 0.293820 | 0.011743 |
| 186 | Zimbabwe | 0.009087 | 0.323315 | 0.011301 |

187 rows × 4 columns

The new minimum and maximum values

normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmin(), :]
| | country | income | health | population |
|---|---|---|---|---|
| 32 | Central African Republic | 0.000000 | 0.148876 | 0.003523 |
| 93 | Lesotho | 0.015112 | 0.000000 | 0.001513 |
| 105 | Marshall Islands | 0.023148 | 0.466292 | 0.000000 |

normalized_health_income.loc[normalized_health_income[quantitative_columns].idxmax(), :]
| | country | income | health | population |
|---|---|---|---|---|
| 134 | Qatar | 1.000000 | 0.941011 | 0.001586 |
| 3 | Andorra | 0.347586 | 1.000000 | 0.000013 |
| 35 | China | 0.096275 | 0.797753 | 1.000000 |
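A more compact way to confirm the new bounds is to aggregate each column's minimum and maximum directly. The following is a sketch rather than part of the original notebook: it repeats `normalize_dataset` so that it runs on its own, and applies it to a six-row frame (the hypothetical name `extremes`) made of the rows that hold each column's minimum and maximum in the outputs above.

```python
import pandas as pd


def normalize_dataset(dataset, quantitative_columns):
    """Same logic as the notebook's function, repeated to keep this sketch runnable."""
    dataset = dataset.copy()
    minimums = dataset[quantitative_columns].min()
    maximums = dataset[quantitative_columns].max()
    dataset[quantitative_columns] = (dataset[quantitative_columns] - minimums) / (maximums - minimums)
    return dataset


# The six rows that contain the column minimums and maximums shown above.
extremes = pd.DataFrame({
    'country': ['Central African Republic', 'Lesotho', 'Marshall Islands',
                'Qatar', 'Andorra', 'China'],
    'income': [599, 2598, 3661, 132877, 46577, 13334],
    'health': [53.8, 48.5, 65.1, 82.0, 84.1, 76.9],
    'population': [4900274, 2135022, 52993, 2235355, 70473, 1376048943],
})
columns = ['income', 'health', 'population']

normalized = normalize_dataset(extremes, columns)

# One row per statistic, one column per quantitative variable.
summary = normalized[columns].agg(['min', 'max'])
print(summary)
```

Every column should report a minimum of 0 and a maximum of 1. Because these six rows include the extremes of the full dataset, the other normalized values should also agree with the tables above.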

Plotting the normalized data gives the same pattern of points as before, but with the income, health, and population scales all mapped to the [0, 1] range.

alt.Chart(normalized_health_income).mark_point().encode(
    alt.X('income:Q'),
    alt.Y('health:Q'),
    alt.Size('population:Q'),
    alt.Tooltip('country:N')
).properties(height=600, width=800)