It’s well known that many machine learning models can’t process categorical features natively. While there are some exceptions, it’s usually up to the practitioner to decide on a numeric representation of each categorical feature. There are many ways to accomplish this, but one strategy that is seldom recommended is label encoding.
Label encoding replaces each categorical value with an arbitrary number. For instance, if we have a feature containing letters of the alphabet, label encoding might assign the letter “A” a value of 0, the letter “B” a value of 1, and continue this pattern until “Z”, which is assigned 25. After this process, technically speaking, any algorithm should be able to handle the encoded feature.
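To make the mapping concrete, here is a minimal pure-Python sketch of label encoding for the alphabet example. The helper name `label_encode` is hypothetical, and assigning codes in sorted order is an assumption made so that “A” maps to 0 and “Z” maps to 25; any other arbitrary assignment would be an equally valid label encoding.

```python
import string


def label_encode(values):
    """Map each distinct category to an integer code (hypothetical helper)."""
    # Assumption: codes follow the sorted order of the distinct categories.
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    # Replace every value with the code of its category.
    return [mapping[value] for value in values], mapping


letters = list(string.ascii_uppercase)
encoded, mapping = label_encode(letters)
print(mapping["A"], mapping["B"], mapping["Z"])  # prints: 0 1 25
```

Note that the resulting codes carry an order and a spacing (“Z” sits 25 units away from “A”) that the original letters never had.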
But what’s the problem with this? Shouldn’t sophisticated machine learning models be able to handle this kind of encoding? Why do libraries like CatBoost and other encoding strategies exist to deal with high-cardinality categorical features?
This article will explore two examples demonstrating why label encoding can be problematic for machine learning models. These examples will help us appreciate why there are so many alternatives to label encoding, and they will deepen our understanding of the relationship between data complexity and model performance.
One of the best ways to gain intuition for a machine learning concept is to understand how it works in a low-dimensional space and try to extrapolate the result to higher dimensions. This mental extrapolation doesn’t always align with reality, but for our purposes, all we need is a single feature to see why we need better categorical encoding strategies.
A Feature With 25 Categories
Let’s start by looking at a basic toy dataset with one feature and a continuous target. Here are the dependencies we need:
import numpy as np
import polars as pl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split