Decision Trees: Where To Partition?
We know a Decision Tree asks a sequence of questions to split data into smaller, more organized groups. But how does the algorithm know which question to ask first? To figure this out, it uses a powerful concept borrowed from information theory called Entropy.
The Intuition
Think of entropy as a measure of messiness or surprise.
Imagine you have a jar filled entirely with red jellybeans. If you reach in blindly and pull one out, you know immediately it will be red. There is no surprise, and no messiness. This is a state of low entropy (also known as being highly "pure").
Now imagine a jar filled with fifty different flavors of jellybeans. Pulling one out is a gamble. This state of high uncertainty and maximum messiness is called high entropy (or "impure").
A Decision Tree's entire goal is to take a highly messy, high-entropy dataset and keep splitting it until it creates neat, low-entropy groups where data points share the same class!
The Mathematics
In mathematics, we quantify entropy using probabilities. If we have a set of $k$ classes with probabilities $p_1, p_2, \dots, p_k$, the total entropy $H$ is written as the negative sum of the probabilities weighted by their logarithms:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$
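This formula translates directly into a few lines of Python. Here is a minimal sketch, with the jar examples mirroring the jellybean intuition above:

```python
import math

def entropy(probabilities):
    # H = -sum(p * log2(p)); a class with p == 0 contributes nothing.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))          # all-red jar: 0.0 bits (pure)
print(entropy([0.5, 0.5]))     # 50/50 jar: 1.0 bit (maximally impure)
print(entropy([1 / 50] * 50))  # fifty flavors, equally likely: ~5.64 bits
```

Note that the sum skips zero-probability terms: mathematically, $0 \log 0$ is taken to be $0$, since a class that never occurs adds no uncertainty.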
This equation guarantees three properties:
- $H = 0$ if all but one of the $p_i$ are zero. The entropy vanishes when there is no uncertainty.
- $H$ reaches its absolute maximum, $\log_2 k$, when all probabilities are equal ($p_i = 1/k$). This is the most uncertain or 'impure' situation.
- Any shift towards equalizing the probabilities increases the total entropy $H$.
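These three properties can be checked numerically. A quick sketch (the specific probability mixes are just illustrative):

```python
import math

def entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p)), with 0 * log(0) treated as 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Property 1: no uncertainty (one class holds all the probability) -> zero entropy.
print(entropy([1.0, 0.0, 0.0]))                 # 0.0

# Property 2: uniform probabilities -> the maximum, log2(k) bits (here k = 4).
print(entropy([0.25, 0.25, 0.25, 0.25]))        # 2.0

# Property 3: nudging probabilities toward equality strictly increases entropy.
print(entropy([0.9, 0.1]) < entropy([0.7, 0.3]) < entropy([0.5, 0.5]))  # True
```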
Let's test this intuition. Sort the items below based on whether they represent a state of low entropy (pure) or high entropy (impure).
Why does impurity matter?
Entropy helps us mathematically quantify the impurity of a collection of labeled data points. A node containing multiple classes is impure (high entropy), whereas a node containing only one class is pure (zero entropy).
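In practice, a node's impurity is computed from the counts of each class label it contains rather than from pre-supplied probabilities. A small sketch of that calculation (the red/blue labels are illustrative, echoing the binary example here):

```python
import math
from collections import Counter

def node_entropy(labels):
    """Entropy of a node, computed from its list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # Convert each class count into a probability, then apply -p * log2(p).
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(node_entropy(["red"] * 8))                   # pure node -> 0.0
print(node_entropy(["red"] * 4 + ["blue"] * 4))    # even split -> 1.0
print(node_entropy(["red"] * 6 + ["blue"] * 2))    # mostly red -> ~0.811
```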
Above, you can compute the exact entropy for a collection of labeled data points in a binary classification problem. Click the Add and Remove buttons to see how changing the mix of colors affects the math.
Notice how perfectly uniform samples have an entropy of zero, while an equal split results in the highest possible entropy value? The algorithm uses this exact metric to hunt down the ideal