Decision Trees: Information Gain
Now that we understand entropy, teaching a Decision Tree is just a game of optimization. The algorithm's goal is to find the exact question that reduces the original "messiness" of the entire dataset as much as possible.
This drop in messiness—the difference between the entropy before a split and the entropy after a split—is called Information Gain.
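As a minimal sketch, here is how entropy and Information Gain can be computed in Python (assuming Shannon entropy in bits; the function names and toy labels are ours, not from a particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A split that leaves both children just as mixed gains nothing...
print(information_gain(list("AABB"), [list("AB"), list("AB")]))  # 0.0
# ...while a split into two pure children removes all the entropy.
print(information_gain(list("AABB"), [list("AA"), list("BB")]))  # 1.0
```

The "weighted average" matters: a child holding 90% of the samples counts for far more of the post-split entropy than one holding 10%.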
The Intuition
Imagine you have a messy room (high entropy). If you ask the question, "Is this item clothing?", you might split the room into two slightly neater piles. However, if you ask the question, "Does this item belong in the trash?", you might instantly eliminate half the mess, creating much more organized piles.
The second question provides a much higher Information Gain. The algorithm tests out every single possible question it could ask, calculates the Information Gain for each, and greedily picks the one that organizes the data the fastest!
The Mathematics
Information Gain sits at the core of the ID3 algorithm, a top-down, recursive procedure that compares the entropy before and after each candidate split:
How does it work?
- Calculate the initial entropy of the dataset.
- Partition the data using every possible feature and cutoff value.
- For each potential split, compute the new entropy (a weighted average of the entropies of the resulting child branches).
- The split that provides the maximum Information Gain becomes the new decision node.
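The steps above can be sketched in Python. The helpers below assume Shannon entropy in bits, and the candidate thresholds are the midpoints between consecutive sorted feature values (one common convention; implementations vary). The fruit data is invented purely for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

def best_split(xs, ys):
    """Greedily pick the threshold on a single numeric feature that
    maximizes Information Gain."""
    best_t, best_gain = None, -1.0
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive feature values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = information_gain(ys, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Made-up data: small fruit are grapes, large fruit are apples.
diameters = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
fruit = ["grape", "grape", "grape", "apple", "apple", "apple"]
print(best_split(diameters, fruit))  # (0.45, 1.0)
```

A full tree builder would run this search over every feature, then recurse into each child partition until the leaves are pure (or some stopping rule fires).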
Of course, reading the formal steps of an algorithm isn't always the easiest way to learn. Let's see it in action!
Recall the decision tree we were building earlier. Our very first decision node split the data based on the condition Diameter ≤ 0.45.
How did the algorithm magically know to pick that exact property and number? It did so by calculating the Information Gain for every candidate split and keeping the one with the highest value.
In the interactive chart below, the black line tracks the Information Gain for every possible split value on the Diameter feature.
Move the decision boundary left and right yourself. Notice how the highest point on the graph perfectly matches the algorithm's chosen split at Diameter = 0.45 (an Information Gain of 0.574).
A Note On Information Measures
An alternative to entropy is the Gini impurity. Both metrics measure the purity of a dataset. While Decision Trees trained with either criterion often yield nearly identical results, Gini impurity is sometimes preferred in production because it is cheaper to compute: it avoids logarithms entirely.
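For comparison, a minimal Gini impurity sketch (the function name is ours). Note that it uses only squaring, no logarithms:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the chance that two labels drawn at random
    (with replacement) from the set disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(list("AAAA")))  # 0.0  -- a pure node
print(gini(list("AABB")))  # 0.5  -- maximally mixed for two classes
```

You can drop this in wherever entropy was used: the split search stays the same, only the impurity measure changes.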