Decision Trees: Information Gain
Now that we understand entropy, teaching a Decision Tree is just a game of optimization. The algorithm's goal is to find the exact question that reduces the original "messiness" of the entire dataset as much as possible.
This drop in messiness—the difference between the entropy before a split and the entropy after a split—is called Information Gain.
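As a minimal sketch, here is how entropy and Information Gain can be computed in Python (assuming Shannon entropy in bits; the function names and toy labels are ours, not from a particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A split that leaves both children just as mixed gains nothing...
print(information_gain(list("AABB"), [list("AB"), list("AB")]))  # 0.0
# ...while a split into two pure children removes all the entropy.
print(information_gain(list("AABB"), [list("AA"), list("BB")]))  # 1.0
```

The "weighted average" matters: a child holding 90% of the samples counts for far more of the post-split entropy than one holding 10%.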
The Intuition
Imagine you have a messy room (high entropy). If you ask the question, "Is this item clothing?", you might split the room into two slightly neater piles. However, if you ask the question, "Does this item belong in the trash?", you might instantly eliminate half the mess, creating much more organized piles.
The second question provides a much higher Information Gain. The algorithm tests out every single possible question it could ask, calculates the Information Gain for each, and greedily picks the one that organizes the data the fastest!
The Mathematics
Information Gain sits at the core of the ID3 algorithm, a top-down, recursive procedure that compares the entropy before and after each candidate split:
How does it work?
- Calculate the initial entropy of the dataset.
- Partition the data using every possible feature and cutoff value.
- For each potential split, compute the new entropy (a weighted average of the entropies of the resulting child branches).
- The split that provides the maximum Information Gain becomes the new decision node.
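The steps above can be sketched in Python. The helpers below assume Shannon entropy in bits, and the candidate thresholds are the midpoints between consecutive sorted feature values (one common convention; implementations vary). The fruit data is invented purely for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

def best_split(xs, ys):
    """Greedily pick the threshold on a single numeric feature that
    maximizes Information Gain."""
    best_t, best_gain = None, -1.0
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between consecutive feature values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = information_gain(ys, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Made-up data: small fruit are grapes, large fruit are apples.
diameters = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
fruit = ["grape", "grape", "grape", "apple", "apple", "apple"]
print(best_split(diameters, fruit))  # (0.45, 1.0)
```

A full tree builder would run this search over every feature, then recurse into each child partition until the leaves are pure (or some stopping rule fires).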
Of course, reading the formal steps of an algorithm isn't always the easiest way to learn. Let's see it in action!
Recall the decision tree we were building earlier. Our very first decision node split the data based on the condition Diameter ≤ 0.45.
How did the algorithm magically know to pick that exact property and number? It did so by calculating the Information Gain for every candidate split and keeping the one with the highest value.
In the interactive chart below, the black line tracks the Information Gain for every possible split value on the Diameter feature.
Move the decision boundary left and right yourself. Notice how the highest point on the graph perfectly matches the algorithm's chosen split at Diameter = 0.45 (an Information Gain of 0.574).
A Note On Information Measures
An alternative to entropy is the Gini impurity. Both metrics measure the purity of a dataset. While Decision Trees trained with either criterion often yield nearly identical results, Gini impurity is sometimes preferred in production because it is cheaper to compute: it avoids logarithms entirely.
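For comparison, a minimal Gini impurity sketch (the function name is ours). Note that it uses only squaring, no logarithms:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the chance that two labels drawn at random
    (with replacement) from the set disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(list("AAAA")))  # 0.0  -- a pure node
print(gini(list("AABB")))  # 0.5  -- maximally mixed for two classes
```

You can drop this in wherever entropy was used: the split search stays the same, only the impurity measure changes.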