

Decision Trees: Where To Partition?

Reading time: ~10 min

We know a Decision Tree asks a sequence of questions to split data into smaller, more organized groups. But how does the algorithm know which question to ask first? To figure this out, it uses a powerful concept borrowed from information theory called Entropy.

The Intuition

Think of entropy as a measure of messiness or surprise.

Imagine you have a jar filled entirely with red jellybeans. If you reach in blindly and pull one out, you know immediately it will be red. There is no surprise, and no messiness. This is a state of low entropy (also known as being highly "pure").

Now imagine a jar filled with fifty different flavors of jellybeans. Pulling one out is a gamble. This state of high uncertainty and maximum messiness is called high entropy (or "impure").

A Decision Tree's entire goal is to take a highly messy, high-entropy dataset and keep splitting it until it creates neat, low-entropy groups where data points share the same class!

The Mathematics

Mathematically, we quantify entropy using probabilities. If a set of classes occurs with probabilities (p_1, p_2, ..., p_n), the total entropy H is the negative sum of each probability weighted by its base-2 logarithm:

\displaystyle H = -\sum_{i=1}^{n} p_i \log_2(p_i)

This equation guarantees three properties:

  1. H = 0 if all but one of the p_i are zero (so the remaining probability is 1). The entropy vanishes when there is no uncertainty.
  2. H reaches its absolute maximum when all probabilities p_i are equal. This is the most uncertain or 'impure' situation.
  3. Any shift towards equalizing the probabilities (p_1, p_2, ..., p_n) increases the total entropy H.
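The three properties above can be verified directly with a short sketch (assuming Python; the `entropy` helper below is illustrative, not taken from any particular library):

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    # Terms with p = 0 contribute nothing (p * log2(p) tends to 0),
    # so we skip them to avoid a math domain error.
    return -sum(p * log2(p) for p in probs if p > 0)

# Property 1: a single certain outcome has zero entropy.
assert entropy([1.0, 0.0, 0.0]) == 0

# Property 2: equal probabilities maximize entropy, at log2(n) bits.
assert entropy([0.25, 0.25, 0.25, 0.25]) == 2.0

# Property 3: nudging probabilities toward each other raises entropy.
assert entropy([0.9, 0.1]) < entropy([0.6, 0.4])
```

Note that the sum skips zero-probability terms, matching the convention that 0 · log(0) = 0.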

Let's test this intuition. Sort the items below based on whether they represent a state of low entropy (pure) or high entropy (impure).

A bag of only red marbles
A coin with "Heads" on both sides
A thoroughly shuffled deck of cards
A bowl of mixed nuts
Pure (Low Entropy)
Impure (High Entropy)

Why does impurity matter?

Entropy helps us mathematically quantify the impurity of a collection of labeled data points. A node containing multiple classes is impure (high entropy), whereas a node containing only one class is pure (zero entropy).

Above, you can compute the exact entropy for a collection of labeled data points in a binary classification problem. Click the Add and Remove buttons to see how changing the mix of colors affects the math.

Notice how samples of a single class have an entropy of zero, while an even mix of classes yields the highest possible value? The algorithm uses this exact metric to hunt down the ideal split. We call the resulting drop in entropy the Information Gain.
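To make this concrete, here is one way Information Gain could be computed for a candidate split, as parent entropy minus the size-weighted entropy of the children (a hedged Python sketch; the `information_gain` helper and the example counts are hypothetical, not from the lesson):

```python
from math import log2

def entropy(counts):
    """Entropy, in bits, of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Drop in entropy achieved by splitting `parent` into `children`.

    `parent` is a list of class counts; `children` is a list of such
    lists, one per branch of the candidate split. Each child's entropy
    is weighted by the fraction of samples it receives.
    """
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# A perfect split of a 50/50 node into two pure branches gains 1 bit.
assert information_gain([5, 5], [[5, 0], [0, 5]]) == 1.0

# A split that changes nothing gains nothing.
assert information_gain([5, 5], [[5, 5]]) == 0.0
```

At each node, a Decision Tree would evaluate this quantity for every candidate question and ask the one with the highest gain.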

Sina