

Random Forest

Variance in Composition

Reading time: ~10 min

In the previous chapter, we learned that a single Decision Tree suffers from high variance: it is unstable and sensitive to even small changes in its training data.

In a Random Forest, however, this "flaw" becomes a superpower! Leo Breiman, the inventor of the Random Forest, showed that the key to a strong forest is having trees that are as different from each other as possible. If every tree made the exact same mistakes, the group vote wouldn't help. Because high-variance trees react strongly to any change in their training data, randomizing that data produces trees with very low correlation, meaning they offer genuinely diverse perspectives.
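The variance-reduction effect can be checked with a toy simulation (purely illustrative, not the course's dataset): averaging nine uncorrelated noisy estimators cuts the variance by roughly a factor of nine, while averaging nine perfectly correlated ones helps not at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_trials = 9, 10_000

# Each "tree" predicts the true value 0 plus its own noise (variance = 1).
independent = rng.normal(0, 1, size=(n_trials, n_trees))

# Perfectly correlated trees: every tree shares the SAME noise.
shared = rng.normal(0, 1, size=(n_trials, 1)).repeat(n_trees, axis=1)

var_single = independent[:, 0].var()
var_independent_avg = independent.mean(axis=1).var()  # ~ 1 / n_trees
var_correlated_avg = shared.mean(axis=1).var()        # ~ 1 (no improvement)

print(f"single tree:        {var_single:.2f}")
print(f"9 uncorrelated avg: {var_independent_avg:.2f}")
print(f"9 correlated avg:   {var_correlated_avg:.2f}")
```

This is exactly why low correlation between trees matters: averaging only helps to the extent that the trees' errors are independent.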

The Bagging Method and Feature Selection (which you just learned about above) are the two tricks the algorithm uses to deliberately force these trees to be different from one another.
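A minimal sketch of those two tricks in Python, assuming scikit-learn is available. For brevity each tree here gets one fixed random feature subset; Breiman's actual algorithm re-samples the candidate features at every split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=9, rng=None):
    """Sketch of a forest: bagging + random feature subsets per tree."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_samples, n_features = X.shape
    k = max(1, int(np.sqrt(n_features)))  # features each tree may see
    forest = []
    for _ in range(n_trees):
        # Trick 1 (bagging): bootstrap-sample the rows, with replacement.
        rows = rng.integers(0, n_samples, size=n_samples)
        # Trick 2 (feature selection): each tree sees a random feature subset.
        cols = rng.choice(n_features, size=k, replace=False)
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, X):
    # Majority vote across trees (binary 0/1 labels assumed).
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

Both sources of randomness feed the same goal: no two trees see quite the same data, so no two trees make quite the same mistakes.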

We just constructed a Random Forest on our street sign dataset. But just how different are the individual trees in reality? To find out, we've trained a nine-tree Random Forest and plotted it for you below.

Hover your mouse over the inner circles (which represent the individual trees) to see their unique accuracies and feature importance scores:

Notice how chaotic the individual trees seem? There is almost no obvious pattern! Some trees only care about two out of the four features.
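If you want to reproduce this kind of inspection yourself, here is a sketch using scikit-learn. The street-sign data isn't included here, so `make_classification` stands in with four synthetic features; the exact numbers will differ from the chart.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: four features, like our street sign example.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=9, random_state=0).fit(X_tr, y_tr)

# Each fitted tree exposes its own accuracy and feature importances.
for i, tree in enumerate(forest.estimators_):
    print(f"tree {i}: accuracy={tree.score(X_te, y_te):.2f}, "
          f"importances={np.round(tree.feature_importances_, 2)}")
print(f"forest:  accuracy={forest.score(X_te, y_te):.2f}")
```

Run it a few times with different seeds and you'll see the same chaos as in the chart: the per-tree importance scores jump around, while the forest's overall accuracy stays comparatively stable.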

Test your knowledge: Based on the chart above, which entity achieves the absolute highest accuracy?

The Random Forest model as a whole (the outer circle).
Tree #4, because it perfectly memorized the size feature.
All of the inner trees have exactly the same accuracy.

Embracing Diversity

The red dots on the right confirm it: the Random Forest outperforms every single individual tree. This shows that a group of weak, highly chaotic models can combine into a mathematically superior predictor. The wide range of unique feature importance scores across our trees is exactly what makes the forest so powerful!

Variance in Predictions

If every tree produced the exact same prediction, having a forest wouldn't improve our accuracy at all! For a Random Forest to succeed, the trees need to make different mistakes.

Below, each circle represents a single prediction. Treat each row as a different tree, and each column as a specific data point from our test set. Blue means "Yes" and pink means "No". A solid color means the tree guessed correctly, while a striped pattern means the tree got that point wrong.

The highly irregular pattern in the grid shows that the individual trees make their mistakes in completely different places. This is exactly what we want! Because the errors are spread across different data points, the overall Random Forest (which aggregates their votes) achieves the highest total accuracy.
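The same effect can be checked numerically. The grid below is a hand-made toy example (not the actual chart data): three trees that are each only 70% accurate, but whose mistakes never land in the same column, vote their way to a perfect score.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])

# Three hypothetical trees, each wrong on 3 of 10 points (70% accurate),
# but their mistakes land in DIFFERENT columns.
tree_preds = np.array([
    [0, 0, 1, 1, 1, 1, 0, 1, 1, 0],  # wrong at columns 0, 4, 7
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],  # wrong at columns 1, 2, 6
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # wrong at columns 3, 5, 8
])

# Majority vote: at least 2 of the 3 trees must say "Yes".
majority = (tree_preds.sum(axis=0) >= 2).astype(int)

for i, p in enumerate(tree_preds):
    print(f"tree {i}: accuracy = {(p == y_true).mean():.0%}")
print(f"forest:  accuracy = {(majority == y_true).mean():.0%}")
```

Because at most one tree is wrong in any column, the other two always outvote it, and the ensemble scores 100% even though no individual tree does better than 70%.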

However, notice the very first column on the left. Even though three individual trees guessed correctly, the majority of the forest was wrong, so the final prediction failed. Observations like this inspired researchers to develop new, fully sequential ensemble methods called "Boosting", which remain extremely popular today.
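As a quick illustrative comparison (on stand-in synthetic data, not the street sign set), scikit-learn ships a classic boosting algorithm, AdaBoost, which trains its trees sequentially: each new tree re-weights the examples its predecessors got wrong, instead of voting independently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Forest: independent trees, parallel majority vote.
forest = RandomForestClassifier(n_estimators=9, random_state=0).fit(X_tr, y_tr)

# AdaBoost: sequential trees, each focusing on earlier mistakes.
boost = AdaBoostClassifier(n_estimators=9, random_state=0).fit(X_tr, y_tr)

print(f"random forest: {forest.score(X_te, y_te):.2f}")
print(f"adaboost:      {boost.score(X_te, y_te):.2f}")
```

Neither method wins universally; which one does better depends on the dataset, which is part of why both families are still in heavy use.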


Sina