Decision Trees

The Problem of Perturbations
There is no denying that Decision Trees are incredibly powerful. They are fast to train, handle outliers gracefully, and unlike complex "black box" models, they are wonderfully easy for humans to interpret.
However, they suffer from one massive flaw: instability.
A standard Decision Tree is extremely sensitive to tiny changes—or perturbations—in the training data. Imagine giving a student a slightly different practice exam and having them completely rewrite their entire study guide because of one new question. That's how a Decision Tree reacts to minor data changes!
Check for yourself: simply shifting a random 5% of the training examples can cause the algorithm to produce a completely different tree structure.
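You can reproduce this yourself. The following is a minimal sketch (using scikit-learn and NumPy, which are assumptions, not tools named in the text): it trains one tree on a dataset, then nudges a random 5% of the examples and trains a second tree, and compares the sizes of the two fitted trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A synthetic classification dataset (hypothetical stand-in for any data).
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Perturb a random 5% of the training examples slightly.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=int(0.05 * len(X)), replace=False)
X_shifted = X.copy()
X_shifted[idx] += rng.normal(scale=0.3, size=X_shifted[idx].shape)

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_shifted, y)

# The fitted trees typically differ in node count and split structure,
# even though 95% of the data is identical.
print(tree_a.tree_.node_count, tree_b.tree_.node_count)
```

Comparing `node_count` is a crude proxy; inspecting `tree_.feature` and `tree_.threshold` shows the split-by-split differences in full.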
Why Is Instability A Problem?
In their raw, completely unrestricted form, Decision Trees are simply too eager to please.
If left completely unchecked, the algorithm will ruthlessly slice and dice the data until every single leaf node achieves perfect purity. As we just learned, this leads to incredibly deep trees that are highly susceptible to overfitting. They memorize random noise instead of learning the actual underlying pattern.
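To see this memorization in action, here is a small sketch (scikit-learn is an assumption; `flip_y` injects label noise so there is genuine noise to memorize). An unrestricted tree achieves perfect training accuracy on the noisy data while scoring noticeably worse on held-out examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly flips 20% of the labels: pure noise.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# No depth or leaf-size limits: the tree splits until every leaf is pure.
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

print(tree.get_depth())        # a deep tree
print(tree.score(X_tr, y_tr))  # 1.0 on training data: noise memorized
print(tree.score(X_te, y_te))  # noticeably lower on unseen data
```

The gap between the training and test scores is the overfitting the text describes: the extra depth buys nothing but memorized noise.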
Test your knowledge: How can we stop a Decision Tree from becoming a massive, overfitted mess?
Tree Pruning
We can force a tree to behave by pruning it. This means setting strict rules, such as limiting the maximum depth of the tree or requiring that every leaf contain a minimum number of data points, so that the algorithm stops splitting before it carves the data into perfectly pure slivers.
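In scikit-learn terms (an assumed library, not one the text names), the two rules just described correspond to the `max_depth` and `min_samples_leaf` hyperparameters. A quick sketch comparing an unrestricted tree with a pruned one:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y flips 20% of labels).
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=1)

# Unrestricted: grows until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=1).fit(X, y)

# Pruned: depth capped at 4, every leaf must hold at least 20 points.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                random_state=1).fit(X, y)

print(unpruned.get_depth(), pruned.get_depth())  # pruned depth <= 4
```

These are "pre-pruning" rules applied during growth; scikit-learn also offers post-growth cost-complexity pruning via the `ccp_alpha` parameter.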
But what about the issue of high variance and instability when the data changes? Unfortunately, that is an unavoidable side-effect of relying on a single Decision Tree.