Testing Your AI
Which K Should You Pick?
So far, we have looked at the absolute extremes of choosing our k value. We can slice our data directly in half (k = 2), or go completely insane and create a fold for every single data point (LOOCV, where k = n).
So, what is the best k value to pick? To answer this, we need to balance Bias against Variance.
Managing Bias
In this specific context, "bias" simply refers to how much a model struggles because it wasn't given enough study material.
Because a simple Validation Split (k = 2) permanently cuts half the data away from training, the AI is starved for information. This leads to wildly pessimistic, highly biased grades. On the absolute flip side, LOOCV gives the AI access to almost 100% of the data during every test, giving us a practically unbiased grade.
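To make that concrete, here is a quick sketch of how much training data each model actually sees at different k values. The dataset size (n = 100) is just an assumption for illustration:

```python
# How much training data each fold's model gets to study, for different k.
# Assumes a hypothetical dataset of n = 100 points.
n = 100

for k in (2, 5, 10, n):  # k = n is LOOCV
    train_size = n - n // k  # each model trains on everything except one fold
    label = "LOOCV" if k == n else f"k = {k}"
    print(f"{label:>7}: trains on {train_size}/{n} points")
```

At k = 2 every model studies only half the data, while at k = n (LOOCV) every model studies 99 of the 100 points.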
So if you want to completely eliminate bias and get the most realistic grade possible, you theoretically need to crank your k value all the way up to LOOCV (k = n).
Surviving Variance
If high k values are practically unbiased, why don't we always crank them up? Aside from the massive computing costs, increasing k actually runs the risk of increasing variance!
When you set k extremely high, every single model in your rotation is training on almost exactly the same massive pool of data. Because the models are basically identical twins, if there is a weird quirk or outlier hidden in that dataset, every single model will fall for it.
By choosing a middle-ground k value, the folds' training sets remain different enough from each other that the models don't all blindly memorize the exact same mistakes!
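The "identical twins" effect can be quantified: with k folds, any two models' training sets share a fraction (k - 2)/(k - 1) of their points. A tiny sketch:

```python
# How much of their training data two different folds' models share.
# Each training set leaves out one fold of size n/k, so two training
# sets overlap on n - 2n/k points out of n(k - 1)/k, i.e. (k - 2)/(k - 1).
for k in (2, 5, 10, 100):
    overlap = (k - 2) / (k - 1)
    print(f"k = {k:>3}: training sets share {overlap:.0%} of their points")
```

At k = 2 the two training sets share nothing, while at k = 100 they are 98% identical, which is exactly why very high k produces near-clone models.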
The Goldilocks Zone
Data science is almost always about finding the practical middle ground between two mathematical extremes. To strike the right balance between reducing bias and surviving variance, practitioners almost always default to k = 5 or k = 10. It's the industry-standard Sweet Spot!
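As a minimal sketch of what that sweet-spot split looks like under the hood, here is a hand-rolled k = 5 split over a hypothetical toy dataset of 20 points (in a real project you would typically reach for a library helper such as scikit-learn's KFold instead):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))          # this fold grades
        train = [i for i in range(n) if i not in test]   # the rest trains
        yield train, test
        start += size

# With n = 20 and k = 5, each model trains on 16 points and is graded on 4.
for train, test in k_fold_indices(20, 5):
    print(f"train on {len(train)} points, grade on {len(test)} points")
```

Every data point lands in exactly one test fold, so each point gets graded exactly once across the rotation.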