Suppose you have a population that is divided into various classes, such as “male, female” or “0-17 years old, 18-30 years old, 31-60 years old, 61+ years old.” If each class in your population has the same number of people/objects, taking a random sample of the population should give you a “balanced dataset.” On the other hand, if you take a random sample of a population that does not have the same number of people/objects in each class, you will likely end up with an “imbalanced dataset.”
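To make that concrete, here is a small sketch in plain Python with made-up population sizes: it draws a random sample from a population where one age group heavily outnumbers the others, and the sample comes out roughly as lopsided as the population itself.

```python
import random
from collections import Counter

# Hypothetical population: one age group vastly outnumbers the others.
population = (
    ["0-17"] * 9000
    + ["18-30"] * 500
    + ["31-60"] * 400
    + ["61+"] * 100
)

# A simple random sample tends to mirror the population's proportions,
# so the resulting dataset is imbalanced too.
sample = random.sample(population, 1000)
print(Counter(sample))  # roughly 900 "0-17", 50 "18-30", 40 "31-60", 10 "61+"
```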
Imbalanced datasets can be problematic in machine learning because the program may simply predict the most common category and still achieve a high degree of accuracy. For example, a program predicting the age of individuals in a middle school might learn that it is almost always right if it just predicts that each individual is under 18 years old.
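As a rough illustration of why that accuracy number can be misleading, the sketch below (again with invented numbers) scores a “classifier” that always predicts the majority class; it gets high accuracy without learning anything about the individuals.

```python
from collections import Counter

# Hypothetical labels for everyone at a middle school:
# nearly every individual is a student under 18.
labels = ["under 18"] * 950 + ["18 or older"] * 50

# A "model" that ignores its input and always predicts the most common class.
majority_class, _ = Counter(labels).most_common(1)[0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"Accuracy of always predicting '{majority_class}': {accuracy:.0%}")  # 95%
```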
Sources:
- “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” by Jason Brownlee, August 19, 2015
- “Oversampling and undersampling in data analysis,” Wikipedia, retrieved June 3, 2020
Disclaimer:
I am not a professional in this field, nor do I claim to know all of the jargon that is typically used in this field. I am not summarizing my sources; I simply read from a variety of websites until I feel like I understand enough about a topic to move on to what I actually wanted to learn. If I am inaccurate in what I say or you know a better, simpler way to explain a concept, I would be happy to hear from you :).