If you have an imbalanced dataset, “you can change the dataset that you use to build your predictive model to have more balanced data.
“This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:
- “You can add copies of instances from the under-represented class, called over-sampling (or more formally sampling with replacement), or
- “You can delete instances from the over-represented class, called under-sampling…
“…These approaches are often very easy to implement and fast to run. They are an excellent starting point.
“You can learn a little more in the Wikipedia article titled “Oversampling and undersampling in data analysis”.
“Some Rules of Thumb
- “Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)
- “Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or less)…” (8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset by Jason Brownlee on August 19, 2015, emphasis added on what I believe is the main answer to my question)
Note that the author of the article lists several other methods to help with imbalanced datasets if you are interested in that.
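To make the two sampling methods from the quote concrete, here is a minimal sketch using only Python's standard library. The function names (`group_by_label`, `over_sample`, `under_sample`), the `seed` parameter, and the toy data are assumptions made up for illustration; they do not come from the article. It does plain random sampling and is meant as a starting point, not a definitive implementation.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows, labels):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def over_sample(rows, labels, seed=0):
    """Duplicate random rows of the smaller classes until every class
    is as large as the largest one (sampling with replacement)."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw the missing rows with replacement; k is 0 for the largest class.
        extra = rng.choices(group, k=target - len(group))
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def under_sample(rows, labels, seed=0):
    """Keep a random subset of the larger classes so every class
    is as small as the smallest one (sampling without replacement)."""
    rng = random.Random(seed)
    groups = group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, k=target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy imbalanced dataset (made up): 90 rows of class 0, 10 rows of class 1.
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

over_rows, over_labels = over_sample(rows, labels)
under_rows, under_labels = under_sample(rows, labels)

print(Counter(over_labels))   # 90 rows of each class
print(Counter(under_labels))  # 10 rows of each class
```

Whichever method you try, it is usually safer to resample only the training portion of your data, so that duplicated rows do not leak into the data you evaluate on.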
Disclaimer:
I am not a professional in this field, nor do I claim to know all of the jargon that is typically used in this field. I am not summarizing my sources; I simply read from a variety of websites until I feel like I understand enough about a topic to move on to what I actually wanted to learn. If I am inaccurate in what I say or you know a better, simpler way to explain a concept, I would be happy to hear from you :).