Tidying Up the Framework of Dataset Shifts: The Example
I recently talked about the causes of model performance degradation, that is, the drop in prediction quality relative to the moment we trained and deployed our models. In that post, I proposed a new way of thinking about the causes of model degradation: a framework in which the so-called conditional probability emerges as the global cause.

The conditional probability is, by definition, composed of three probabilities, which I call the specific causes. The most important lesson from this restructuring of concepts is that covariate shift and conditional shift are not two separate or parallel concepts: conditional shift can happen as a function of covariate shift.
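To make the decomposition concrete, here is a minimal sketch using Bayes' rule, P(Y|X) ∝ P(X|Y)·P(Y). The class-conditional Gaussians, the priors, and the evaluation point are all hypothetical choices for illustration; the point is that shifting one of the specific causes (here, the prior P(Y)) changes the conditional probability, while resampling X alone would not.

```python
import numpy as np
from scipy.stats import norm

def p_y1_given_x(x, prior_y1, mu0, mu1, sigma=1.0):
    """P(Y=1 | X=x) built from the specific causes:
    the class-conditionals P(X|Y) and the prior P(Y)."""
    lik1 = norm.pdf(x, loc=mu1, scale=sigma) * prior_y1
    lik0 = norm.pdf(x, loc=mu0, scale=sigma) * (1 - prior_y1)
    return lik1 / (lik0 + lik1)

x = 0.5
before = p_y1_given_x(x, prior_y1=0.5, mu0=-1.0, mu1=1.0)

# Shift one specific cause, the prior P(Y): the conditional P(Y|X) changes.
after_prior_shift = p_y1_given_x(x, prior_y1=0.8, mu0=-1.0, mu1=1.0)

# Covariate shift alone (sampling X from a different distribution) would
# leave this function untouched: it does not depend on P(X) over the inputs.
print(before, after_prior_shift)
```

The design choice here is deliberate: the function takes only the specific causes as arguments, so any change in P(Y|X) must be traceable to one of them.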

With this restructuring, I believe it becomes easier to reason about the causes, and more natural to interpret the shifts we observe in our applications.

This is the scheme of causes and model performance for machine learning models:

In this scheme, we see the clear path that connects the causes to the prediction performance of our estimated models. One fundamental assumption we need to make in statistical learning is that our models are "good" estimators of the real models (real decision boundaries, real regression functions, etc.). "Good" can have different meanings, such as unbiased, precise, complete, or sufficient estimators. But, for the sake of simplicity and the upcoming discussion, let's say they are good in the sense that they have a small prediction error. In other words, we assume that they are representative of the real models.
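The "small prediction error" assumption can be checked on a toy problem where the real model is known. The sketch below assumes a hypothetical data-generating process (two Gaussian classes with means ±1), for which the real model, the Bayes-optimal rule, is a threshold at x = 0; an estimated threshold classifier should come close to it in error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: two Gaussian classes around -1 and +1.
n = 10_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

# Estimated model: threshold at the midpoint of the sample class means.
threshold = (x[y == 0].mean() + x[y == 1].mean()) / 2
est_error = np.mean((x > threshold).astype(int) != y)

# Real model: the known Bayes-optimal threshold at 0.
bayes_error = np.mean((x > 0).astype(int) != y)

# If the estimator is "good", the two error rates should be close.
print(est_error, bayes_error)
```

When the two error rates are close, the estimated model is representative of the real one in exactly the sense the scheme requires, and any later degradation can be attributed to the shifts, not to a poor fit.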
