Useful rules of thumb for bounding random variables (Part 1)

 

In a previous post we were discussing the pros and cons of parametric and non-parametric models, and how they can complement each other. In this post, we will add a little more to the story. More specifically, we are going to talk about bounds on the probability that a random variable deviates from its expectation. In these two posts we will talk about the following well-known techniques:

  1. Chebyshev’s Inequality
  2. Markov’s Inequality
  3. Chernoff’s Bounds

They exploit knowledge of some feature of the random variable, e.g. the variance of its distribution, the fact that it is a sum of independent Poisson trials, or the fact that it never takes negative values. You can find extensions for other properties of the random variable under “tail inequalities” or “deviation inequalities”. Techniques 1 and 2 were developed in the late 1800s. The last one is a bit more modern (the author is still alive!).

Why would you want to estimate bounds on the probability that a random variable deviates from its expectation? Some applications rely on expected values to make estimates/predictions, and in many cases this is a reasonable thing to do. If you want to use an expected value to represent a random variable, or as part of another technique that feeds a decision-making process, then it is sensible to also provide some bounds on the probability that the random variable deviates from that expectation.

If you are lucky enough to find a problem (or a simplification of a problem) that satisfies the conditions for all three techniques, the order of “tightness” of the bounds is Chernoff-Chebyshev-Markov, with Chernoff being the tightest. This is possibly the reason why, in spite of Chebyshev’s inequality being older, many textbooks choose to talk about Markov’s inequality first. It is not rare to see authors use Markov’s inequality in the proof of Chebyshev’s. Actually, it is not rare at all to run into Markov’s inequality while going through proofs, simply because it requires so little from your random variable; it has become a staple in the pantry of statisticians.

Chebyshev’s Inequality

Take-home message: “The probability that a random variable is at least t standard deviations away from its expectation is at most 1/t^2”

Condition: We know the variance of the distribution.
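
If you want to see it in action, here is a tiny Python sketch. Everything in it is an arbitrary choice of mine (the exponential distribution, the sample size, the values of t), and I use the sample mean and standard deviation as stand-ins for the true ones:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=100_000)  # any distribution with finite variance will do

    # sample mean / std as stand-ins for the true expectation and standard deviation
    mu, sigma = x.mean(), x.std()
    for t in (1.5, 2.0, 3.0):
        empirical = np.mean(np.abs(x - mu) >= t * sigma)
        print(f"t={t}: empirical={empirical:.4f}  Chebyshev bound={1 / t**2:.4f}")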

Boring footnote: Some notes use the distance from the expectation and the distance from the mean interchangeably, which for a big enough number of samples is reasonable. I chose the expectation because the demonstration of Chebyshev via Markov applies Markov to a random variable built from the absolute value of the difference between your random variable and its expectation, so it is easy to remember. But both are probably just as fine.

Markov’s Inequality

Take-home message: “The probability that a random variable takes values greater than or equal to t times its expectation is less than or equal to 1/t.”

Condition: The random variable must not take negative values.
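
Same game, with another arbitrary (but non-negative) distribution of my choosing:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=3.0, size=100_000)  # gamma variates are never negative

    mu = x.mean()  # sample mean as a stand-in for the true expectation
    for t in (2.0, 5.0, 10.0):
        empirical = np.mean(x >= t * mu)
        print(f"t={t}: empirical={empirical:.4f}  Markov bound={1 / t:.4f}")

You will notice the bound is quite loose; that is the price of assuming so little about the random variable.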

Boring footnote: It would make more pedagogical sense to start with Markov’s inequality, but I felt I needed to do some historical justice. If I find something older and just as useful as Chebyshev’s, I will probably write a newer post and scorn myself. Like this.

Ok, sorry about that link. Here, have some eyebleach.

Chernoff’s Bounds

Take-home messages:

“The probability that a random variable X built from the sum of independent Poisson trials takes a value below (1-δ) times its expectation μ (for 0 < δ < 1) is less than exp(-μ δ^2 / 2).”

“For the same random variable X, the probability that X takes a value above (1+δ) times μ (for δ > 0) is less than (exp(δ) / (1+δ)^(1+δ))^μ.”

Conditions: The random variable must be a sum of independent Poisson trials.
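
And a sketch for this case too, with made-up success probabilities for the trials, checking both tails against their bounds:

    import numpy as np

    rng = np.random.default_rng(2)
    p = rng.uniform(0.1, 0.9, size=50)        # made-up success probabilities for the Poisson trials
    mu = p.sum()                              # expectation of the sum X
    trials = rng.random((200_000, 50)) < p    # each row is one realization of the 50 trials
    X = trials.sum(axis=1)

    for delta in (0.1, 0.2, 0.3):
        lower_emp = np.mean(X < (1 - delta) * mu)
        upper_emp = np.mean(X > (1 + delta) * mu)
        lower_bound = np.exp(-mu * delta**2 / 2)
        upper_bound = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
        print(f"delta={delta}: P[X<(1-d)mu]={lower_emp:.4f} (bound {lower_bound:.4f})  "
              f"P[X>(1+d)mu]={upper_emp:.4f} (bound {upper_bound:.4f})")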

This post will hopefully motivate you to study this topic on your own. If you do, some books you can check out:

 

The beautiful image for the preview is from here.

Book Chapter Review: If your model is mis-specified, are you better off?

(The cover image is from https://stocksnap.io/photo/HN6OJPDCXD.)

This post is my interpretation of Chapter 10 of the book “Advanced Data Analysis from an Elementary Point of View”. It is one of the most interesting reads I have found in quite some time (together with this).

Actually, the original title for the post was “Book Chapter review: Using non-parametric models to test parametric model mis-specifications”. But shorter titles tend to attract more viewers.

The first question one might ask is “If we are going to use a non-parametric model to test a parametric model, why not go straight to non-parametric modelling instead?”. Well, there are advantages to having a parametric model if you can build one. It may be of interest for your application to express your process in terms of known mathematical models. Also, a well-specified parametric model can converge faster to the true regression function than a non-parametric model. Finally, if you only have a small number of samples, you may get better predictions from a parametric model (even a slightly mis-specified one). The reason is that, for the same number of samples, a parametric model tends to have significantly less variance in its predictions than a non-parametric model; the price you pay for a mis-specified model is a bias that does not go away. Now, if you have a cheap and fast way to obtain many more samples, the non-parametric model will eventually make better predictions than the mis-specified parametric model, because its error keeps shrinking with more data while the parametric model stays stuck with its mis-specification bias.

This is a consequence of the bias-variance trade-off that the author explains in a way that a person without a background in statistics can understand (in chapter 1.1.4 of the same book).
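
If you prefer to see this trade-off with numbers, here is a minimal simulation sketch. Everything in it is made up for illustration (the “true” regression function, the noise level, the bandwidth, the sample sizes), and it compares the average test error of a deliberately mis-specified linear fit against a simple Nadaraya-Watson kernel smoother as the sample size grows:

    import numpy as np

    rng = np.random.default_rng(3)

    def true_f(x):
        # made-up "true" regression function: mostly linear, with a small wiggle
        return 2.0 * x + 0.4 * np.sin(3.0 * x)

    def avg_mse(n_train, n_rep=300, bandwidth=0.3, noise_sd=1.0):
        """Average test MSE of a mis-specified linear fit vs a kernel smoother."""
        x_test = np.linspace(0.0, 3.0, 100)
        mse_lin, mse_ker = [], []
        for _ in range(n_rep):
            x = rng.uniform(0.0, 3.0, n_train)
            y = true_f(x) + rng.normal(0.0, noise_sd, n_train)

            # mis-specified parametric model: a straight line y = a*x + b
            a, b = np.polyfit(x, y, deg=1)
            pred_lin = a * x_test + b

            # non-parametric model: Nadaraya-Watson smoother with a fixed bandwidth
            w = np.exp(-0.5 * ((x_test[:, None] - x[None, :]) / bandwidth) ** 2)
            pred_ker = (w * y).sum(axis=1) / w.sum(axis=1)

            mse_lin.append(np.mean((pred_lin - true_f(x_test)) ** 2))
            mse_ker.append(np.mean((pred_ker - true_f(x_test)) ** 2))
        return np.mean(mse_lin), np.mean(mse_ker)

    for n in (10, 2000):
        lin, ker = avg_mse(n)
        print(f"n={n}: mis-specified linear MSE={lin:.3f}  kernel smoother MSE={ker:.3f}")

I kept the bandwidth fixed to keep the sketch short; a fairer comparison would tune it by cross-validation, and the point where the kernel smoother overtakes the linear fit will move around as you change the amount of mis-specification and noise.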

Non-parametric models “keep an open mind” when building a model, while parametric models “follow a hunch”, if you will. One can find some similarities between modelling (parametric vs non-parametric) and search algorithms, more specifically between uninformed search (BFS, DFS, iterative deepening and variants) and informed search (say, A* and variants). With an admissible heuristic, even a relaxed version of the optimal one, A* is expected to explore less of the search space than any uninformed search algorithm. However, it can end up exploring much more of it if the heuristic is poorly constructed, and then it will most likely be outperformed by uninformed search. With this in mind, another question that may tickle you is: can one translate a modelling problem into a search problem and have a machine automatically and optimally find the best parametric model for a given problem? Oh, yes (you can leave that as background music for the day if you like). You will of course need to express the problem in a way that the program terminates before our sun dies, and an admissible heuristic also helps. But yeah, you can do that. Humans often solve harder problems than they give themselves credit for.

If you were to build such a machine, the author can give you some suggestions to check for errors in your model. For instance, he suggests that if you have a good reason to think that the errors in a model can ONLY come from certain mis-specifications (say, that instead of Y = θ1 X + ε it can only be Y = θ1 X + θ2 X^2 + ε, or maybe a couple of other forms), then it will probably be faster and less sample-hungry to simply check whether the estimated θ2 is significantly different from 0, or whether the residuals from the second model are significantly smaller than those from the first. However, when no good reason is available to argue for one source of mis-specification or another, you can use non-parametric regression to check for all of them by doing any of the following:

  • If the parametric model is right, it should predict as well as, or even better than, the non-parametric one. So you can check whether the difference between the mean squared errors of the two estimators is small enough (a rough sketch of this check appears right after this list).
  • If the parametric model is right, the non-parametric estimated regression curve should be very close to the parametric one. So you can check that the distance between the two regression curves is approximately zero at all points.
  • If the parametric model is right, then its residuals should be patternless and independent of the input features. So you can apply non-parametric smoothing to the parametric residuals and check whether their expectation is approximately zero everywhere.
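
Here is the promised rough sketch of the first check, with made-up data and a Nadaraya-Watson kernel smoother standing in for “the” non-parametric regression. It only computes the raw statistic; deciding how large a difference is “too large” needs a calibration step that I am not reproducing here:

    import numpy as np

    rng = np.random.default_rng(4)

    # made-up data: the truth has a mild quadratic term that the linear model will miss
    n = 300
    x = rng.uniform(-2, 2, n)
    y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, n)

    # parametric candidate: Y = theta0 + theta1 * X
    theta1, theta0 = np.polyfit(x, y, deg=1)
    pred_param = theta0 + theta1 * x

    # non-parametric stand-in: Nadaraya-Watson smoother with a fixed bandwidth
    h = 0.4
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    pred_nonparam = (w * y).sum(axis=1) / w.sum(axis=1)

    mse_param = np.mean((y - pred_param) ** 2)
    mse_nonparam = np.mean((y - pred_nonparam) ** 2)
    # a large positive difference is evidence against the parametric specification
    print(f"parametric MSE = {mse_param:.3f}, non-parametric MSE = {mse_nonparam:.3f}, "
          f"difference = {mse_param - mse_nonparam:.3f}")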

For the last method, I can elaborate on the explanation from the author. If the residuals are Y-f(x;θ), then the expected value of the residuals given an input X is E[Y-f(x;θ)|X] (the author did not make it explicit, but I assume the residuals must be calculated with x ⊆ X; I could be wrong on this, so please correct me if I am). Now, since our typical regression model is something along the lines of Y=f(X;θ)+ε, we can substitute this into the expectation and end up with E[f(x;θ)+ε-f(x;θ) | X]. Inside the expectation, f(x;θ)+ε-f(x;θ) = ε, so we are left with E[ε|X]. The noise term ε is assumed to be independent of X, so E[ε|X] = E[ε], and since the noise is also assumed to have zero mean, E[ε] = 0. So, no matter what input X we have, if the parametric form is right, the expected value of its residuals should be approximately equal to zero.
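
And a similar rough sketch for that last check, again with made-up data and a fixed-bandwidth kernel smoother (the formal version of this test also needs a calibration step that I am leaving out). If the parametric form were right, the smoothed residuals should hover around zero; here they will not, because the data is generated with a wiggle the linear model cannot capture:

    import numpy as np

    rng = np.random.default_rng(5)

    n = 500
    x = rng.uniform(-2, 2, n)
    y = 1.0 + 2.0 * x + 0.6 * np.sin(2.5 * x) + rng.normal(0, 0.5, n)  # truth is not exactly linear

    # fit the (mis-specified) parametric model Y = theta0 + theta1 * X and take its residuals
    theta1, theta0 = np.polyfit(x, y, deg=1)
    residuals = y - (theta0 + theta1 * x)

    # smooth the residuals non-parametrically (Nadaraya-Watson) on a small grid of inputs
    grid = np.linspace(-2, 2, 9)
    h = 0.3
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    smoothed = (w * residuals).sum(axis=1) / w.sum(axis=1)

    # for a correct specification these should all be close to zero
    for g, s in zip(grid, smoothed):
        print(f"E[residual | X ~ {g:+.1f}] ~= {s:+.3f}")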

So yes, under some circumstances (too few samples) you can actually be better off with a slightly mis-specified (i.e. relaxed) model than with a fully non-parametric model. And yes, you can indeed check whether the assumptions of your model are actually valid.

Have fun, and see you in the next post!