Looking into Maximum Spacing Estimation (MSP) & ML.

Maximum spacing estimation (MSE or MSP) is one of those not-so-well-known statistical tools that are good to have in your toolbox if you ever bump into a misbehaving maximum likelihood (ML) estimation. Finding material about it is a bit tricky, because if you search for MSE, "Mean Squared Error" will be among the top hits. The Wikipedia page will give you a pretty good idea, so click here to check it while you are at it.

Here is a summary for the lazy: MSE, or maximum spacing estimation, is about choosing the parameters of a distribution function (DF) so that the geometric mean of the "spacings" in the data is maximized. These "spacings" are the differences between the values of the cumulative distribution function (CDF) at neighbouring data points. The method is also known as maximum product of spacings (MSP), because that is exactly how you calculate it. The idea is to choose the parameter values that make the observed data look as uniform as possible, for a specific quantitative measure of uniformity.
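In symbols: writing $x_{(1)} \le \dots \le x_{(n)}$ for the ordered sample and adopting the conventions $F(x_{(0)};\theta) = 0$ and $F(x_{(n+1)};\theta) = 1$, the MSP estimate maximizes the average log-spacing:

$$D_i(\theta) = F(x_{(i)};\theta) - F(x_{(i-1)};\theta), \qquad \hat{\theta}_{\mathrm{MSP}} = \arg\max_{\theta \in \Theta} \, \frac{1}{n+1} \sum_{i=1}^{n+1} \ln D_i(\theta).$$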

So we can explain what happens here by means of an exercise, for which we made a notebook on our github (click here). We start with the assumption of a variable X with a CDF $F(x;\theta_0)$, where $\theta_0 \in \Theta$ is an unknown parameter to be estimated, and from which we can take iid random samples; the spacings $D_i$ over which we estimate the geometric mean are exactly the differences defined above. For the giggles, let's say our DF is a Pareto I with shape parameter α=3.0 and left limit at 1, so that $F(x;\alpha) = 1 - x^{-\alpha}$ for $x \ge 1$. This α is the θ parameter we intend to estimate. We can draw some samples from our distribution and construct an empirical CDF out of them. Doing the same for other shape values and plotting those CDFs together ends up looking like this:

[Figure: empirical CDFs of Pareto I samples for several values of the shape parameter α]
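A minimal sketch of how a figure like that can be generated. The sample size and the list of α values here are assumptions for illustration, not necessarily the notebook's exact settings:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
alphas = [1.0, 2.5, 3.0, 5.0]  # shape values to compare (an assumption)
n = 500

for alpha in alphas:
    # scipy's pareto(b) is a Pareto I with shape b and left limit at 1
    x = np.sort(stats.pareto.rvs(alpha, size=n, random_state=rng))
    ecdf = np.arange(1, n + 1) / n  # empirical CDF: i/n at the i-th order statistic
    plt.step(x, ecdf, where='post', label=f'alpha={alpha}')

plt.xlim(1, 6)
plt.xlabel('x')
plt.ylabel('F(x)')
plt.legend()
plt.show()
```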

The first thing you will observe is that the bigger the α, the closer to each other these distributions look. For instance, the CDFs for α=5.0 and α=2.5 are much closer to each other than the CDFs for α=2.5 and α=1.0. That is an interesting fact to consider when using this estimator: the higher the α, the easier it will probably be to get a "confused" result. So α=3.0 is not exactly the easiest choice of value for messing around and looking at how our estimator behaves, but it is not too difficult either.

Now, back to the estimator. In this post we made a very simple exercise to look at how ML and MSE behave when estimating the α parameter of a Pareto I distribution. The choice of shape parameter and distribution is completely arbitrary; in fact, I encourage you to take my code and try other distributions and other values yourself. Also, to make my small laptop's life easier, I selected a subset of α values over which the search was made. Then, we obtained the best scores of each method for different sample sizes:

10, 50, 100, 300, 500, 10000

For each sample size and each method, we repeated the estimation 400 times. A box-and-whisker plot of the resulting estimates, together with a scatterplot per sample size and method, can be seen here:

[Figure: box-and-whisker plots with overlaid scatter of the α estimates, per sample size and method]

The dotted line at α=3.0 marks the shape parameter of the original samples. As you can see, ML behaved much better than MSE for small sample sizes. For both methods, small samples gave estimates skewed towards larger values; this is expected, since the CDFs of this distribution get closer together as the shape parameter grows. ML also behaved better than MSE for 50, 100 and 300 samples. Now, this is only one result, for this distribution and for α=3.0. A more definitive quantitative evaluation would require looking at more distributions, and at different points of those distributions. There was a paper suggesting that "J"-shaped distributions are the strong point of MSE. I guess a more thorough quantitative evaluation of MSE vs ML would be in order for a decision here. It would also be worthy of a journal paper, in addition to a small blog post such as this one ;). You can cite me too if you like to use my code ;).
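Speaking of the code, here is a minimal sketch of one round of that comparison. The grid bounds and resolution are assumptions for illustration; the notebook has the full loop over sample sizes and the 400 repetitions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha_true = 3.0
n = 100
x = np.sort(stats.pareto.rvs(alpha_true, size=n, random_state=rng))
grid = np.linspace(1.0, 6.0, 501)  # search grid over alpha (an assumption)

def msp_score(alpha):
    # average log-spacing, with the conventions F(x_(0))=0 and F(x_(n+1))=1
    cdf = np.concatenate(([0.0], stats.pareto.cdf(x, alpha), [1.0]))
    return np.mean(np.log(np.diff(cdf)))

def log_likelihood(alpha):
    return np.sum(stats.pareto.logpdf(x, alpha))

alpha_msp = grid[np.argmax([msp_score(a) for a in grid])]
alpha_ml = grid[np.argmax([log_likelihood(a) for a in grid])]
print(f'MSP estimate: {alpha_msp:.2f}, ML estimate: {alpha_ml:.2f}')
```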

When I first found this estimator, I have to admit it caused a bit of infatuation in me. Some mathematical concepts carry themselves with such beauty that you can't help feeling an attachment, and you see everything about them through rose-tinted lenses. It made me wonder why on earth this concept is not as well known as ML. It is not that much more expensive, depending on how you calculate it. For ML you need a density function, while for MSE all you need is a cumulative distribution function. That alone is a powerful point, because all random variables have a CDF, but not all of them have a PDF. Granted, most distributions you will work with will probably have a PDF. Granted, ML is the workhorse of estimators: many toolboxes implement it, it is super easy to teach to undergrads, and it works well. In fact, it worked better than MSE in this particular example. So in spite of all the beauty, maybe the fitness function for mathematical concepts to last for posterity is not beauty or elegance, but applicability.

BONUS POINTS IF YOU ARE LOOKING TO WORK WITH US: Blow my mind. Try this exercise with other distributions. Make other comparisons. Make me stop loving MSE. Or make me love it even more. I have to do something about the butterflies in my stomach! You are my only hope!

As always, you can find the code for generating the plots of this post in our notebook (click here).

Image taken from stocksnap.io.

Trying out Copula packages in Python – II

And here we go with the copula package in (the sandbox of) statsmodels! You can look at the code first here.

I am in love with this package. I was in love with statsmodels already, but this tiny little copula package has everything one can hope for!

[Image: "suddenly the world seems such a perfect place" – summarizing my feelings about this package]

First Impressions

At first I was not sure about it. It looks deceptively raw, and one can understand why it would not be fair to compare it with the more mature packages in statsmodels. Googling for examples turned up nothing, not even in the statsmodels documentation. In fact, you had to dig deep to find out that this piece of code even existed.

There are no built-in methods to calculate the parameters of the Archimedean copulas, and elliptic copulas are not implemented at all. However, elliptic copulas are quite vanilla, and you can implement the methods yourself. We missed the convenience of a built-in method for transforming your data into uniform marginals, but you can implement that yourself too: either fit a scipy distribution to your data and then run the samples through its CDF method, or work with an empirical CDF. Both approaches are implemented in our notebook, and sketched just below.
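A minimal sketch of both routes; the gamma toy data is an assumption for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = stats.gamma.rvs(2.0, size=500, random_state=rng)  # toy data (an assumption)

# Route 1: parametric. Fit a scipy distribution, then push the data
# through the fitted CDF to get (approximately) uniform marginals.
shape, loc, scale = stats.gamma.fit(x)
u_parametric = stats.gamma.cdf(x, shape, loc=loc, scale=scale)

# Route 2: empirical CDF via ranks, rescaled into the open interval
# (0, 1) so that copula densities stay finite at the edges.
u_empirical = stats.rankdata(x) / (len(x) + 1)
```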

So in order to actually use the functions in this package, you have to write your own code for estimating the parameters of your Archimedean copula (we borrowed some code from the copulalib package for that purpose), for transforming your variables into uniform marginals, and for actually doing anything with the copula. However, as it is, it is quite flexible, and it is good that the developers decided to keep it anyway.
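For reference, the usual shortcut for the Archimedean parameters is to invert the known relations between Kendall's tau and θ. This is a sketch of that idea rather than the exact copulalib code; the Clayton and Gumbel closed forms are the standard ones, while Frank needs a numerical solve:

```python
from scipy import stats

def archimedean_theta(u, v, family):
    # Moment-style estimate: compute Kendall's tau from the uniform
    # marginals, then invert the tau-theta relation of the family.
    tau, _ = stats.kendalltau(u, v)
    if family == 'clayton':   # tau = theta / (theta + 2)
        return 2.0 * tau / (1.0 - tau)
    if family == 'gumbel':    # tau = 1 - 1/theta
        return 1.0 / (1.0 - tau)
    raise ValueError('frank requires numerically inverting its tau relation')
```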

Hands on!

Alright, check out our notebook at github.

Trying out Copula packages in Python – I

You may ask, why copulas? We do not mean these copulas. We mean the mathematical concept. Simply put, copulas are joint distribution functions with uniform marginals. The kicker is that they allow you to study dependencies separately from marginals. Sometimes you have more information on the marginals than on the joint distribution of a dataset, and copulas allow you to build "what if" scenarios about the dependency. A copula can be obtained by fitting a joint distribution to the uniform margins produced by the quantile transformation, that is, by running each variable of interest through its CDF. For more on them, do check out chapter 7 of these slides.
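To make that last sentence concrete, here is a tiny sketch assuming correlated Gaussian toy data (the correlation value is an assumption): after the quantile transform each margin is uniform, but the dependence survives.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho = 0.8  # dependence strength (an assumption)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=2000)

# Quantile (probability integral) transform: each margin becomes uniform
u = stats.norm.cdf(xy[:, 0])
v = stats.norm.cdf(xy[:, 1])

tau, _ = stats.kendalltau(u, v)
print(f"Kendall's tau of the uniform margins: {tau:.2f}")  # clearly non-zero
```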

This post is about Python (with numpy, scipy, scikit-learn, statsmodels and other good stuff you can find in Anaconda), but R is fantastic for statistics. I repeat: fan-tas-tic. If you are serious about working with statistics, it doesn't matter whether you like R or not, you should at least check it out and see what packages are there to help you. Chances are, someone may have built what you need. And you can drive R from Python (it needs some setup, for example via rpy2).

So of course there is an R package for working with copulas, named (with all logic) "copula". The package was built by Marius Hofert, Ivan Kojadinovic, Martin Maechler, and Jun Yan, and is maintained by Martin Maechler.

With all of this said about how wonderful R is, we are still making a post about how to work with a particular mathematical tool in Python, because as awesome as R is, Python has incredible flexibility for other matters.

Most of the upcoming content in this post will be built with Jupyter Notebooks. I am a recent user, and I love them! StatsLetters will keep some notebooks with Python/R examples on github.

The packages

I was surprised by the fact that there is no explicit copula implementation in scikit-learn (huh?) or scipy (gasp!). However, statsmodels does have an implementation going in its sandbox, which we will try out in a future post:

statsmodels.sandbox.distributions.copula

The package we will try today is copulalib. It is available in Anaconda, and you can pip-install it. It has a small bug that you can fix yourself; both the bug and the small fix are described in the notebook. I contacted the author of the package to let him know about this, but I am not sure whether the package is still maintained, so I decided to include the bugfix in the notebook anyway.
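For orientation, usage looks roughly like this. The class and method names reflect the copulalib version we tried, so treat them as assumptions and double-check against your installed version and the notebook:

```python
import numpy as np
from copulalib.copulalib import Copula

x = np.random.gamma(2.0, size=300)    # toy data (an assumption)
y = x + np.random.normal(size=300)    # a dependent companion variable

frank = Copula(x, y, family='frank')  # 'clayton' and 'gumbel' also exist
print(frank.tau, frank.theta)         # Kendall's tau and the fitted parameter
u, v = frank.generate_uv(1000)        # sample uniform pairs from the fitted copula
```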

And with no more introduction, click here to enjoy our notebook!