Trying out Copula packages in Python – I

You may ask, why copulas? We do not mean this copulas. We mean the mathematical concept. Simply put, copulas are joint distribution functions with uniform marginals. The kicker, is that they allow you to study dependencies separately from marginals. Sometimes you have more information on the marginals than on the joint function of a dataset, and copulas allow you to build “what if” scenarios about the dependency. Copulas can be obtained by fitting a joint distribution to the uniformly distributed margins obtained from the quantile transformation on the cdf of your variable of interest.  For more on them, do check out chapter 7 of this slides.

This post is about Python (with numpy, scipy, scikit-learn, StatsModels and other good stuff you can find in Anaconda) but R is fantastic for statistics. I repeat: fan-tas-tic. If you are serious about working with statistics, it doesn’t matter whether you like R or not, you should at least check it out, and see what packages are there to help you. Chances are, someone may have built what you need. And you can work R from python (it needs some setup).

So of course there is an R package for working with copulas named -with all logic- “copula”. The package was built by Marius Hofert, Ivan Kojadinovic, Martin Maechler, and Jun Yan, and maintained by Martin Maechler.

With all of this about how wonderful R is, we are still making a post about how to work with a particular mathematical tool in python. Because as awesome as R is, python does have an incredible flexibility for otter other matters.

Most of the upcoming content in this post will be built with Jupyter Notebooks. I am a recent user and I love them!. StatsLetters will keep some notebooks with python/ R examples in github.

The packages

I was surprised by the fact that there was no explicit implementation of copula packages in scikit-learn (huh?) or scipy (gasp!). However, statsmodels does have an implementation going in their sandbox, which we will try out on a future post:


The package we will try today is copulalib. Is available in anaconda, and you can pip it. It has a small bug that you can fix yourself, and in the notebook the bug and the small fix are described. I contacted the author of the package, and let him know about this, but I am not sure if the package is still maintained, so I decided to anyway include the bugfix in the notebook.

And with no more introduction, click here to enjoy our notebook!


Sympathy for the Extreme


A handful of times in our lives, we can’t take our eyes out of something. Something so rare that we just want to keep our eyes on it one last instant before it vanishes. And when it does, the air slowly leaves our lungs and we wonder when will we ever experience something like that again.



What ?

Clears throat what I actually mean is that every now and then, a data science practitioner will be tasked with making sense out of rare, extreme situations.

The good news is, there exist mathematical tools that can help you make sense of extreme events. And some of those tools are structured under a branch of probability which has (conveniently) been named Extreme Value Theory (EVT).

Extreme value theory is concerned with limiting laws for extreme values in large samples. And this is a very important point: it requires LARGE amounts of samples, otherwise the errors for our estimations may become arbitrarily big.

How does that work?

An important concept on top of which extreme value theory builds upon, is the idea of a Maximum Domains of Attraction (MDA).

An analogy useful to explain MDAs is the idea of the Central Limit Theorem (CLT). If we recall, according to CLT, if you split a sequence into chunks of size n, the distribution of the means of those chunks will be gaussian. And the mean of such distribution (a.k.a. the distribution of sample means) will converge to the mean of the original sequence. For MDA, given the same sequence split into chunks, one can extract the maxima of each chunk and build a normalised sequence out of it (lets call that sequence Mn).

The super awesome result here, is that one can extract a CDF of normalizing sequences so that (Mn – dn)/cn converges in a certain distribution F^n (cn*x + dn) -> H(x). And that CDF named Fn ∈ MDA of H. All common continuous distributions are in the MDA of a GEV distribution. The restrictions on the “normalisers” are that cn>0. Under certain circumstances, cn and dn have simple analytical forms.

The awesomeness does not stop there. H follows a very specific type of distribution. And guess what. That distribution is NOT a Gaussian.


Now, remember all the ranting in previous posts about overusing gaussians?

Let me gorge a bit on that. Because this is a paradigm shift. Usually, your average practitioner will run, fit a gaussian, choose a multiple of a standard deviation, and claim “hic sunt dracones” (which is an oversimplification of univariate statistic process control). There seems to be nothing wrong with that right? except that it does not make much logical sense to try to define the extreme by defining the normal, if you think about it. I mean, you know quite a lot about what is normal. And you know that the extreme is not normal, but that’s about it. Without extreme value theory, you will otherwise have a huge question mark.

All right, back to the subject matter. Such distribution for H is known as the Generalized Extreme Value distribution (GEV). The GEV can be determined by two parameters: location and shape ξ. We can use any known techniques (e.g. MLE) to fit the parameters. The shape parameter ξ can used to describe the tail of H. The larger ξ -> Heavier tails.

  • For ξ > 0 case, H is a heavy tailed Fréchet.
  • For ξ < 0 case, H is a short tailed Wiebull.
  • For ξ = 0 case, H is an exponentially decaying Gumbel.

You can see the differences by looking at the following plot (obtained from here):


How do I use this?

There are two main methods: Block Maxima, and Peaks-over-Thresholds. For the first method, maximum points of selected blocks are used to fit the distribution. For the second, all values above a high level u are used. Each method can suit different types of need.

Block Maxima

The block maxima method is very sample hungry, since it consists on splitting the data in n-size chunks, and using only one element in each chunk. The choice of n is affected by bias variance tradeoff (bigger blocks diminishes bias, more blocks diminishes variance). Finally, there is no specific criteria for selecting n. However, on datasets in which a partition goes naturally, this could actually be used in lack of better information (for instance, when studying extremes in cyclic phenomena).


The Peaks-over-thresholds method is known to be less sample hungry. It consists on setting a threshold u, and using all points in the dataset above that level for constructing the model. And fortunately there exists a (graphical) statistic method for selecting u
via finding the smallest point of linearity in a sample mean excess plot. Once a model is fitted for threshold u, we can infer a model for a threshold v > u. Yet applying the graphical method is tricky since the sample mean excess plot is rarely visibly linear. One should always analise the data for several thresholds. And this method will not give a (proven) optimal choice.

So, what’s the veredict?

You don’t need to stand on the line between serendipity and vicissitude as long as you have sufficient data, thanks to the mathematical developments in extreme value theory. This blog intends to tease you so that you look more into this topic yourself, if this sounds like something that could help you in your professional life. The author recommends looking at the wonderful chapter on EVT at They also have examples in R, that you can play with yourself!

The featured image in this post was borrowed from here.

Book review: “The (mis)behaviour of Markets”

Let’s start by saying that this blog was NOT paid by the publisher or the author of this book to make this review. It was a mere consequence of my search engine knowing more or less what the writer is in the mood of reading at the right time of the week. And for the same reasons now there is one too many mason jars on my kitchen counter, but that is a completely different story.

The first time the I heard about Mandelbrot, I was a teenage girl in the process of finding her identity. So instead of doing this, this or this, the “thing” to do was having desktop backgrounds that no one else had. And the answer was a program that generated crude graphics using different types of fractals among which was the Mandelbrot set. Many hours and desktop backgrounds later it all went into the “stuff to remember for later” part of my long term memory, and 20 years later while attending a course on evolutionary computing things started clicking together. And 5 years after that course, my search engine shows me this book. “Mandelbrot eh? like the power-law and the terrains for games? what is that guy doing with markets?”. Unfortunately, he is dead. Since 2010. But the book was a good read.

A main goal of the book is to show you another way of thinking about randomness in finance. Especially, about the assumptions behind most financial methods. In fact it does show you another way of thinking about randomness in everything, but that is part of the charm. And the nicest thing about this book is HOW it takes you to that other point of view.

The book is divided in three parts, titled (respectively) “The Old Way”, “The New Way” and “The Way Ahead”. Part 1 and Part 2, the writer pulls you into the biographical and social context of the people behind most significant pillars of modern finance. The historical recaps felt very vivid. It was very easy to relate yourself to the people making those discoveries. So you could understand why things were made the way they were currently. And then, thru the whole book, you had that feeling in the stomach that something was about to happen. You get glances of the place where the book wants to take you, like mirages in the desert. I particularly enjoyed a metaphor of his in which the distribution of the location of the lost arrows of a blindfolded archer was used to compare the mild randomness that could be expressed with a Gaussian distribution vs the wild randomness that could be expressed with Cauchy.

Part 3 has only two chapters, and they can be read without the previous parts: “Ten Heresies of Finance”, in which he summarizes many points that have been scattered as breadcrumbs thru the previous sections, and “In the lab”, in which he presents the work of other people with similar observations as his.

In my opinion, the corner-stone message of this book is that markets behave like turbulent processes with bursts, pauses and fractally scaled parts, rather than gaussians, and critical events tend to cluster other small events around them and cause turbulence. Looking at markets as turbulent phenomena has interesting consequences. For instance, big gains and losses concentrate into smaller time slots. The biggest fortunes are made and lost with price variations right before and during such critical events. With this in mind, arbitraging becomes a more significant driving force in the market, so price differences become more interesting than average prices themselves. He also looked at the fact that prices change in leaps, rather than in smooth glides. So timing becomes quite important.

But the former observations also hint at the fact that risk has been underestimated in markets. Since market behaviors have busts and pauses that makes them vary more wildly than gaussians, working with “averages” alone in stock markets is indeed risky and inappropriate. It is more meaningful to work with out-of-the-average values when estimating risk.

And it does not stop there. Another strong statement on this book is that markets everywhere work alike, but endogenous and exogenous factors can make a difference. Bubbles appear as a consequence of interactions and turbulence, and patterns can change from a moment to the other. Markets can exhibit dependencies without correlation. For instance, the fact that the prices went down yesterday does not mean that they will fall today. But we could have the case in which having our prices plummet by “x” percent today will increase the odds of another “x” percent move later. So, we could have a strong dependency, without a correlation. Large changes tend to be followed by more large changes. Volatility clusters. Yet spurious patters appear everywhere, and that is just another consequence of turbulence and wild randomness. Humans tend to look for patterns and may find them even if they are not there.

For reference, the complete title of the book is “The Misbehaviour of Markets: A Fractal view of Risk, Ruin and Reward” by Benoit Mandelbrot and Richard L. Hudson. It has won a Financial Times award for the most innovative book in business and finance published worldwide.

I will now leave you with something to look at. Embrace the turbulence, and see you next post!

The featured image was taken from here.

Using higher moments to your advantage: Kurtosis

Or “can daedalean words actually help make more accurate descriptions of your random variable? Part 1: Kurtosis

Is a common belief that gaussians and uniform distributions will take you a long way.
Which is understandable if one considers the law of large numbers: with a large enough number of trials, the mean converges to the expectation. Depending on your reality, you may not have a large enough number of trials. You could always calculate an expected value out of that (limited) number of trials, and for certain conditions on your random variable, you could get a bound on the variation of that expectation, together with a probability of that bound. And that alone is a lot of information on how your random variable might behave.

But you could do a bit more with those expected value calculations! Especially if you have a reason to believe that the best fit for your random variable might not be normal, yet you simply don’t have enough samples to commit to a model right now. Could it be that the apparent misbehavior of a random variable is actually a feature instead of a bug?

Kurtosis: Peakyness, Broad-shoulderness and Heavy-tailedness

Kurtosis (we can use κ interchangeably to denote it) describes both the peakyness and the tailedness in a distribution. κ can be expressed in terms of the expectation:

β_2 = E(X-μ)^4 / (E(X-μ)^2)^2

Or in terms of the sample means (a.k.a. sample κ):

b_2 = (Sum((Xi- μ{hat})^4)/n) / (Sum((Xi-μ{hat})^2)/n)^2

Kurtosis represents a movement of mass that does not affect the variance and reflects the shape of a distribution apart from its variance. Is defined in terms of an even power of the standard score, so is invariant under linear transformations. Normal distributions have the nice property of having κ = 3. Take the following re-enactment of the wikipedia illustration for normal distributions:


As you may have guessed from the labels, we have plotted three gaussians with different σ^2 (0.2, 1, 5), all with mean zero. If we draw samples from these distributions and apply the sample κ expressions, we will in all cases get κ ~3. All these three distributions have the same Kurtosis. You can try with sample sizes as small as 9, and move all the way to 1000, 3000 and more. You will still get κ ~3. This is why there is a relative measure called Excess Kurtosis, which is the kurtosis of a distribution with respect that of a normal distribution. And as you may have guessed, is built with  κ – 3.

Distributions with negative  κ – 3 are known as platykurtic. The term “Platykurtosis” refers to the case in which a distribution is Platykurtic. Relative to the normal, platykurtosis occurs with light tails, relatively flatter center and heavier shoulders.

Distributions with excess kurtosis (with a positive κ – 3) are known as leptokurtic. Just as with Platykurtosis, the term “Leptokurtosis” refers to the case in which a distribution is Leptokurtic. When Leptokurtosis occurrs, heavier tails are often accompanied by a higher peak. Is easier to think of hungry tails eating variance from the shoulders/tips or “thinning” them (the greek word “Lepto” means “thin, slender”).

So excess kurtosis on most cases captures how much of the variance would go from the shoulders to the tails and to grow the peak (leptokurtosis) or from the tails and the peak height into the shoulders (platykurtosis). Leptokurtosis can either occur because the underlying distribution is not normal, or because outliers are present. So if you are sure that your underlying phenomenon is normal yet you experience leptokurtosis, then you can either re-evaluate your assumptions, or consider the presence of outliers.

Detecting Bimodality

According to De Carlo (first reference of this post) Finucan in his 1964 “A note on Kurtosis” noted that, since bimodal distributions can be viewed as having heavy shoulders, they should tend to have negative kurtosis. Moreover, Darlington in “Is kurtosis really ‘peakedness’?” (1970) argued that excess kurtosis can be interpreted as a measure of unimodality versus bimodality, in which with large negative kurtosis is related to bimodality, with the uniform distribution (κ = -1.2) setting a dividing point.

Discussing Darlington’s results and Finucan’s note would probably require another post… Let’s see how we go for that later :). For now, I think the following plot from wikipedia shows all the former behaviours really nicely:



Kurtosis for assessing normality

Gaussians are like sedans: you can drive them in the city, you can drive them in the highway, you can drive them to the country. They will take you thru many roads. But sometimes you need a proper truck. Or fabulous roller-skates. And knowing when would you need either can save you from toiling.

Thanks to the higher moments of a distribution, is possible to make relatively cheap tests for normality, even with really small sample sizes. If you wanted to compare the shape of a (univariate) distribution relative to the normal distribution, you can by all means use Excess Kurtosis. Just calculate sample kurtosis for the data you are studying. If it deviates significantly from 3, even for a small number of samples, then you are either not dealing with a gaussian distribution, or there is too much noise for such a sample size.

Multivariate normality & multivariate methods

The assumption of normality in the multivariate case prevails in many application fields, often without verifying how reasonable it would be in each particular case. There is a number of other conditions to check for multivariate normality. A property of multivariate normality is that all the marginals are also normal, so this should first be checked (this can be quickly performed with excess kurtosis). Also, linear combinations of the marginals should also be normal, squared Mahalanobis distances have an approximate chi-squared distribution (often q-q plots are used for this purpose), and we can go on. You could even use this web interface to an R package dedicated exclusively to check multivariate normality.

Kurtosis can affect your study more than you think. When applying a method for studying your random variable, keep in mind that some methods can be more or less affected by the higher moments of the distribution. Some methods are better at dealing with some skewdness (for another post) while some tests are more affected by it. Similarly with Kurtosis. Tests of equality of covariance matrices are known to be affected by kurtosis. Analyses based on covariance matrices (shout out PC Analysis!) can be affected by kurtosis. Kurtosis can affect significance tests and standard errors of parameter estimates.

Final words

The contents of this post have been influenced by:

Is too bad that I could not find a way to read or buy Finucan’s note on kurtosis, because it seemed interesting. If anyone has access to it, please comment and let me know how to get it.

On types of randomized algorithms

There is more than Monte Carlo when talking about randomized algorithms. It is not uncommon to see the expresions “Monte Carlo Approach” and “randomized approach” used interchangeably. More than once you start reading a paper or listening to a presentation, in which the words “Monte Carlo” appear on the keywords and even on the title, and as you keep reading/listening, you notice in the algorithm description a part in which the runs with incorrect outputs are discarded until only correct outputs are given… Which effectively turns a Monte Carlo into a Las Vegas. Running time is no longer deterministic-ish, and the algorithm will only provide correct answers. So, they say/write “Monte Carlo” here, “Monte Carlo” there, and when it comes to what actually happens, you are not really in Monte Carlo, and you might be in Vegas, baby.

You can do a small check yourself. Use your favorite search engine with “monte carlo simulation” and “las vegas simulation”. Depending on your search engine, you may get something looking more or less like this:
aprox 84 200 000 results (0,32 seconds) – monte carlo simulation

aprox   9 560 000 results (0,56 seconds) – las vegas simulation

Almost a whole order of magnitude more in relevant results from Monte Carlo, in almost half the time. Now, if you used something like google, you may or may not get relevant results in your first screen, depending on how well google knows you. Of course, there are the cases of “monte carlo simulation” that most likely refer to other topics e.g. Monte Carlo integration and the like, the same with “las vegas simulation“. And depending on how much your search engine knows your habits, you might not get any results of Las Vegas simulations right away.

And probably the next thing you may do, is to do a quick wikipedia search of “Las Vegas algorithm”. You may possibly read it, and find in the end “Atlantic City algorithm”. And maybe you may want to follow that little wikipedia rabbit hole, and end up reading about it. And then no one can save you.


The featured image in this post is from here.