Studying people and what they purchase

Or “Serpent people”.

The old saying goes something along the lines of “you are what you eat”. In modern societies, the sentiment shifts towards “you are what you buy”. Which may include food, of course.

One important difference compared with a decade ago is how much can be measured about the way a user consumes a product. Lately, it seems like every little thing that we do can be used to build a projection of ourselves from our habits. And in the end, that projection can be used to poke the reptilian parts of your reward circuitry so they release the right cocktail of hormones. A cocktail that makes you choose bright red over dull gray, reach for your wallet or click “I accept the terms and conditions”.

Through the years, recipes for such cocktails have been perfected by different disciplines. As an educated consumer, actively tasting those recipes in modern products can be as interesting as wine tasting, minus the inebriation. This is the first of a series of posts intended to help you be more aware of your own reward circuitry by using interpretations that different algorithms build from observing your measurable actions.

Another intention of this series of blog posts is to show that these methods are not necessarily:

  • Absolute
  • Inherently objective
  • Infallible

And they are definitely NOT suitable for blind use, e.g. “press the Analytics Button and have the neural network tell me everything”*. If anyone promises that without disclosing any assumptions, make sure to ask LOTS of questions.

The “Serpent people” series will present some textbook representations suitable for modeling this problem, discuss which aspects each of them reflects best, and try out different open-source libraries on artificially generated models of “people according to what we know about them”.

Some of that material is already available, in a very fluid shape, in this notebook if you can’t wait to play with it yourself :). The representation in there is simply what I considered natural for the problem itself. I plan on elaborating that representation with classics such as Frequent Itemset Mining and Associative Classification. For those, you can start by checking out Chapter 10 of “Data Mining” by Mehmed Kantardzic.
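If you want a first taste of what frequent itemset mining looks like in code, here is a minimal sketch; it assumes the mlxtend library and a made-up basket dataset, neither of which comes from the notebook above:

```python
# A minimal sketch of frequent itemset mining, assuming the mlxtend library
# and a hypothetical basket dataset (not part of the post's notebook).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Hypothetical purchase baskets, one list of items per person.
baskets = [
    ["bread", "milk", "apples"],
    ["bread", "milk"],
    ["milk", "apples", "coffee"],
    ["bread", "coffee"],
    ["bread", "milk", "coffee"],
]

# One-hot encode the baskets into a boolean item matrix.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(baskets).transform(baskets),
                      columns=encoder.columns_)

# Itemsets appearing in at least 40% of the baskets.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
print(frequent)
```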

And that’s the teaser for what is to come. For now, I will leave you with this David Bowie song. Granted, it’s “Cat People” instead of “Serpent People”, but it’s pretty cool still.

See you next post!

* This sentence is partially credited to one of my friends while discussing technology sales gore.

Looking into Maximum Spacing Estimation (MSP) & ML.

Maximum spacing estimation (MSE or MSP) is one of those not-so-well-known statistical tools that are good to have in your toolbox if you ever bump into a misbehaving ML estimation. Finding material about it is a bit tricky, because if you search for MSE, “Mean Squared Error” will come up as one of the top hits. The Wikipedia page will give you a pretty good idea, so click here to check it while you are at it.

Here is a summary for the lazy: MSE, or maximum spacing estimation, is about choosing the parameters of a DF so that the geometric mean of the “spacings” in the data is maximized. Such “spacings” are the differences between the values of the cumulative distribution function at neighbouring data points. This is also known as maximum product of spacings estimation, or MSP, because that is exactly how you calculate it. The idea is to choose the parameter values that make the observed data as uniform as possible, for a specific quantitative measure of uniformity.
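In symbols, if $x_{(1)} \leq \dots \leq x_{(n)}$ is the ordered sample and we set $F(x_{(0)};\theta) = 0$ and $F(x_{(n+1)};\theta) = 1$, the MSP estimate maximizes the average log-spacing:

$\hat{\theta} = \arg\max_{\theta} \frac{1}{n+1} \sum_{i=1}^{n+1} \ln \left( F(x_{(i)};\theta) - F(x_{(i-1)};\theta) \right)$

which is equivalent to maximizing the geometric mean of the spacings.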

So we can explain what happens here by means of an exercise, for which we made a notebook on our github (click here). Suppose we have a distribution function. We start with the assumption of a random variable X with a CDF $F(x;\theta_0)$, where $\theta_0 \in \Theta$ is an unknown parameter to be estimated, and from which we can take iid random samples. The spacings over which we will estimate the geometric mean, $D_i$, are the differences between $F(x_{(i)};\theta)$ and $F(x_{(i-1)};\theta)$, for $i \in [1, n+1]$. For the giggles, let’s say our DF is a Pareto I, with shape parameter α=3.0 and a left limit at 1. This α is the θ parameter that we intend to estimate. We can draw some samples from our distribution and construct a CDF out of the samples. We can do a similar process for other values of the shape, and plot those CDFs together, which will end up looking like this:

[Figure: empirical CDFs of Pareto I samples for several values of the shape parameter α]
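Here is a minimal sketch (not the exact notebook code) of how such a plot can be produced with scipy and matplotlib:

```python
# A minimal sketch of drawing Pareto I samples and plotting their empirical
# CDFs for a few values of the shape parameter alpha.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

xs = np.linspace(1, 10, 500)

for i, alpha in enumerate([1.0, 2.5, 3.0, 5.0]):
    # Pareto I with shape alpha and left limit at 1 (scipy's default scale).
    samples = stats.pareto.rvs(b=alpha, size=1000, random_state=i)
    # Empirical CDF: the fraction of samples at or below each x.
    ecdf = np.array([(samples <= x).mean() for x in xs])
    plt.plot(xs, ecdf, label=f"alpha = {alpha}")

plt.xlabel("x")
plt.ylabel("empirical CDF")
plt.legend()
plt.show()
```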

The first thing you will observe is that the bigger the alpha, the “closer to each other” these distributions look. For instance, the CDFs for α=5.0 and α=2.5 are much closer to each other than the CDFs for α=2.5 and α=1.0. That is an interesting fact to consider when using this estimator: it will probably be easier to get a “confused” result the higher the α parameter gets. So α=3.0 is not exactly the easiest choice of value for “messing around and looking at how our estimator behaves”, but it is not too difficult either.

Now, back to the estimator. In this post we made a very simple exercise to look at how ML and MSE behave when estimating the α parameter of a Pareto I distribution. The choice of shape parameter and distribution is completely arbitrary; in fact, I encourage you to take my code and try other distributions and other values yourself. Also, to make my small laptop’s life easier, I selected a subset of α values over which the search was made. Then, we obtained the best scores of each method for different sample sizes:

10, 50, 100, 300, 500 and 10000

For each sample size and each method, we repeated the estimation 400 times. A box-and-whisker plot of these estimations, together with a scatterplot for each sample size and each method, can be seen here:

[Figure: box-and-whisker plots and scatterplots of the estimated α for each sample size and each method]
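If you want a feel for the two estimators without opening the notebook, here is a minimal sketch (not the exact code behind the plots) that grid-searches candidate α values and compares the ML and MSE scores on one Pareto I sample:

```python
# A minimal sketch comparing maximum likelihood (ML) with maximum spacing
# estimation (MSE/MSP) for the shape of a Pareto I with left limit 1,
# using a grid search over a subset of candidate alpha values.
import numpy as np
from scipy import stats

def ml_score(samples, alpha):
    # Log-likelihood of Pareto I(alpha) with scale 1.
    return np.sum(stats.pareto.logpdf(samples, b=alpha))

def mse_score(samples, alpha):
    # Log of the product of spacings: differences of the CDF at the sorted
    # samples, padded with 0 and 1 at the ends.
    cdf = stats.pareto.cdf(np.sort(samples), b=alpha)
    spacings = np.diff(np.concatenate(([0.0], cdf, [1.0])))
    return np.sum(np.log(np.clip(spacings, 1e-12, None)))

candidates = np.linspace(1.0, 6.0, 101)  # subset of alpha values to search
samples = stats.pareto.rvs(b=3.0, size=100, random_state=0)

alpha_ml = candidates[np.argmax([ml_score(samples, a) for a in candidates])]
alpha_mse = candidates[np.argmax([mse_score(samples, a) for a in candidates])]
print(f"ML estimate: {alpha_ml:.2f}, MSE estimate: {alpha_mse:.2f}")
```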

The dotted line at α=3.0 marks the shape parameter of each original sample. As you can see here, ML behaved much better than MSE for small sample sizes. In both cases, for small numbers of samples there was some skewness towards bigger values, which is expected since the CDFs of this distribution get closer the higher the value of the shape parameter. ML also behaved better than MSE for 50, 100 and 300 samples. Now, this was only one result, for this distribution and for α=3.0. A more definitive quantitative evaluation would require looking at more distributions, and at different points of those distributions. There was a paper suggesting that “J”-shaped distributions are the strong point of MSE. I guess a more thorough quantitative evaluation of MSE vs ML would be in order for a decision here. And it would also be worthy of a journal paper, in addition to a small blog post such as this one ;). You can cite me too if you like to use my code ;).

When I first found this estimator, I have to admit that it caused a bit of infatuation in me. Some mathematical concepts carry themselves with such beauty that you can’t help feeling an attachment, and you see everything about them through rose-tinted lenses. It made me wonder why on earth this concept is not as well known as ML. It is not that much more expensive, depending on how you calculate it. For ML you need a density function, while for MSE all you need is a cumulative distribution function. And that alone is a powerful point, because all random variables have a CDF, but not all of them have a PDF. Granted, most distributions you will work with will probably have a PDF. Granted, ML is the workhorse of estimators: many toolboxes have it implemented, it is super easy to teach to undergrads, and it works well. In fact, it worked better than MSE for this particular example. So in spite of all the beauty, maybe the fitness function for mathematical concepts to last for posterity is not beauty or elegance, but applicability.

BONUS POINTS IF YOU ARE LOOKING TO WORK WITH US: Blow my mind. Try this exercise in other distributions. Make other comparisons. Make me stop loving MSE. Or make me love it even more. I have to do something about the butterflies in my stomach! You are my only hope!

As always, you can find the code for generating the plots of this post in our notebook (click here).

Image taken from stocksnap.io.

Special Announcement: We are hiring!

image from stocksnap.io/photo/DDYC9U7O2P

Hi all! You may have been following the blog for some time, or this may be your first visit. But just the fact that you are here, taking some time to read and learn new things, is awesome. Well, guess what: we want to hire awesome people who enjoy learning new things, more specifically about data science.

Now, what do we mean by “data science”? There are a number of tasks that every person working with data must have done: at some point you must have decided what kind of data you needed for your task; you must have had an idea of how to collect and transform that data into a form you can apply some nice math/statistics to; and finally you must have found a way to summarize interesting aspects of the whole ordeal.

Some people find different parts of this process more comfortable than others. Some people are better at communicating and telling stories, while others are more into applying math to the data and seeing what comes out; others just love designing and optimizing experiments; others love building tools for data extraction and conditioning. And the thing is: all of this is data science. All of it. So when a company wants to hire a person in data science, it is important to specify what they want, because these tasks require different sets of abilities which are very difficult to find in a single individual.

We are putting together a team. The engineers we are looking for thrive as team players and have a genuine interest in data analysis, statistics and mathematics. If you are one of them, you have built a sufficiently good programming base (preferably in Python and/or R) that allows you to learn and test ideas by yourself. We want you to be free to mold your mind into what you want, and have a good time.

The Team Roles: 

Data Analyst: Is concerned with looking at the data and the context around the data, and telling a story by analyzing both. This engineer knows which type of visualization is best to use, for which audience, and how to build it. This engineer can also choose the best statistics to summarize a dataset, and can help organizations build suitable Key Performance Indicators.

Data Engineer: Is concerned with extracting and molding data, recommending and building data hygiene methods, and choosing computing frameworks to work with data. Data nowadays can be found in all kinds of formats and can be stored in different ways; it can be streaming or offline. A client’s needs may be satisfied by optimizing how your programs process the data, or they may push the envelope of what can be accomplished by CPUs, possibly making this engineer look into GPUs. This engineer is responsible for providing quality data over which data analysis and data science can be applied.

Data Scientist: Is concerned with building quantitative models out of data, setting and verifying the assumptions for the models, testing and maintaining models, and choosing data collection strategies and instruments. This engineer knows which things can go south very quickly when the underlying assumptions of a model are no longer valid, and is responsible for clearly communicating to the team what those assumptions are.

Must-have qualifications:

  • M.Sc. in engineering with a focus on one or more of the following disciplines: statistics, mathematics, computer science, applied physics, mechatronics, electrical/electronic/nuclear engineering.
  • Fluent English.
  • Experience in Python or R.

The Attitude:

  • You like having fun!
  • You are friendly!
  • You respect and trust your team as much as your own knowledge.
  • You shoot for the stars, yet can gracefully land on the moon if needed.
  • You learn on your own, yet you ask for help when you need to.
  • You don’t take all statements for facts: when reason exists, you verify ground truths and communicate your findings to those concerned.

Nice-to-have:

  • Experience from working in teams
  • Customer-oriented experience
  • Experience with community projects (Github, CRAN, StackExchange community, etc)

If this sounds a lot like you, do not hesitate to apply with your CV and cover letter here:

https://www.linkedin.com/jobs2/view/254519957

Kolmogorov-Smirnov for comparing samples (plus, sample code!)

The Kolmogorov-Smirnov test (KS test) allows you to compare two univariate, continuous distributions by looking at their CDFs. Both CDFs can be empirical (two-sample KS), or one can be empirical and the other built parametrically (one-sample KS).

Client: Good Evening.

Bartender: Good evening. Rough day?

Client: I should have stayed in bed…

Bartender: Maybe we have just the right thing for you. How about a Kolmogorov-Smirnov?

Client: Make it two-sample, please.

The null hypothesis for the one-sample case is that the empirical distribution is drawn from the reference distribution (which is usually parametric). For the two-sample test, the null hypothesis is that the two samples were drawn from the same distribution.

The actual value of the KS statistic is the largest of all the differences between the CDFs in the test. An expression for this statistic is:

$K_n = \sup_{x} |F_{n}(x) - F(x)|$ (1)

In the literature, it is not uncommon to see it expressed like this:

$K_n = \sqrt{n} \, \sup_{x} |F_{n}(x) - F(x)|$ (2)

Some of you may have spotted the similarity between these expressions and the Glivenko-Cantelli theorem (a.k.a. the fundamental theorem of probability to some people). To refresh it a bit, here is Glivenko-Cantelli for you:

$\|F_{n} - F\|_{\infty} = \sup_{x \in \mathbb{R}} |F_{n}(x) - F(x)| \rightarrow 0$ almost surely.

And notice the “almost surely”. And “almost surely” will have to do, because this theorem is such a cornerstone of statistics. Other people have made some interesting discoveries around Glivenko-Cantelli. For instance, the DKW inequality draws bounds on the convergence of Glivenko-Cantelli by bounding the probability that $F_{n}$ differs from $F$ by more than a given constant $\epsilon > 0$. This result carries over to the KS statistic, and we get an estimate for its tail. And then some people start building even more interesting bounds; for instance, take these guys.
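For reference, the DKW inequality (with the tight constant due to Massart) states that for every $\epsilon > 0$:

$P\left( \sup_{x \in \mathbb{R}} |F_{n}(x) - F(x)| > \epsilon \right) \leq 2 e^{-2 n \epsilon^{2}}$

which is exactly the kind of tail estimate mentioned above.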

And well, if you simply want to start playing with the KS statistic, there is a short code snippet in our notebook that you can use to start comparing samples to each other, and samples to the DFs contained in the stats package of scipy.
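As a starting point, here is a minimal sketch, using only scipy’s built-in tests, of the two flavours on some made-up normal samples:

```python
# A minimal sketch of the two flavours of the KS test with scipy.stats:
# two-sample, and one-sample against a parametric reference distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=300)
b = rng.normal(loc=0.2, scale=1.0, size=300)

# Two-sample KS: are a and b drawn from the same distribution?
stat_2s, p_2s = stats.ks_2samp(a, b)

# One-sample KS: is a drawn from a standard normal reference distribution?
stat_1s, p_1s = stats.kstest(a, "norm", args=(0.0, 1.0))

print(f"two-sample: D = {stat_2s:.3f}, p = {p_2s:.3f}")
print(f"one-sample: D = {stat_1s:.3f}, p = {p_1s:.3f}")
```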

Enjoy!

The featured image was taken from here.

Trying out Copula packages in Python – II

And here we go with the copula package in (the sandbox of) statsmodels! You can look at the code first here.

I am in love with this package. I was in love with statsmodels already, but this tiny little copula package has everything one can hope for!

[Image: “suddenly the world seems such a perfect place”, summarizing my feelings about this package]

First Impressions

At first I was not sure about it. It looks deceptively raw, so one can understand why it would not be fair to compare it with other packages in statsmodels. After googling for examples, I could not find any, not even in the documentation of statsmodels. In fact, you had to dig deep to find that this piece of code even existed.

There are no built-in methods to calculate the parameters of the Archimedean copulas, and no methods for elliptical copulas (they are not implemented). However, elliptical copulas are quite vanilla and you can implement the methods yourself. We missed the convenience of selecting a method for transforming your data into uniform marginals, but you can also implement that yourself: either fit the parameters of a scipy distribution and then use the CDF method of that distribution on your samples, or work with an empirical CDF. Both methods are implemented in our notebook.
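For illustration, here is a minimal sketch of both routes; the gamma distribution and its parameters are made up for this example and are not taken from our notebook:

```python
# A minimal sketch of two ways to transform a sample into (approximately)
# uniform marginals before feeding it to a copula: via a fitted scipy
# distribution's CDF, or via an empirical CDF based on ranks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=500)  # hypothetical marginal data

# Parametric route: fit a candidate distribution, then push the samples
# through its CDF.
params = stats.gamma.fit(x)
u_parametric = stats.gamma.cdf(x, *params)

# Empirical route: ranks scaled into (0, 1), which avoids exact 0s and 1s.
u_empirical = stats.rankdata(x) / (len(x) + 1)

print(u_parametric[:5], u_empirical[:5])
```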

So, in order to actually use the functions in this package, you have to write your own code for getting the parameters of your Archimedean copula (we borrowed some code from the copulalib package for that purpose), for transforming your variables into uniform marginals, and for actually doing anything with the copula. However, as it is, it is quite flexible. It is good that the developers decided to keep it anyway.

Hands on!

Alright, check out our notebook at github.