## Life in a computer

Life is complicated, you probably know that. If we take a magnifying glass and look at a living thing from a chemical and biological perspective, it is astonishingly complicated. In this blog post, I will walk through an example of a process that occurs in all living things and how we can study this process with a computer. In fact, I will demonstrate that by using clever approximations, simple statistics and robust software, life does not have to be complicated.

## Setting the stage

The process that we will look at is the transport of small molecules across the cell membrane. That sounds complicated, I know. So let me explain a little bit more. Each cell, in every living organism, is surrounded by a membrane that is essential for cell viability (see Figure below). It is also important for the cell to transport small molecules across the membrane. These can be, for example, nutrients, waste products or signals.

If we can understand this process, we can utilize it to our advantage. We can design new drug molecules that enter the cell, fix the broken cell machinery and cure diseases. We can also design better biofuel-producing cells and assess environmental effects, but that is another story.

Here, we want to estimate how fast a molecule is transported. We will use the following assumptions that make the modeling much easier:

• We will assume that the small molecules cross the membrane by themselves
• We will approximate the membrane as a two-phase system, ignoring any chemical complexity
• We will model one phase as water and the other as octanol, an alcohol (see Figure above)

By making these assumptions, we can reduce our problem to estimating the probability of finding the small molecule in octanol compared to water. In technical language, we are talking about a free energy or a partition coefficient, but it is good to keep in mind that this is nothing but a probability.

## A multivariate regression model

We will use a very simple linear regression model to predict partition coefficients. You will soon see that this is a surprisingly good model.

In a regression model, we are trying to predict an unknown variable Y given some data X. This is done by first training the model on known Xs and Ys. In the lingo of machine learning, X is called the features: properties of the system from which we can predict Y. So, what are the features of our problem?

Recall that we are trying to predict partition coefficients of small molecules, so it is natural to select some features of the small molecules. There are many features available – over one thousand have been used in the literature!

We will use three simple features that are easy to compute:

1. The weight of the molecule
2. The number of possible hydrogen bonds (we will call it Hbonds)
3. The fraction of carbon atoms in the molecule (Nonpolar)

If a molecule contains many carbon atoms, it does not like to be in water and prefers octanol. If, on the other hand, the molecule can make hydrogen bonds, it prefers water to octanol.

Our regression equation looks like this:

"Partition coefficient" = c0 + c1"Weight" + c2"Hbonds" + c3"Nonpolar"

and our task is now to calculate c0, c1, c2, and c3. That is just four parameters – didn’t I say that life wasn’t so complicated!

We will use a database of about 600 molecules to estimate the coefficients (training the model). This database consists of experimental measurements of partition coefficients, the known Ys. To evaluate or test our model we will use some 150 molecules from another database with measured partition coefficients.

## Sympathy for Data

To make and evaluate our model, we will use the open-source software Sympathy for Data. This software can, for example, read data from many sources, perform advanced calculations and fit machine learning models.

First, we will read in a table of training data from an Excel spreadsheet.

If we double-click on the output port of the Table node, we can have a look at the input data.

The measured partition coefficient is in the Partition column and then we have several feature columns. The ones that are of interest to us are Weight, HA (heavy atoms), CA (carbon atoms), HBD (hydrogen bond donors) and HBA (hydrogen bond acceptors).

From HA and CA, we can obtain a feature that describes the fraction of carbon atoms, and from HBD and HBA, we can calculate the number of possible hydrogen bonds. We will calculate these feature columns using a Calculator node.

In the Calculator node, one can do a lot of things. Here, we are creating two new columns, Hbonds and Nonpolar, which are generated from the input table.
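
For readers without Sympathy at hand, here is a rough pandas sketch of what these two Calculator expressions amount to. The column names follow the table described above; the file name and the exact definitions (donors plus acceptors, carbons divided by heavy atoms) are my assumptions.

```python
import pandas as pd

# Hypothetical file name; the columns Weight, HA, CA, HBD, HBA and Partition
# are those described in the text above.
table = pd.read_excel("training_data.xlsx")

# Number of possible hydrogen bonds: donors plus acceptors (one plausible definition)
table["Hbonds"] = table["HBD"] + table["HBA"]

# Fraction of carbon atoms among the heavy atoms
table["Nonpolar"] = table["CA"] / table["HA"]
```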

Next, we are using the machine learning capabilities of Sympathy for Data to create a linear model. We are selecting the Weight, Hbonds, and Nonpolar columns as the X and the Partition column as the Y.
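
The Fit node essentially performs ordinary least squares. A scikit-learn sketch of the same step (not the actual Sympathy implementation), continuing from the table above, could look like this:

```python
from sklearn.linear_model import LinearRegression

# Feature matrix X and target y, using the columns computed above
X = table[["Weight", "Hbonds", "Nonpolar"]].values
y = table["Partition"].values

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)   # c0
print(model.coef_)        # c1, c2, c3
```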

If we double-click on the output port of the Fit node, we can see the fitted coefficients of the model.

Remember that many hydrogen bonds tell us that the molecule wants to be in the water (a negative partition coefficient) and that many carbon atoms tell us that the molecule wants to be in octanol or the membrane (a positive partition coefficient). Unsurprisingly, we see that the Hbonds column contributes negatively to the partition coefficient (c2=–1.21) and the Nonpolar column contributes positively to the partition coefficient (c3=3.91).

How good is this model? Let’s read the test data and see! There is a Predict node in Sympathy for Data that we can use to apply the model to the X data from the test set.

By using another Calculator node, we can compute some simple statistics. The mean absolute deviation between the model and experimental data is 0.86 log units, and the correlation coefficient R is 0.76. The following scatter plot was created with the Figure from Table node.
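
As a sketch of what the Predict node and the second Calculator node amount to outside Sympathy, assuming the test spreadsheet has the same columns as the training data (the file name is a placeholder):

```python
import numpy as np
import pandas as pd

# Read the test set and compute the same two derived features as before
test = pd.read_excel("test_data.xlsx")
test["Hbonds"] = test["HBD"] + test["HBA"]
test["Nonpolar"] = test["CA"] / test["HA"]

y_test = test["Partition"].values
y_pred = model.predict(test[["Weight", "Hbonds", "Nonpolar"]].values)

mad = np.mean(np.abs(y_pred - y_test))     # mean absolute deviation
r = np.corrcoef(y_pred, y_test)[0, 1]      # correlation coefficient
print(f"MAD = {mad:.2f} log units, R = {r:.2f}")
```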

This is a rather good model: first, the mean deviation is less than 1 log unit, which is about the experimental uncertainty. That is, we cannot expect or trust any lower deviation than this because of experimental error sources. Second, the correlation is significant and strong. It is possible to increase it slightly, to 0.85 – 0.90, by using more, or more advanced, features. But what is the point of that? Here, we are using a very simple set of features that we can easily interpret.

What’s next? You could use this model to predict the partition coefficient of a novel molecule. Say you are designing a new drug molecule and want to know if it has good transport properties. Calculate three simple features and plug them into the model, and you have your answer!

The data and the Sympathy for data flows can be obtained from Github: https://github.com/sgenheden/linear_logp

The picture of the cell was borrowed from www.how-to-draw-cartoons-online.com/cartoon-cell.html

## Bayesian Regression Using MCMC


## Introduction

Bayesian Regression has traditionally been very difficult to work with since analytical solutions are only possible for simple problems. Hence, the frequentist method called “least-squares regression” has been dominant for a long time. Bayesian approaches, however, have the advantage of working with probability distributions, such that the estimated parameters of a model are represented by a probability distribution (formally we talk about “likelihood” for parameters and “probability” for random variables).

For more complex models it is possible to solve the Bayesian problem numerically using, for example, MCMC (Markov-Chain Monte-Carlo). It is a computationally expensive method which gives the solution as a set of points in the parameter space, distributed according to the likelihood of the parameters given the data at hand.

## Regression Example

For this example, we are working with a linear model of the form $f(x) = a + bx + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ (normally distributed noise).

First, we need to generate some random data, choosing 50 samples where $a = 1$, $b = 2$, $\sigma^2 = 1$:
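
A minimal numpy sketch of this data generation step might look as follows; the x-range and random seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true, sigma = 1.0, 2.0, 1.0

# 50 noisy samples from y = a + b*x + eps
x = rng.uniform(-3.0, 3.0, size=50)
y = a_true + b_true * x + rng.normal(0.0, sigma, size=50)
```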

One simple way to solve the MCMC problem is the Metropolis-Hastings method. It is based on evaluating changes in the posterior likelihood one parameter at a time, doing a random walk that tries to stay in a region of high probability. If the likelihood is multi-modal, it is, of course, possible to get stuck in one mode.
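
Below is a bare-bones sketch of such a Metropolis-Hastings random walk for $(a, b)$, continuing from the data above and assuming the noise variance is known and the priors are flat, so the posterior is proportional to the Gaussian likelihood. The step size and starting point are illustrative choices.

```python
def log_likelihood(theta):
    # Log of the Gaussian likelihood (up to a constant) for the linear model
    a, b = theta
    resid = y - (a + b * x)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2

n_samples = 100_000
step = 0.1                                 # proposal standard deviation
samples = np.empty((n_samples, 2))
theta = np.array([0.0, 0.0])               # arbitrary starting point
logp = log_likelihood(theta)

for i in range(n_samples):
    for j in range(2):                     # update one parameter at a time
        proposal = theta.copy()
        proposal[j] += rng.normal(0.0, step)
        logp_new = log_likelihood(proposal)
        # Accept with probability min(1, exp(logp_new - logp))
        if np.log(rng.uniform()) < logp_new - logp:
            theta, logp = proposal, logp_new
    samples[i] = theta
```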

The resulting estimated likelihood given 100,000 samples for the linear regression is shown below where the red dot represents the highest likelihood, and the blue dot is the real parameters. The contours show a smoothed kernel estimate of the density of the distribution. Note that there is a slight covariance between parameters $a$ and $b$ which means that if you change one of the parameters the other has to change as well.

It turns out that the maximum likelihood of the MCMC estimate and the least-squares method give the same result, which is expected since maximum likelihood and least squares are equivalent in the presence of Gaussian noise.

This example has just been a simple demonstration of how to find a good fit for model parameters given some measured data. Based on the likelihood plots above we obtain some understanding of how sensitive the fit is to changes in the parameters and of whether the parameters are likely to be correlated.

In the presence of non-Gaussian noise and high-dimensional, complex models, MCMC might be your only way to obtain a solution at all, at the cost of long computation times.

## Introduction

In a classical design of experiments (DoE) you usually choose a set of points according to some rule and perform experiments to be able to, for example, create a response surface. But when the properties of the process you are trying to describe are difficult to understand, and the process can be destroyed if the wrong parameters are applied, we have to try something different.

One solution could be to build a predictive model each time a new sample has been taken and decide where to take the next sample given information taken from the updated model. I am going to show you how Gaussian Processes (see the introduction) can be used to collect samples efficiently. In short, the algorithm teaches itself how the process works by asking the correct questions based on what is known, slowly expanding its knowledge safely.

## Ingredients

The properties of the Gaussian Process rely on the chosen kernel. In this example, the squared exponential is used, which for $e^{-x^2}$ looks like:

This kernel is used to control the curvature of the estimated function.

The formula for estimating the conditional distribution of the Gaussian Process gives us an expression for the covariance:

$\text{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*)$

What is nice about this formula is that it is not dependent on any measurements. Given a kernel and a set of hyperparameters you only need to decide where you want to measure to understand what uncertainty you should expect when predicting the function. This fact makes it possible to design a space-filling experiment design for a given assumption of the properties of the model.
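
As a sketch of this point, the following numpy snippet evaluates the predictive covariance from the formula above for a squared exponential kernel. It needs only the measurement locations, never the measured values; the locations, length-scale and noise level are made-up values.

```python
import numpy as np

def kernel(A, B, ell=0.3):
    # Squared exponential kernel for 1-D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def predictive_cov(Xs, X, noise_var=1e-4):
    # cov(f*) = K(Xs,Xs) - K(Xs,X) [K(X,X) + sigma_n^2 I]^(-1) K(X,Xs)
    K = kernel(X, X) + noise_var * np.eye(len(X))
    Ks = kernel(Xs, X)
    return kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)

X = np.array([-1.0, -0.5, 0.5, 1.5])      # hypothetical measurement locations
Xs = np.linspace(-2, 2, 200)              # prediction grid
var = np.diag(predictive_cov(Xs, X)).clip(min=0.0)   # guard against tiny negatives
std = np.sqrt(var)
```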

Now recall that when we have some measurements, we can generate a model such as:

The gray area shows one standard deviation. When the standard deviation is small, we can make good predictions about the function while higher standard deviation indicates that we lack information. Given the four samples, we should be tempted to measure where the standard deviation is high. Just looking at the standard deviation as a function clarifies this thought:

If we have defined a limited domain on the horizontal axis, it should be straightforward to choose the point with the highest standard deviation. This is ok as long as the process cannot be destroyed for any set of parameters. Assume that we do not know exactly for which parameters we reach the safety limits; then we need to expand slowly from the measurements we know are safe. One way of doing this is to use the squared exponential kernel to include an allowed action radius. Drawing some kernels around the measurements looks like:

And if we take the maximum of these four functions we get:

Notice that the maximum is small between the two points on the left while the kernels are smeared together on the right since they are closer together. This function can be used to describe how safe it is to measure at a given set of parameters.

We can now combine the kernels with the standard deviation by taking the product ending up with:

Now we are encouraged to measure in the vicinity of each data point, but not too close and not too far away. Since the standard deviation is lower when points are closer to each other, exploration is often prioritized over refining.
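
Putting the pieces together, a sketch of this selection rule, reusing the kernel and predictive covariance from the snippet above, could be:

```python
def next_sample(candidates, X_measured):
    # Predictive standard deviation at each candidate point
    var = np.diag(predictive_cov(candidates, X_measured)).clip(min=0.0)
    std = np.sqrt(var)
    # "Safety" function: the maximum of the kernels centered on the measurements
    safety = kernel(candidates, X_measured).max(axis=1)
    # Measure where the product of uncertainty and safety is largest
    return candidates[np.argmax(std * safety)]
```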

## Simulation

We are going to try to generate a model of the function $f(x) = (x-0.5)^2 + (x+0.5)^2 + \sin(1.1 \pi x)$ on the interval $x \in [-2, 2]$ constrained by $f(x) < 4$ as seen here:

We need to have some knowledge about the process to be able to give the process one or several safe points to start from. We are going to start with $x_0 = 0$ and the goal is to obtain a sequence of $p_i = (x_i, f(x_i))$ for which we can predict the function with good precision.

To find a new candidate we need to have a set of candidates to choose from. The set of candidates is generated using a space-filling random algorithm, in our case the Sobol sequence.
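
For instance, a one-dimensional candidate set on $[-2, 2]$ could be generated with SciPy's Sobol generator; this is one possible implementation, and the sample count is arbitrary.

```python
from scipy.stats import qmc

# 128 quasi-random candidates, rescaled from [0, 1) to [-2, 2]
sobol = qmc.Sobol(d=1, scramble=False)
candidates = qmc.scale(sobol.random(128), -2, 2).ravel()
```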

Here is a sequence of 21 samples taken using the method described above.

Notice how the algorithm is cautious to start with and then starts expanding to the right and left, occasionally going back to refine the model instead of exploring. It also does not violate the condition $f(x) < 4$.

## Final Discussion

Progressive sampling is useful when the process you want to describe is nonlinear and when you need to avoid breaking any constraints. The method scales well to many dimensions and can be automated in actual physical testing environments. We can also handle noisy measurements which would result in slower propagation since the uncertainty of predictions would be larger.

We could add additional constraints which are tailored to the problem at hand, for example scaling the width of the kernel depending on the estimated magnitude of the gradient for each measurement or adding other functions which control how samples are chosen.

## Introduction

When trying to describe data using a function, you often know something about the process generating the data a priori. When you do not completely understand why the data looks the way it does but still want to describe it, you can start trying different things ad hoc.

A couple of years ago I began working with statistical methods to describe data. For points of a (continuous and smooth) function within a relevant neighborhood, it is not unreasonable to assume that the function values do not differ much. Points close to each other experience a high level of correlation, while points farther away are less correlated (but not necessarily so if we experience periodic behavior). Given a set of points at which we want to evaluate a function, we could introduce a multidimensional joint statistical distribution which describes how each point is related to all other points. The Gaussian Process is one way of doing this by assuming that all points are related to each other through a joint Gaussian distribution.

Gaussian Processes are usually mentioned as a “parameterless” regression method. Parameterless is valid to some extent, but in practice, there is a set of so-called hyperparameters which needs to be tuned. To obtain a numerical value of the amount of covariance between different points we can use a kernel function. The squared exponential is a common function to start with: $k\left(x, x'\right) = \exp\left(-\frac{\left\|x - x'\right\|^2_2}{2 \ell^2}\right)$. The parameter $\ell$ controls how smooth the functions described by the joint Gaussian distribution are and can be seen as a length-scale.

## Drawing Random Functions from the Prior Distribution

The first thing we can do is to draw some realizations from the distribution. We start by defining a set of points evenly distributed between -1 and 1 called $\mathbf{x}_*$. For these points, we want to obtain random function values distributed according to $f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x'})\right)$, where $m(\mathbf{x})$ is a mean value function and $K(\mathbf{x}, \mathbf{x'})$ the covariance function.

We need to calculate how each point in the set $\mathbf{x}_*$ is related to all other points given the kernel function

$K(\mathbf{x}_*, \mathbf{x}_*) = \left( \begin{array}{cccc} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{array} \right)$,

followed by generating the functions $f_* \sim \mathcal{N}\left(m, K\right)$ by calculating $f\left(x_*\right) = m + Lu$ where $u \sim \mathcal{N}\left(0, I\right)$ and $L$ is the lower triangular Cholesky factorization $K = LL^T$ (sometimes called the square root of matrices). We are setting $m = 0$ to have the distribution for the functions centered around zero.
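
A minimal numpy sketch of this sampling procedure could look as follows; the grid size, the jitter added for numerical stability, and the random seed are assumptions.

```python
import numpy as np

ell = 0.3
xs = np.linspace(-1, 1, 200)                       # the points x_*

# Covariance matrix from the squared exponential kernel
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell ** 2)

# Cholesky factorization K = L L^T (tiny jitter keeps K positive definite)
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(xs)))

# Three random functions f = m + L u with m = 0 and u ~ N(0, I)
rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(xs), 3))
```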

Generating three random such functions using $\ell = 0.3$ gives the following figure (for our random number generator) where the gray area shows +/- 1.96 standard deviations.

The generated functions are continuous and smooth, and their curvatures are similar. Interesting things start to happen when we throw prior knowledge into the calculations.

## Drawing Random Functions from the Posterior Distribution

Assume that we have measured a sample at $x = 0$ which is $y(0) = 1$. We would like to draw new random functions from the distribution given this measurement. The posterior functions are distributed according to

$\mathbf{f_*} | X_*,X,\mathbf{y} \sim \mathcal{N} \left( K(X_*,X)K(X,X)^{-1}\mathbf{y}, K(X_*,X_*)-K(X_*,X)K(X,X)^{-1}K(X,X_*) \right)$

where the symbols subscripted with * represent points we want to evaluate, and those without * are observations.
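
Sketching this in numpy for the single observation $y(0) = 1$, reusing the grid xs and the squared exponential kernel from the sketch above, and adding only a tiny jitter in place of measurement noise:

```python
def k(A, B, ell=0.3):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

X = np.array([0.0])                     # observed location
y = np.array([1.0])                     # observed value

Kxx = k(X, X) + 1e-10 * np.eye(len(X))  # K(X, X) with a small jitter
Ksx = k(xs, X)                          # K(X_*, X)

# Posterior mean and covariance from the formula above
mean = Ksx @ np.linalg.solve(Kxx, y)
cov = k(xs, xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)
```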

The same procedure is used as in the previous case resulting in the following figure. Our sample is shown as a black filled circle. All generated functions are crossing the point, and the gray area becomes small around the point telling us that it is certain that the function must contain the point.

Adding two more points restricts the possible outcome even more. The functions must pass through the three points since they are regarded as “truth.” Due to the limited curvature, the functions must focus on getting to the next point in-between.

Fine, generating random functions is nice, but what else can we do? Since the functions are drawn from a joint Gaussian distribution, we can calculate the expected value of the distribution.

## Regression Using the Expected Function

The expected function for our prior distribution example is just zero. For the case with one measurement point, we get a line which approaches zero at the edges and passes through the point.

The gray area gives us a measure of the certainty of the expected function: away from the measurement point the function could be just about anything, something which can also be seen in the examples above where we draw functions from the distribution.

When we have three measurements present the expected function is reasonably sure between the points, but the uncertainty quickly increases where there is no data present. The expected function converges to the mean function at the edges (which is zero in this case).

For computer experiments the measurements are exact, but for real world measurements, we can seldom exclude noise from the equation.

## Kernel Hyperparameters

We can add some slack to the regression by writing the kernel function as

$k_y\left(x_p, x_q\right) = \sigma^2_f \exp \left(-\frac{1}{2\ell^2}\left(x_p - x_q\right)^2\right) + \sigma^2_n \delta_{pq}$

where $\ell$ is the length-scale, $\sigma^2_f$ is the signal variance and $\sigma^2_n$ is the variance of the noise. These three parameters are called “hyperparameters” and are used to find the best fit of the expected function with respect to some measure.
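
As a small sketch, this kernel could be written as follows, with illustrative hyperparameter values. The noise term $\sigma^2_n \delta_{pq}$ only enters the covariance between observed points, so it is added as a diagonal matrix separately.

```python
def se_kernel(xp, xq, ell=0.3, sigma_f=1.0):
    # Squared exponential part with signal variance sigma_f^2
    sqdist = (xp[:, None] - xq[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sqdist / ell ** 2)

def k_y(x, ell=0.3, sigma_f=1.0, sigma_n=0.1):
    # Covariance of noisy observations at the points x
    return se_kernel(x, x, ell, sigma_f) + sigma_n ** 2 * np.eye(len(x))
```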

The figures below show what happens when the three parameters are varied. The function variance is the prior variance when no measurements have been obtained and is changed for each column; the noise variance is changed row-wise and the length-scale changes per image.


## Gaussian Processes in Practice

Gaussian Processes work well for many dimensions of data. The drawback is the expensive calculations involving a matrix inversion of the part of the covariance where the measurement data is located. If we have many measurements, the inversion quickly becomes intractable in terms of both computational complexity and memory requirements.

The inversion has to be performed many times since the hyperparameters are optimized in order to obtain the best fit for the regression. The optimization process is in general nonlinear since the log marginal likelihood is used.

One way to get around this problem is to truncate the covariance function to zero for small values, use sparse matrices, and solve the systems involving matrix inversions using conjugate gradient methods. There are other ways to approximate the problem as well.

We will describe other aspects of Gaussian Processes in future postings.