Functional Principal Component Analysis

Introduction

Functional Principal Component Analysis (FPCA) is a generalization of PCA where entire functions act as samples (X in L^2(mathcal{T}) over time a interval mathcal{T}) instead of scalar values (X in mathbb{R}^p). The FPCA can be used to find the dominant modes of a set of functions. One of the central ideas is to redefine the scalar product from beta^T x = left langle beta, x right rangle = sum_j beta_j x_j into a functional equivalent leftlangle beta, x rightrangle = int_{mathcal{T}} beta(s) x(s) ds.

Temperature in Gothenburg

Using data from SMHI, we are going to look at variations of temperature over the year in Gothenburg.

temperature_per_month_gothenburg

The data spans from 1961 to today and all measurements have been averaged per month and grouped by year. To be able to do an FPCA we need to remove the mean from the data.

temperature_per_month_gothenburg_no_mean

The first principal component of the data which explains 94% of the total variation is unsurprisingly the variation over seasons followed by the second and third principal components at 1.8% and 0.95% respectively.

 

Looking at the scores for the two first components gives us an idea which years differ the most from each other, i.e., the points which are farthest away from each other.

temperature_per_month_gothenburg_scores

Horizontally, the years 1989 and 2010 seem to be different for the first principal component. Apparently, the winter of 1989 was much warmer than 2010.

temperature_per_month_gothenburg_compare_1989_vs_2010

The years 1971 and 2004 are very close to each other which suggests that they should be very similar, and they are.

temperature_per_month_gothenburg_show_1971_vs_2004

The second principal component represents a mode where the late winter differs from the autumn/early winter between the years. The year 2006 had a cold early year and a warm late year while 2002 was warm to start with and cold at the end.

temperature_per_month_gothenburg_show_2002_vs_2006.png

Conclusion

The FPCA is a powerful tool when analyzing variations in functional data. It applies to multidimensional functional data as well. Functional data analysis, in general, is a powerful tool which also can be used to categorization where different clusters of, e.g., motion trajectories needs to be found.

Resource Constrained Scheduling

[redirect url=’http://combine.se/resource_comstrained_scheduling/’ sec=’0′]

Introduction

Scheduling constrained resources over time is a tough problem. In fact, the problem is NP-hard. One part of the problem is to find a feasible solution where all constraints are satisfied simultaneously. Another part is to also find a solution which also satisfies some measure of optimality. In practice, it is often sufficient to solve the problem partially such that the solution is better than the first feasible solution which has been found.

Solving combinatorial scheduling problems can be done using mixed-integer linear programming (MIP) or some heuristic approaches such as the Tabu search.

The Flow Approach

There are several ways to formulate the scheduling problem. The flow formulation presented by Christian Artigues in 2003 is one interesting approach. Assume for simplicity a process with two tasks:

full-graph-2-tasks

The nodes 1 and 2 are representing the tasks. “S” is the start task and no edges can go back here. “E” is the end task and only incoming edges are allowed. The directed graph shows all possible edges for this configuration per resource type.

The graphs tell us how resources can be transported between different tasks. Hence, “S” can give resources to task 1 and task 2 while also sending resources which are not needed directly to “E.”

Tasks can handle resources in three different ways:

  1. A task can consume a resource such that it disappears.
  2. A task can produce a new resource making it available for someone else to interact with.
  3. A task can pass a resource through for others to use (e.g., a person or a tool).

Assume that we have the following resources available in “S” to start with:

  • One operator.
  • Two tools.
  • Three raw materials.

Task 1 requires the following to be able to produce one product A.

  • One operator.
  • One tool.
  • One raw material.

Task 2 requires the following to produce one product B.

  • One operator.
  • Two tools.
  • One product like the one produced by task 1.
  • One raw material.

This problem can easily be solved by hand yielding:

solution-graph-2-tasks.png

The solution is an acyclic directed graph where resources flow from “S” to “E” fulfilling the constraints given by each node. Based on the solution task 2 must wait for task 1 to finish. Also, note that the edge between 2 and 1 has been removed since it is not needed.

Solving more complex resource constrained scheduling problems increases the available number of combinations rapidly making the problem harder and harder to solve as can be seen for the full graphs for 3, 6 and 10 tasks:

Scheduling

In some cases, it is possible to change the order between tasks or even execute them in parallel without violating any constraints. The graph tells us whether any task must follow any other task(s), either directly or through a chain of events. What is left is to take the duration of each task into consideration to produce the final time schedule.

schedule-2-tasks.png

Conclusion

The resource-constrained scheduling problem is relatively simple to solve if there is either excess of tasks or resources compared to the other. The number of combinations is much reduced then. When the distribution of resources is close to the number of tasks involved the problem gets much harder to solve.

Playing around with the measure of optimality can yield many different results depending on the formulation. For example, some tasks could be prioritized to be executed before others and tasks could be distributed over a time interval to maximize robustness for deviations and so forth.

The problem can be solved using mixed-integer linear programming (MIP) for which there are several mature solvers available on the market.

Solving Ordinary Linear Differential Equations with Random Initial Conditions

Introduction

Ordinary linear differential equations can be solved as trajectories given some initial conditions. But what if your initial conditions are given as distributions of probability? It turns out that the problem is relatively simple to solve.

Transformation of Random Variables

If we have a random system described as

dot{X}(t) = f(X(t),t) qquad X(t_0) = X_0

we can write this as

X(t) = h(X_0,t)

which is an algebraic transformation of a set of random variables into another representing a one-to-one mapping. Its inverse transform is written as

X_0 = h^{-1}(X,t)

and the joint density function f(x,t) of X(t) is given by

f(x,t) = f_0 left[ x_0 = h^{-1}(x,t) right] left| J right|

where J is the Jacobian

J = left| frac{partial x^T_0}{partial x} right|.

Solving Linear Systems

For a system of differential equations written as

dot{x}(t) = A x(t) + B u(t)

a transfer matrix can be defined

Phi(t,t_0) = e^{A(t-t_0)}

which can be used to write the solution as

x(t) = Phi(t,t_0) x(0) + int_{t_0}^{t} {Phi(t,s) B u(t) ds}.

The inverse formulation of this solution is

x(0) = Phi^{-1}(t,t_0) x(t) - Phi^{-1}(t,t_0) int_{t_0}^{t} {Phi(t,s) B u(t) ds}.

Projectile Trajectory Example

Based on the formulations above we can now move on to a concrete example where a projectile is sent away in a vacuum. The differential equations to describe the motion are

left{ begin{array}{rcl} dot{p}_{x_1}(t) & = & p_{x_2}(t) \ dot{p}_{x_2}(t) & = & 0 \ dot{p}_{y_1}(t) & = & p_{y_2}(t) \ dot{p}_{y_2}(t) & = & -g end{array} right.

where p_{x_1} and p_{y_1} are cartesian coordinates of the projectile in a two dimensional space while p_{x_2} is the horizontal velocity and p_{y_2} is the vertical velocity. We only have gravity as external force (-g) and no wind resistance which means that the horizontal velocity will not change.

The matrix representation of this system becomes

A = left( begin{array}{cccc} 0 & 0 & 1 & 0 \ 0 & 0 & 0 & 1 \ 0 & 0 & 0 & 0 \ 0 & 0 & 0 & 0 end{array} right)

with

B^T = left( begin{array}{cccc} 0 & 0 & 0 & 1 end{array} right).

The transfer matrix is (matrix exponential, not element-wise exponential)

Phi(t,t_0) = e^{A(t-t_0)} = left( begin{array}{cccc} 1 & 0 & t-t_0 & 0 \ 0 & 1 & 0 & t-t_0 \ 0 & 0 & 1 & 0 \ 0 & 0 & 0 & 1 end{array} right)

Calculating the solution of the differential equation gives

x(t) = Phi(t,0) x(0) + int_0^t {Phi(t,s) B u(t) ds}

where u(t) = -g and x^T(0) = left( begin{array}{cccc} 0 & 0 & v_x & v_y end{array} right). The parameters v_x and v_y are initial velocities of the projectile.

The solution becomes

x(t) = left( begin{array}{c} v_x t \ v_y t - frac{g t^2}{2} \ v_x \ v_y - g t end{array} right)

and the time when the projectile hits the ground is given by

p_y(t) = v_y t - frac{g t^2}{2} = 0 qquad t > 0

as

t_{y=0} = 2 frac{v_y}{g}.

A visualization of the trajectory given v_x = 1 and v_y = 2 with gravity g = 9.81 shows an example of the motion of the projectile:

trajectory

Now, if assume that the initial state x(0) can be described by a joint Gaussian distribution we can use the formula shown earlier to say that

f(x,t) = f_0left[x(0)=h^{-1}(x,t)right] left|Jright| = frac{1}{sqrt{left|2 pi Sigma right|}} e^{-frac{1}{2}(x(0)-mu)^T Sigma^{-1} (x(0)-mu)},

where left| J right| = left| Phi^{-1}(t) right| , mu^T = left( begin{array}{cccc} 0 & 0 & v_x & v_y end{array} right) and

Sigma = left( begin{array}{cccc} 0.00001 & 0 & 0 & 0 \ 0 & 0.00001 & 0 & 0 \ 0 & 0 & 0.01 & 0 \ 0 & 0 & 0 & 0.01 end{array} right)

which means that we have high confidence in the firing position but less in the initial velocity.

We are only interested in where the projectile lands and we can marginalize the velocities to get:

fleft(p_{x_1},p_{y_1},tright) = int_{-infty}^{infty} int_{-infty}^{infty} f(x,t) dp_{x_2} dp_{y_2}

which when plotted gives

trajectory_landing

Since we have used the landing time for the deterministic trajectory, we get a spread across the y-axis as well (the ground is located at p_y = 0). We could marginalize the y-direction as well to end up with:

trajectory_landing_1d

This shows the horizontal distribution of the projectile at the time when the deterministic trajectory of the projectile is expected to hit the ground.

Conclusion

Given a set of ordinary differential equations, it is possible to derive the uncertainty of the states given a probability distribution in the initial conditions. There are two other important cases to look into as well: stochastic input signals and random parameters.

Bayesian Regression Using MCMC

[redirect url=’http://wp.combine.se/bayesian-regression-using-mcmc/’ sec=’0′]

Introduction

Bayesian Regression has traditionally been very difficult to work with since analytical solutions are only possible for simple problems. Hence, the frequentist method called “least-squares regression” has been dominant for a long time. Bayesian approaches, however, has the advantage of working with probability distributions such that the estimated parameters of a model are represented by a probability distribution (formally we talk about “likelihood” for parameters and “probability” for random variables).

For more complex models it is possible to solve the Bayesian problem numerically using, for example, MCMC (Markov-Chain Monte-Carlo). It is a computationally expensive method which gives the solution as a set of points in the parameter space which are distributed according to the likelihood of the parameters given the data at hand.

Regression Example

For this example, we are working with a linear model of the form f(x)=a+bx +epsilon, where epsilon sim N(0,sigma^2) (normal distributed noise).

First, we need to generate some random data starting choosing 50 samples where a=1, b=2, sigma^2=1 :

data1_50

One simple way to solve the MCMC-problem is the Metropolis-Hastings method. It is based on evaluating changes in the posterior likelihood function one parameter at a time doing a random walk trying to stay in a region with high probability all the time. If the likelihood is multi-modal, it is, of course, possible to get stuck in one mode.

The resulting estimated likelihood given 100,000 samples for the linear regression is shown below where the red dot represents the highest likelihood, and the blue dot is the real parameters. The contours show a smoothed kernel estimate of the density of the distribution. Note that there is a slight covariance between parameters a and b which means that if you change one of the parameters the other has to change as well.

dist1_50

It turns out that the maximum likelihood of the MCMC-estimate and the Least-Squares method gives the same result, which is expected since maximum likelihood and least-squares are equal in the presence of Gaussian noise.

regression1_50

This example has just been a simple demonstration of how to find a good fit for model parameters given some data measurement. Based on the likelihood plots above we obtain some understanding of the sensitivity of changes in the parameters and if they are likely to be correlated to obtain maximum likelihood.

In the presence of non-gaussian noise and high dimensional complex models, MCMC might be your only way to obtain a solution at all at the cost of long durations of computation.

Progressive Self-Exploring Design of Experiments

Introduction

In a classical design of experiments (DoE) you usually choose a set of points according to some rule and perform experiments to be able to, for example, create a response surface. But when the properties of the process you are trying to describe is difficult to understand and can be destroyed if wrong parameters are applied we have to try something different.

One solution could be to build a predictive model each time a new sample has been taken and decide where to take the next sample given information taken from the updated model. I am going to show you how Gaussian Processes (see the introduction) can be used to collect samples efficiently. In short, the algorithm teaches itself how the process works by asking the correct questions based on what is known, slowly expanding its knowledge safely.

Ingredients

The properties of the Gaussian Process relies on the chosen kernel. In this example, the squared exponential is used which for e^{-x^2} looks like:

doe-1d-kernel

This kernel is used to control the curvature of the estimated function.

The formula for estimating the conditional distribution of the Gaussian Process gives us an expression the covariance:

text{cov}(mathbf{f}_*)=K(X_*,X_*)-K(X_*,X)left[K(X,X)+sigma_n^2Iright]^{-1} K(X,X_*)

What is nice about this formula is that it is not dependent on any measurements. Given a kernel and a set of hyperparameters you only need to decide where you want to measure to understand what uncertainty you should expect when predicting the function. This fact makes it possible to design a space-filling experiment design for a given assumption of the properties of the model.

Now recall when we have some measurements we can generate a model such as:

doe-1d-gp-example

The gray area shows one standard deviation. When the standard deviation is small, we can make good predictions about the function while higher standard deviation indicates that we lack information. Given the four samples, we should be tempted to measure where the standard deviation is high. Just looking at the standard deviation as a function clarifies this thought:

doe-1d-std

If we have defined a limited domain on the horizontal axis, it should be straightforward to choose the point with the highest standard deviation. This is ok as long as the process cannot be destroyed for a set of parameters. Assume that we do not know exactly for which parameters we reach safety limits, then we need to expand slowly from the measurements we are aware are safe. One way of doing this is to use the squared exponential kernel to include an allowed action radius. Drawing some kernels around the measurements looks like:

doe-1d-kernels

And if we take the maximum of these four functions we get:

doe-1d-kernels-max-kern

 

Notice that the maximum is small between the two points on the left while the kernels are smeared together on the right since they are closer together. This function can be used to describe how safe it is to measure at a given set of parameters.

We can now combine the kernels with the standard deviation by taking the product ending up with:

doe-1d-kernels-std-max-product

Now we are encouraged to measure in the vicinity of each data point, but not too close and not too far away. Since the standard deviation is lower when points are closer to each other exploration is often prioritized before refining.

Simulation

We are going to try to generate a model of the function f(x) = (x-0.5)^2 + (x+0.5)^2 + sin(1.1 pi x) on the interval x in [-2,2] constrained by f(x) < 4 as seen here:

doe-1d-test-function

We need to have some knowledge about the process to be able to give the process one or several safe points to start from. We are going to start with $x_0 = 0$ and the goal is to obtain a sequence of $p_i = (x_i, f(x_i))$ for which we can predict the function with good precision.

To find a new candidate we need to have a set of candidates to choose from. The set of candidates are generated using a space-filling random algorithm, in our case the Sobol sequence.

Here is a sequence of 21 samples taken using the method described above.

fitting1

Notice how the algorithm is cautious to start with and then starts expanding to the right and left, occasionally going back to refine the model instead of exploring. It also does not violate the condition f(x) leq 4.

Final Discussion

Progressive sampling is useful when the process you want to describe is nonlinear and when you need to avoid breaking any constraints. The method scales well to many dimensions and can be automated in actual physical testing environments. We can also handle noisy measurements which would result in slower propagation since the uncertainty of predictions would be larger.

We could add additional constraints which are tailored to the problem at hand, for example scaling the width of the kernel depending on the estimated magnitude of the gradient for each measurement or adding other functions which control how samples are chosen.