Life in a computer

Life is complicated, you probably know that. If we take a magnifying glass and look at a living thing from a chemical and biological perspective, it is astonishingly complicated. In this blog post, I will walk through an example of a process that occurs in all living things and how we can study this process with a computer. In fact, I will demonstrate that by using clever approximations, simple statistics and robust software, life does not have to be complicated.

Setting the stage

The process that we will look at is the transport of small molecules over the cell membrane. That sounds complicated, I know. So let me explain a little bit more. Each cell, in every living organism, is surrounded by a membrane that is essential for cell viability (see Figure below). It is also important for the cell to transport small molecules across the membrane. This can be for example nutrients, waste or signals.

If we can understand this process, we can utilize it to our advantage. We can design new drug molecules that enter the cell, fix the broken cell machinery and heal diseases. We can also design better biofuel-producing cells and assess environmental effects, but that is another story.

blog-cell

Here, we want to estimate how fast a molecule is transported. We will use the following assumptions that make the modeling much easier:

  • We will assume that the small molecules cross the membrane by themselves
  • We will approximate the membrane as a two-phase system, ignoring any chemical complexity
  • We will model one phase as water and the other as octanol, an alcohol (see Figure above)

By making these assumptions, we can reduce our problem to estimate the probability of finding the small molecule in octanol compared to water. In technical language, we are talking about a free energy or a partition coefficient, but it is good to keep in mind that this is nothing but a probability.  

multivariate regression model

We will use a very simple linear regression model to predict partition coefficients. You will soon see that this is a surprisingly good model.

In a regression model, we are trying to predict an unknown variable Y given some data X. This is done by first training the model on known Xs and Ys. In the lingo of machine learning X is called features and is some properties of the system from which we can predict Y. So, what are the features of our problem?

Recall that we are trying to predict partition coefficients of small molecules, so it is natural to select some features of the small molecules. There are many features available – over one thousand have been used in the literature!

We will use three simple features that are easy to compute:

  1. The weight of the molecule
  2. The number of possible hydrogen bonds (it will be called Hbonds)
  3. The fraction of carbon atoms in the molecules (Nonpolar)

If a molecule consists of many carbon atoms, it does not like to be in the water and prefers octanol. But if the molecule, on the other hand, can make hydrogen bonds it prefers water to octanol.

Our regression equation looks like this:

"Partition coefficient" = c0 + c1"Weight" + c2"Hbonds" + c3"Nonpolar"

and our task is to now to calculate c0c1c2, and c3. That is just four parameters – didn’t I say that life wasn’t so complicated!

We will use a database of about 600 molecules to estimate the coefficients (training the model). This database consists of experimental measurements of partition coefficients, the known Ys. To evaluate or test our model we will use some 150 molecules from another database with measured partition coefficients.

Sympathy for Data

To make and evaluate our model, we will use the open-source software Sympathy for Data. This software has capabilities to for example read data from many sources, performing advanced calculations and fitting machine learning models.

First, we will read in a table of training data from an Excel spreadsheet.

blog-sy1

And if one double-clicks on the output port of the Table node, we can have a look at the input data.

blog-sy2

The measured partition coefficient is in the Partition column and then we have several feature columns. The ones that are of interest to us is Weight, HA (heavy atoms), CA (carbon atoms), HBD (hydrogen bond donors) and HBA (hydrogen bond acceptors).

From HA and CA, we can obtain a feature that describes the fraction of carbon atoms, and from HBD and HBA, we can calculate the number of possible hydrogen bonds. These feature columns will we calculate using a Calculator node.

blog-sy3

In the calculator Node, one can do a lot of things. Here, we are creating two new columns Hbonds and Nonpolar. These columns are generated from the input table.

Next, we are using the machine learning capabilities of Sympathy for data to create a linear model. We are selecting the WeightHbonds, and Nonpolar columns as the X and the Partition column as the Y.

blog-sy4

If one double-clicks on the output port of the Fit node, we can see the fitted coefficients of the model.

blog-sy5

Remember that many hydrogen bonds tell us that the molecule wants to be in the water (a negative partition coefficient) and that many carbon atoms tell us that the molecule wants to be in octanol or the membrane (a positive partition coefficient). Unsurprisingly, we see that the Hbonds column contributes negatively to the partition coefficient (c2=–1.21) and the Nonpolar column contributes positively to the partition coefficient (c3=3.91).

How good is this model? Let’s read the test data and see! There is a Predict node in Sympathy for data that we can use to evaluate the X data from the test set.

blog-sy6

By using another Calculator node, we can compute some simple statistics. The mean absolute deviation between the model and experimental data is 0.86 log units, and the correlation coefficient R is 0.76. The following scatter plot was created with the Figure from Table node.

sy_model

This is a rather good model: first, the mean deviation is less than 1 log unit, which is about the experimental uncertainty. That is, we cannot expect or trust any lower deviation than this because of experimental error sources. Second, the correlation is significant and strong. It is possible to increase it slightly to 0.85 – 0.90, using more or very advanced features. But what is the point of that? Here, we are using a very simple set of features that we easily can interpret.

What’s next? You could use this model to predict the partition coefficient of a novel molecule. Say you are designing a new drug molecule and want to know if it has good transport properties. Calculate three simple features and plug it into the model, and you have your answer!


The data and the Sympathy for data flows can be obtained from Github: https://github.com/sgenheden/linear_logp

If you want to read more about multivariate linear regression: https://en.wikipedia.org/wiki/General_linear_model

If you want to read more about partition coefficients: https://en.wikipedia.org/wiki/Partition_coefficient

The picture of the cell was borrowed from www.how-to-draw-cartoons-online.com/cartoon-cell.html

The vHIL – The new sibling in the loop family

[redirect url=’http://combine.se/the_vhil/’ sec=’0′]
Since the introduction of separable software components and virtual testing, the development of software for mechatronic systems is taking place parallel to the procedure of producing the hardware. The progress has made it possible to shrink the time for development and also gain knowledge, through testing, at an earlier stage of the process.

The process of testing the software includes both virtual testing, as mentioned above, and actual physical testing with the “for-the-purpose” designed hardware. The whole process of testing can be compared to a multistage rocket, where a sequence of “in-the-loop”-tests are pinpointing different areas within the software development.

In this post, we will look into this family of “in-the-loop”-tests and present the different members. Most of all, we will present the latest sibling in the family, the vHIL, and the introduction of virtual processors into the “in-the-loop” environment.

Physical testing

Still, physical testing is important, since the final goal of the software is to control some physical hardware. However, physical testing has its drawback. To perform physical testing, one needs plants, physical test prototypes, which can be both expensive and time-consuming to produce. Because of this, the number of plant objects is highly restricted. Also, they might have to be shared among multiple development teams so that the effective time for testing is greatly limited. Add to this, the plants often are hard configured, which means that to test another set-up; it is necessary to send the plant to the workshop to be modified, a time-consuming and error-prone procedure.

Physical testing is a kind of a gamble since this can be the first time the software, uploaded on the ECU connected to the plant, is in contact with any physical hardware and the software can have one or several bugs that can damage the plant. Since the pressure after test prototypes can be very high, one wants to avoid the risk of damage the plants. Therefore, physical testing is first taking place late in the development process when the software has gone through some security checks; it is here the purpose of “in-the-loop” tests comes into the picture.

MIL – Model in the loop

The whole purpose of the “in-the-loop” process is to have the software as mature as possible when it has its first contact with the physical plant. At the first stage, the control software components, or control models, are tested; one checks the behaviors of the control algorithms against virtual plant model. The purpose of these virtual plant models is to simulate the physical behavior of physical test objects. Often, the software components are limited to control a specific task; which also limits the range of the virtual plant model.

The testing starts by considering single components, unit testing, and as they start to function properly, one starts to consider larger and larger groups of components. With the groups, it is possible to test the interfaces and the communications between the components.

When the tests show that the behavior is satisfying, the control algorithms are implemented. The implementation can either be performed manually by writing C-code or by using code-generating tools, like Simulink.

SIL – Software in the loop

At the software in the loop stage, the implementation of the control algorithms is validated. The compiled code is here simulated against the same virtual model of the plant used in the MIL testing. In this way, it is possible to compare the results from the two stages to control that the implementation has not caused any differences in the behavior of the system.

The problem with the SIL is that the C-code is compiled for the workstation the engineer is sitting at, not for the actual microcontroller. Between these, there are several differences, for example when it comes to precision and different data types. The microcontroller is also much limited when it comes to memory and performance.

PIL – Processor in the loop

The third stage is not that commonly used. The purpose here is to remedy the shortcomings in the SIL environment and compile the code against the microcontroller architecture. Again, the implementation of the control algorithms into compiled C-code is tested and the same setup of testing against virtual plant models, as in the first two “in-the-loop” stages, is used. Except testing the compiled code, it is also possible to check memory allocations and executions times.

HIL – Hardware in the loop

It is first at this stage; the code is leaving the host PC and is uploaded to a real ECU/MC. When leaving the host PC, the control algorithms can no longer be handled as single units or small groups of units. Instead, they are one component of many in software stacks. Here, they share space with other software components, like other control algorithms, but also with other software layers, like OS, drivers, and middleware.

To get the software stack compiled and the components in it to function together as a unit, will be the first task in the procedure to set up an HIL environment. It is not uncommon that it requires multiple iterations through all previous “in-the-loop” stages to have a mature software stack ready for the HIL environment.

The actual HIL environment consists of an ECU/MC connected to the virtual plant model through an HIL simulation box, which speeds up the simulation of the virtual plant model to real-time. The problem is that this environment suffers from some of the drawbacks common to physical testing; expensive setups shared among multiple teams doing integration and testing.

vHIL – Virtual hardware in the loop

Lately, there have appeared fast simulation models of digital ECU/MC hardware on the market, so-called virtual prototypes or VPs. These are executing the same binary software as the target MCU and can be simulated against the same virtual plant models used for the MIL, and SIL testing. This new environment is the vHIL.

It should be stated that the vHIL will not replace the HIL. Instead, it is smoothening the path between the SIL and the HIL. Since it does not require any physical ECU, the vHIL can be up and run early in the process and deliver a more mature software stack to be integrated into the HIL environment, reducing the bring-up time from weeks to days, which gives more time for actual HIL testing.

With the VP, one can achieve better control and visibility compared to testing on a real ECU/MCU. Many of the simulation platforms for the VP give the opportunity to debug, analyze and automate the testing procedure. It is possible to set breakpoints in the software for debugging, which halts the whole simulation. From these breakpoints, one can step through the software row by row and jump in and out of subroutines. The platforms can also give you a visualization of the communication paths in the software and check the code coverage during a test.

With the vHIL, more tests are performed early in the process, since one do not need to wait for a physical prototype of the ECU. Also, the vHIL is easier to scale up and have running at multiple virtual locations at the same time. More testing at an early stage will result in higher quality testing in the HIL environment or the physical testing. Also, it is possible, with the vHIL, to prepare the tests for the upcoming HIL in advance, compose and test them in a virtual environment and have them ready when the prototypes of the physical ECUs appear.

Two examples of VP platforms are the Virtualizer from Synopsys and the Vehicle System Integrator, VSI, from Mentor Graphics.

The next level of virtual verification?

[redirect url=’http://wp.combine.se/the-next-level-of-virtual-verification/
‘ sec=’0’]
This first blog post will be a brief survey about the use of virtual verification within the development of mechatronic systems. There will also be some considerations about future concepts within the field, which will give you some clues about possible topics for the upcoming posts.

This time, we will consider the development of an arbitrary mechatronic system, a system consisting of both hardware and software. Into the hardware, we count all physical components that range all the way from processors and integrated circuits through actuators and sensor to engines and ventilation circuits. While with the term software, we refer to the embedded code uploaded on processors and integrated circuits.

SystemDevelopment1

Back in the days, which is not that many years ago, the graph in the figure above could have been a good schematic view of a development process. Here, we have time along the x-axis, a start point to the left and a delivery point to the right. At this period, it was more or less necessary to have a sequential process, where the actual hardware had to be available before the development of the software could be started.

In this graph, it can be noticed that,

  • The knowledge about the system, the green curve, is increasing with time. The knowledge is obtained by testing different solutions and the more tests that can be performed, the more knowledge the developers will obtain about the system.
  • The possibility to make changes, the red curve, is instead decreasing with time. The closer one gets to the point of delivery the more limited is the possibility to make changes. The short period left to the delivery makes it hard to get large changes in place and even small changes can rock the foundation of the rigid structure that the system has become close to the delivery point.
  • The yellow marked area between the two curves and x-axis, within the considered developing time, is a measure of the effective work that can be performed during the process.

Clearly, the goal should be to maximize the efficient work during the process. The more useful work, the more tests can be performed and more bugs and faults can be found. Fewer bugs and errors result in better quality of the system and a better final product.

One important variable that we have disregarded in this graph is the cost. We do all know that there always exist alignments within companies whose main responsibility is to keep the costs as low and the income as high as possible. One efficient way to obtain this is to reduce the development time, to shortening the time-to-market, which is exactly what is visualized in the figure below.

SystemDevelopment2

Directly, two significant consequences can be noticed. First, the amount of productive work is more limited. Secondly, the sequential procedure, where the software development starts first after the hardware is in place, does not longer fit within the time for development.

The introduction of Model-Based Design, MBD, has made it possible to separate the software into different components. Some of these are in direct contact with the hardware and are interacting with it. While others, like the controller software components C-SWC, which have the purpose to control the behavior of some physical quantities, have several layers of software between themself and the hardware. To test the C-SWC, it might not be necessary to have the actual device in an early stage of the project. Instead, virtual models describing the dynamics of the physical quantities, and how they are perturbed by other quantities, can be the object to test the C-SWC code against, a so-called plant model.

IntroVirtualTesting

The entry of plant models and separable software components made it possible to start the development of the software earlier and test it on desktop computers instead on physical test benches. The effect of the virtual testing is visualized by the graph in the figure above. Before virtual testing was introduced, one had to wait to have an available test bench before testing the software, described by the blue curve in the graph. With the introduction of MBD, “only” plant models were needed to verify part of the software for bugs and faults in a virtual environment. The bugs and errors are found at an earlier stage of the process, which is what the red curve is telling us. Still, there is a need for physical testing, but the amount has now been reduced.

SystemDevelopment3

The look of the schematic view of the development process changes with the introduction of a virtual testing, see the figure above. The curve for obtaining knowledge about the system has a steeper behavior at the beginning of the process. This corresponds to the possibility to perform virtual tests at an earlier stage. Faster obtained knowledge increases the effective work performed during the process, and the question now is, how can one get even more knowledge at an early stage?

One way is to increase the number of tests that are performed and a virtual environment is an ideal location for performing tests in large numbers, see the figure below. A physical test bench is usually designed for a specific type of test, if one wants to do something outside its specification one has to rearrange the setup or build a new test bench. This can be both expensive and cost a lot of time. With virtual testing, a new test bench can just be some lines in a script away, which makes it simple to switch between test configurations and set up automated processes.

Virtual_VS_Physical

A growing field within virtual testing is Model-Based Testing, MBT, where software algorithms are used to design the test cases, run the test procedures and analyze the result. These algorithms can automatically produce a substantial number of test cases and do even feedback information from the results back to the process in order to create new and better test cases. An example is the TestWeaver algorithm that is described to play chess with the system under test 1.

Testing a system under test (SUT) is like playing chess against the SUT and trying to
drive it into a state where it violates its specification.

Most, if not all, applications presented so far have been introduced to benefit the software development. Plant models and separable software gave the developers access to the virtual test benches. Will it also be possible to use virtual hardware models to actually improve the development of the physical hardware, as well as the software?

If the virtual hardware can be in place early in the process, it would be possible to test combinations of different components in virtual test benches and obtain early knowledge for both hardware and software developers. This will, of course, require much more detailed models of the hardware that exist today, including non-trivial behaviors and limitations that could be triggered from the virtual test environment.

Hardware models of fine granularity will benefit the development of both the hardware and the software. With a structure of common virtual test benches, into which both hardware and software teams are delivering models, it will be possible to test the robustness of the systems in new ways. For example, how the software will react to signals coming from hardware components that are old and not functioning perfectly anymore? Or, how the hardware components should be designed in order to hold for the large forces which can appear with rapid actions from the control algorithms? To be able to test these kind of scenarios at an early stage will not only generate knowledge within the hardware and software teams themself, but also put the teams closer together to make it possible for them to find solutions together.

IntroVirtualHW

Model-Based Testing and virtual hardware are both two examples of concepts that will increase the knowledge about the system at an early stage and decrease the need for expensive physical test environments.

This slideshow requires JavaScript.

Looking into Maximum Spacing Estimation (MSP) & ML.

The maximum spacing estimation (MSE or MSP) is one of those not-so-known statistic tools that are good to have in your toolbox if you ever bump into a misbehaving ML estimation. Finding something about it is a bit tricky, because if you look for something on MSE, you will find “Mean Squared Error” as one of the top hits. The wikipedia page will give you a pretty good idea, so click here to check it while you are at it.

Here is a summary for the lazy: MSE or Maximum Square Estimation is about getting the parameters of a DF so that the geometric mean of “spacings” in the data are maximized. Such “spacings” are the differences between the values of the cumulative distribution function at neighbouring data points. This is also known as the Maximum Product of space estimations or MSP, because that is exactly how you calculate it. The idea is choosing the parameter values that make the observed data as uniform as possible, for a specific quantitative measure of uniformity.

So we can explain what happens here by means of an exercise. Of which we made a notebook on our github (click here). Suppose we have a distribution function. We start with the assumption of a variable X with a CDF $F(x;\theta_0)$, where $\theta_0 \in \Theta$ is an unknown parameter to be estimated, and from which we can take iid random samples. The spacings over which we will estimate the geometric mean ($D_i$) are the the differences between $F(x(i);\theta)$ and $F(x(i-1);\theta)$, for i [1,n+1]. For the giggles, let’s say our df is a Pareto I, with some shape parameter α=3.0 and a left limit in 1. This α is the θ parameter that we intend to estimate. We can draw some samples from our distribution and construct a CDF out of the samples. We can do a similar process for other values of shapes, and plot those CDFs together, which will end up looking like this:

paretoI

The first thing you will observe is that the bigger the alpha, the “closer to each other” this distributions look. For instance, the CDFs for α=5.0 and α=2.5 are much more closer to each other than the CDFs for α=2.5 and α=1.0. And that is an interesting fact to consider when using this estimator. It will probably be easier to get a “confused” result the higher the α parameter gets. So α=3.0 is not exactly the easiest choice of values for “messing around and looking at how our estimator behaves”, but is not too difficult either.

Now, back to the estimator. In this post we made a very simple exercise to look at how ML and MSE behave when estimating a value of an α parameter in a Pareto I distribution. The choice of shape parameter and distribution are completely arbitrary. In fact, I encourage you to take my code and try other distributions and other values yourself. Also to make my small laptop’s live easier, selected a subset of α values over which the search was made. Then, we obtained the best scores on each method for different sample sizes:

10,50,100,300,500,10000

And for each sample size and each method, we repeated the estimation 400 times. So, a box-and-whisker plot of this estimation together with a scatterplot for each sample size and each method can be seen here:

scatter

The dotted line in α=3.0 is where the shape parameter of each original sample is. As you can see here, ML behaved much better than MSE for small sample numbers. In both cases for small numbers of samples there was some skewness towards the bigger values. This is expected since the CDFs of this distribution get closer the higher the value of the shape parameter. ML also behaved better than MSE for 50, 100 and 300 samples. Now, this was only one result for this distribution and for α=3.0. A more definitive quantitative evaluation would require looking at more distributions, and at different points of those distributions. There was a paper suggesting that “J” shaped distributions were the strong point of MSE. I guess a more thorough quantitative valuation of MSE vs ML would be in order for a decision here. And it will also be worthy of a journal paper in addition to a small blog post such as this one ;). You can cite me too if you like to use my code ;).

When I first found this estimator I have to admit that it caused a bit of infatuation in me. Some mathematical concepts carry themselves with such beauty that you can’t help feeling an attachment. And you see everything about them with rose-tinted lenses. It made me wonder why on earth is this concept not as well known as ML. It is not super much more expensive depending on how you calculate it. For ML you need a density function, while for MSE all you need is a cumulative distribution function. And that alone is a powerful point, because all random variables have a CDF, but not all have a PDF. Granted, most distributions you will work with will probably have a PDF. Granted, ML is the workhorse for estimators, many toolboxes have it implemented. And is super easy to teach to undergrads. And it works well. In fact, it worked better than MSE for this particular example. So in spite of all the beauty, maybe the fitness function for mathematical concepts to last posterity is not beauty or elegance, but applicability.

BONUS POINTS IF YOU ARE LOOKING TO WORK WITH US: Blow my mind. Try this exercise in other distributions. Make other comparisons. Make me stop loving MSE. Or make me love it even more. I have to do something about the butterflies in my stomach! You are my only hope!

As always, you can find the code for generating the plots of this post in our notebook (click here).

Image taken from stocksnap.io.

Special Announcement: We are hiring!

 

image from stocksnap.io/photo/DDYC9U7O2P

 

Hi all! you may have been following the blog for some time, or this may be your first visit. But just the fact that you are here, taking some time to read and learn new things is awesome. Well, guess what: we want to hire awesome people who enjoy learning new things. More specifically about data science.

Now, what do we mean with “data science”? there are a number of tasks that every person working with data must have done: at some point you must have decided what kind of data you needed for your task; you must have had an idea for how to collect and transform that data to a form in which you can apply some nice math / statistics to it, and finally you must have found a way to summarize interesting aspects of the whole ordeal. 

Some people find different parts of this process more confortable than others. Some people are better at communicating and telling stories, while others are more into applying math over the data and see what comes out; others just love designing and optimizing experiments; others love building tools for data extraction and conditioning. And the thing is: all of this is data science. All of it. And when a company wants to hire a person in data science, is important to specify what do they want, because all these tasks require different sets of abilities which are very difficult to find in a single individual. 

We are putting together a team. The engineers we are looking for thrive as a team players, and have a genuine interest in data analysis, statistics and mathematics. If you are one of them, you have built a sufficiently good programming base (preferably in Python and/or R) which allows you to learn and test ideas by yourself. We want you to be free to mold your mind into what you want, and have a good time.

The Team Roles: 

Data Analyst: Is concerned with looking at the data and the context around the data, and telling a story by analysing both. This engineer knows what type of visualization is best to use, for which audience, and how to build it. This engineer can also choose the best statistics to summarise a dataset, and can help organizations build suitable Key Performance Indicators.

Data Engineer: Is concerned with extracting, molding, recommending and building hygiene methods and choosing computing frameworks to work with data. Data nowadays can be found on all kinds of formats and be stored in different ways. It can be streaming or offline. The needs of a client may have been satisfied by optimising how your programs process the data, or they may have pushed the envelope of what can be accomplished by CPUs, possibly making this engineer look into GPUs. This engineer is responsible with providing quality data over which data analysis and data science can be applied.

Data Scientist: Is concerned with building quantitative models out of data, setting and verifying the assumptions for the models, testing and maintaining models, and choosing data collection strategies and instruments. This engineer knows what things can go south very quickly when the underlying assumptions for a model are no longer valid, and is responsible for clearly communicating his team what those assumptions are.

Must-have qualifications:

  • M.Sc. in engineering with a focus on one or more of the following disciplines; statistics, mathematics, computer science, applied physics, mechatronics, electrical/electronic/nuclear engineering.
  • Fluent English
  • Experience in Python or R.

The Attitude:

  • You like having fun!
  • You are friendly!
  • You respect and trust your team as much as your own knowledge.
  • You shoot for the stars, yet can gracefully land on the moon if needed.
  • You learn on your own, yet you ask for help when you need to.
  • You don’t take all statements for facts: when reason exists, you verify ground truths and communicate your findings to those concerned.

Nice-to-have:

  • Experience from working in teams
  • Customer-oriented experience
  • Experience with community projects (Github, CRAN, StackExchange community, etc)

if this sounds a lot like you, do not hesitate to apply with your CV and cover letter here:

https://www.linkedin.com/jobs2/view/254519957