On types of randomized algorithms

There is more to randomized algorithms than Monte Carlo. It is not uncommon to see the expressions “Monte Carlo approach” and “randomized approach” used interchangeably. More than once you start reading a paper or listening to a presentation in which the words “Monte Carlo” appear in the keywords and even in the title, and as you keep reading/listening, you notice in the algorithm description a part in which the runs with incorrect outputs are discarded until only correct outputs are given… which effectively turns a Monte Carlo algorithm into a Las Vegas one. The running time is no longer deterministic-ish, and the algorithm will only provide correct answers. So, they say/write “Monte Carlo” here, “Monte Carlo” there, and when it comes to what actually happens, you are not really in Monte Carlo, and you might be in Vegas, baby.
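To make the distinction concrete, here is a minimal sketch in Python (the problem object, its candidates and its verify method are made up for illustration):

import random

def monte_carlo(problem):
    # Fixed running time, but the output is only correct with some probability.
    return random.choice(problem.candidates)

def las_vegas(problem):
    # Keep discarding unverified outputs: the answer is always correct,
    # but the number of iterations (and so the running time) is now random.
    while True:
        guess = monte_carlo(problem)
        if problem.verify(guess):
            return guess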

You can do a small check yourself. Use your favorite search engine with “monte carlo simulation” and “las vegas simulation”. Depending on your search engine, you may get something looking more or less like this:
approx. 84,200,000 results (0.32 seconds) – “monte carlo simulation”

approx. 9,560,000 results (0.56 seconds) – “las vegas simulation”

Almost a whole order of magnitude more relevant results for Monte Carlo, in almost half the time. Now, if you used something like Google, you may or may not get relevant results on your first screen, depending on how well Google knows you. Of course, there are the cases of “monte carlo simulation” that most likely refer to other topics, e.g. Monte Carlo integration and the like, and the same goes for “las vegas simulation”. And depending on how much your search engine knows your habits, you might not get any results about Las Vegas simulations right away.

And probably the next thing you may do is a quick Wikipedia search for “Las Vegas algorithm”. You may read it, and find “Atlantic City algorithm” at the end. And maybe you will want to follow that little Wikipedia rabbit hole, and end up reading about that too. And then no one can save you.


The featured image in this post is from here.


Basic Statistics with Sympathy – Part 4: Building arbitrary RNGs in Sympathy

Remember your friend from our very first post? Well, I am sorry to say that he never really reached French Guiana. He ended up on Carcass, one of the Malvinas/Falkland Islands. And his boat was (peacefully) captured by overly friendly pirate penguins. Now he spends his days counting penguins and sheep. He did keep a coin and this calculator as a memento from his past life. You know, just in case.

As for you, you are still at the office, and now your task is to get some samples from a distribution that strangely resembles a cropped half-cosine. Your problem is that such a distribution is not in NumPy (gasp!). Don’t fret, we’ve got you covered. NumPy not having a distribution should not cause weeping or gnashing of teeth. In fact, your friend can easily build pretty much any distribution with the help of some math, his coin and his trusty office memento (provided that the coin is unbiased and that he has enough patience). This is possible because of the Probability Integral Transform (PIT). As part of this tutorial, we will do the exercise in slide 11 of this lecture, so you might want to open the slides to follow the steps together.

One consequence of the PIT is that, as long as you have a random variable for which a cumulative distribution function (CDF) exists, you can invert the CDF to express the random variable in terms of the uniform distribution. This means that for a very long list of distributions you can build your own random number generator, using nothing but some simple functions and a uniform distribution sampler!
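As a quick illustration with a distribution whose CDF is easy to invert, take the exponential: F(x) = 1 - e^(-x) for x ≥ 0, so F^{-1}(u) = -ln(1-u). A minimal sketch:

import numpy as np

u = np.random.uniform(0, 1, size=10000)  # samples from the uniform distribution
x = -np.log(1 - u)                       # exponential(1) samples via the inverted CDF
# x is now distributed like np.random.exponential(1.0, 10000)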

So, let’s begin with the exercise of the tutorial. We will use our trusty calculator node: place 3 calculator nodes and a dummy input (which can easily be built with a random table generator). We will not use the actual contents of the dummy input, since we will generate our own signals in the calculators. Your starting flow may look like this:

fig1

You can change the labels in the nodes by double clicking the text. Now, right click on the first calculator node (from left to right) and click “configure”. Make a variable that will be your independent axis throughout the calculations. If you look at the exercise, the range of the pdf is (-pi/2, pi/2]. Note that the lower bound is open. I will name the variable x, and you can build it in the calculation field with: np.linspace(-np.pi/2,np.pi/2, num=100)[1:]. You can then double click on the node to run it, so that “x” is available as an input for the next node, which you will use to build the pdf.

In the calculation field of the node that builds the pdf, you simply need to enter the same constraints that appear in the exercise: the pdf is 1/2 cos(x) for x ∈ (-pi/2, pi/2]. The open lower bound is easily dealt with by removing the first sample of every variable (which the [1:] above already did).

np.array([0 if x > np.pi/2. or x <= -np.pi/2. else 0.5*np.cos(x) for x in ${x}])

I was first trying out how the plots would look if I allowed more than the non-zero range to be generated. But you don’t need to do that. You can just do this instead:

np.array([0.5*np.cos(x) for x in ${x}])

Since we will build a CDF from this PDF, and we will want to plot them together, it is convenient to let the previous signal pass through. You can simply write a variable named “x”, and drag the variable name “x” from “Signals” into the calculations field. Your configuration screen will look something like this:

fig2

We will use densFunc and x to evaluate how well our approximation behaves in the end.

Now, building a CDF from this easily integrable analytic function is quite simple. You can go the analytic or the numeric route. For this tutorial we will also show the numeric route, because not all pdfs you may run into will have an easily integrable analytic form. We use the trapezoidal rule, just to show that the PIT will approximate the original random variable X even with a not-so-smooth approximation of the CDF. To build the CDF, we take the integral over each segment and add it to the sum of all the previous segments, like so:

np.array( [ ( (${x}[index]-${x}[index-1]) * (${densFunc}[index]+${densFunc}[index-1])/2.) + np.sum([ ( (${x}[indexK]-${x}[indexK-1]) * (${densFunc}[indexK]+${densFunc}[indexK-1])/2.) if indexK>0 else 0 for indexK in range(index) ]) if index>0 else 0 for index in range(0,len(${x})) ])
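As a side note, the expression above recomputes the running sum for every index, which makes it O(n²) in the number of samples. An equivalent (up to floating point) and shorter expression takes the cumulative sum of the trapezoid areas instead:

np.concatenate(([0.], np.cumsum((${x}[1:] - ${x}[:-1]) * (${densFunc}[1:] + ${densFunc}[:-1]) / 2.)))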

To analytically build the CDF, you can start by integrating 1/2 cos(x) dx, which gives you 1/2 sin(x) + C. To find C, use the boundary condition CDF(pi/2) = 1: from 1/2 sin(pi/2) + C = 1 you get C = 1/2, so 1/2 sin(x) + 1/2 is the function from which you can generate a CDF.
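In a calculator node, the analytic CDF can then be written directly as:

np.sin(${x})/2. + 0.5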

Now, if we plot the PDF and the CDF together (you can use the plot tables node, and choose x as the independent axis), they will look like this:

fig3

Ok, NOW we have all we need to generate some data with the same distribution as X. And we will use both the numeric and the analytic ways. The general way to obtain a function of the uniform distribution is in slide 10 of the presentation, so solving for x:

u = 1/2 sin(x) + 1/2  →  x = arcsin(2u - 1)

You can add a new calculator node and feed the output of CDF to it. Let us take as many samples as there are elements in our initial independent axis (just to be able to use the same axis across different graphics), and name the result simRandVarFromU. You can enter in the calculation field:

np.array([np.arcsin(-1+2.*np.random.uniform(0,1)) for b in ${x}])
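A vectorized equivalent that skips the Python loop (keeping len(${x}) only to draw the same number of samples) would be:

np.arcsin(2.*np.random.uniform(0, 1, size=len(${x})) - 1.)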

Now, your friend on Carcass Island may actually find it easier to use the numeric way. He probably has a lot of stones and writings in the sand (drawing a lot of slots for his coin to emulate different positions between 0 and 1), and he has been spending a lot of energy scaring away the sheep and the penguins, so let’s make his life a little bit easier. He can calculate numericSim by doing the following:

np.array([${x}[np.where(${cdf}<=np.random.uniform(0,1))[0][-1]] for inx in ${x}])

See what he did there? He rolled his coin until he got a uniformly distributed number “u” between 0 and 1. Since he knows that the range of the CDF is between 0 and 1, it is safe to find the largest value of x at which the CDF is still less than or equal to “u”, and take that as an index to obtain “x”.
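Since the CDF is sorted in increasing order, the same inversion can also be written with a binary search instead of scanning the whole array; a sketch:

np.array([${x}[np.searchsorted(${cdf}, np.random.uniform(0,1)) - 1] for inx in ${x}])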

After building the two estimates of X, let us find a way to compare them. What we are ideally constructing is a variable with a distribution as close as possible to that of X. So, let’s compare the distributions of:

  • Original X
  • PIT generated X (analytical)
  • PIT generated X (numeric)

For both PIT-generated cases, we first obtained a histogram, applied a smoothing over the histogram, and then re-sampled the smooth histogram over a linear space like the one in the independent variable of the original X.

whole_flow_structure

To build the histograms:

build_histograms
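The figure contains the calculator expressions we used. A sketch of what they might look like, assuming the 50 bins used later in the post and taking the analytic estimate as an example (the numeric one is analogous):

np.histogram(${simRandVarFromU}, bins=50)[0]

for the counts, and

np.histogram(${simRandVarFromU}, bins=50)[1][:-1]

for the bin positions.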

Notice that we dropped the last bin edge, since numpy generates one more bin edge than counts. To smooth the histograms, we use a convolution with a window size of 10 samples, like this:

np.convolve(${numHist},np.ones(10)/10,mode='same')

Do this for both PIT-generated histograms. Remember to add all the other signals, to be able to make proper plots. So here is a plot of both smooth histograms and their bins:

bothsmoothHists

To be able to compare the smooth histograms and the real PDF in the same plot, we made an interpolation (to change sample size). We simply did:

resample_smooth_histograms
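The figure again shows calculator expressions; assuming the smoothed histogram and its bin positions live in variables named smoothNumHist and numBins (the names are mine), the resampling could be a simple linear interpolation onto the original axis:

np.interp(${x}, ${numBins}, ${smoothNumHist})

(possibly followed by a normalization so that its area matches the pdf’s).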

And after putting together all signals on the same table, we were able to produce this plot:

alltogether1

So, one can see that with 100 samples, a smoothing window of 10 samples, and 50 bins, we got this approximation. What would happen if we started playing with these parameters? First, changing the number of bins at this small number of samples would not make much difference. Second, getting more samples is very cheap for you at your desktop, but changing the convolution window would be cheaper for your friend in Carcass. Increasing the smoothing window to 23 samples gives us:

alltogether2

But if we really wanted to see whether the signals converge together, we should throw in a bigger number of samples. Say, 1000 samples, leaving the smoothing window at size 10:

alltogether3

I hope that you can add the PIT to your bag of tricks, and use it whenever you have the need. Until the next post!

This post borrowed the featured image from here.

Studying random variables with Doob-Martingales

Or “Martingales are awesome!”.

In a previous post, we talked about bounds for the deviation of a random variable from its expectation that build upon martingales, useful for cases in which the random variables cannot be modeled as sums of independent random variables (or in which we do not know whether they are independent, or even their distribution). We briefly explained that a martingale sequence is a sequence X_0, X_1,… such that for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

and also:

 E[X_i] = E[X_0] for all i≥0.

We discussed two bounds on the deviation from the expectation when we have a martingale (Azuma and Kolmogorov-Doob). However, the point of how to treat random variables as martingales was not touched upon.

And, how exactly can I treat any random variable as a Martingale?

Easy, cowboy. Consider a probability space (Ω,F,Pr), in which Pr is a probability measure assigned to a field (Ω,F), where Ω is a sample space and F is a collection of subsets of the sample space Ω. One can see a sample space as an arbitrary set containing all possible outcomes of a probabilistic experiment. Events are subsets of the sample space. For instance, if the experiment is a sequence of two coin-flips, then Ω contains {hh,ht,th,tt}. An event ε ⊆ Ω for the experiment of two coin-flips can be the case in which we get different values on each coin flip, i.e. {ht,th}. The collection of events in F must satisfy:

  1. ∅ ∈ F
  2. ε ∈ F ⇒ complement of ε also ∈ F (closure under complement)
  3. ε1, ε2,… ∈ F ⇒  ε1 ∪ ε2 ∪ … ∈ F (closure under countable union)

Closure under complement together with closure under countable union also implies closure under countable intersection. For a field in which F = 2^Ω, it is possible to build sequences of nested subsets of F so that F_0 ⊆ F_1 ⊆ … ⊆ F_n, where F_0 contains the empty event and F_n contains 2^Ω. Such a sequence of nested subsets of F is what we call a filter or filtration in the context of probability spaces. Given an arbitrary F_i in a filter, we say that an event ε ∈ F is F_i-measurable when ε ∈ F_i. This basically means that an event can only be measurable in a filtration if the subset of the sample space associated with that event is contained in the filtration. As a consequence of a filtration being nested subsets, if an event is F_i-measurable, then it is also F_{i+1}-, F_{i+2}-,… and definitely F_n-measurable.

Now consider a random variable X over a probability space (Ω,F,Pr). X can be viewed as a function X: Ω → R, meaning that for a given sample ω ∈ Ω, the random variable takes the value X(ω) ∈ R. In other words, the random variable associates a random sample from the sample space with a number in the real domain. Considering the F_i-measurability concept, we can say that the random variable X is F_i-measurable if for each x ∈ R, the event {X = x} is contained in F_i. That also means that X is F_j-measurable for all j≥i.

Starting from this random variable X, we can build our martingale. What can be said about X with respect to F_{i-1}? Since X is a function over the values of Ω, its value is fixed on each individual outcome, but it does not have to be constant over an event, which is a set of outcomes. However, X is well defined enough to have an expected value over the events in F_{i-1}, and such expected value is:

E[ X | F_{i-1} ]

which is the expected value of our random variable X given the events contained in F_{i-1}. This expected value is in itself another random variable: a mapping from F_{i-1} into the reals. An interesting property of this random variable is that if X is itself F_{i-1}-measurable, then E[ X | F_{i-1} ] = X, since F_{i-1} already contains the events {X = x} for all values X can take, and there is nothing left to average over. And of course, there is the case in which E[ X | F_{i-1} ] is constant: when X is independent of F_{i-1} (since then, from the rules of conditional expectation, E[ X | F_{i-1} ] is simply E[X]).

Putting everything together: take a probability space (Ω,F,Pr) with a filtration F_0 ⊆ F_1 ⊆ …, and a sequence of random variables X_0, X_1, …, so that for all i≥0, X_i is F_i-measurable. Then the sequence X_0,…, X_n is a martingale, as long as for all i≥0,

E[ X_{i+1} | F_i ] = X_i

Any subsequence of a martingale is also a martingale, relative to the corresponding subsequence of the filter. Considering that, and knowing that X is a random variable over the probability space, it is possible to define:

X_i = E[ X | F_i ]

and it follows that the sequence X_0,…, X_n is also a martingale. This is how you construct a martingale from a random variable, and it is also known as a Doob martingale. The key point is that for each 0≤i≤n, X_i is F_i-measurable (and the tower property of conditional expectation then gives E[ X_{i+1} | F_i ] = X_i).
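A concrete toy example: let X be the total number of heads in n fair coin flips, and let X_i = E[X | first i flips], i.e. the heads seen so far plus (n-i)/2. A minimal simulation sketch:

import numpy as np

n = 20
flips = np.random.randint(0, 2, size=n)  # the underlying outcomes, revealed one at a time
# X_i = E[X | F_i]: heads already seen plus the expected heads among the remaining flips
X = np.array([flips[:i].sum() + 0.5 * (n - i) for i in range(n + 1)])
# X[0] = n/2 = E[X], X[n] is the realized number of heads,
# and E[X_{i+1} | F_i] = X_i holds by construction

As more of the filtration is revealed, the conditional expectation sharpens from the plain E[X] into the fully observed value of X.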


So, I can treat a random variable as a Martingale. What can I do with that?

You could do this. And while you are at it, you can consider the following:

  • All martingale properties are applicable to this martingale sequence.
  • The expectation variation bounds with martingales discussed in the previous post are applicable. Those are powerful bounds since they are not so restrictive.
  • Construction of martingale differences is possible by letting Y_i = X_i - X_{i-1} and requiring E[Y_{i+1} | F_i] = 0

But if you want some examples, we can give you some ideas to build a relaxation of your real application.

Polya’s Urn Scheme:

Consider an urn that initially contains b black balls and w white balls. We perform a sequence of random selections from this urn, where at each step the chosen ball is replaced with c balls of the same colour. Let X_i denote the fraction of black balls in the urn after the ith trial. Show that the sequence X_0, X_1,… is a martingale.

To prove this, we will use a property of martingales by which for all i≥0, E[X_i] = E[X_0]. We will first take the trivial case, and use it to build the general case.

First, let’s take the trivial case. The value of X_0 is a constant depending on b and w, namely b/(b+w). Since the expected value of a constant is the constant itself, E[X_0] = X_0.

Now, let’s build an expression for the general case of X_i. Since at each trial we replace the chosen ball with c balls of the same colour, on every step we add (c-1) balls in total to the urn. This means that at any given point i, we will have in total:

b+w+i*(c-1)

balls. X is the fraction of black balls at a given time i, so the numerator becomes itself a random variable depending on the expected number of black balls at point i. We can then say that at a certain point i, we add (c-1) black balls with probability E[X_{i-1}]. We use E[X_{i-1}] since the fraction of black balls in the urn (X) is the probability of drawing a black ball, but the value of that probability is a random variable itself, so we talk about the expected value of that random variable. In this way, the numerator becomes:

b + expected value of black balls at time i

which is:

b + Σ (number of black balls added on each i) * (probability of adding a black ball at i)

which for all i>0 is:

b + Σ (from k=0 to k=i-1) (c-1) * E[X_k]

Now putting numerator and denominator together:

E[X_i] = (b + Σ (from k=0 to k=i-1) (c-1) * E[X_k]) / (b + w + i*(c-1))

This expression holds for i>0. Let us first verify that this expression holds for i=1:


E[X_1] = (b + Σ (from k=0 to k=0) (c-1) * E[X_0]) / (b + w + (c-1))

E[X_1]= (b+ (c-1) * b/(b+w) ) / (b+w+c-1) = ((b*(b+w)+b*(c-1)) / (b+w)) / (b+w+c-1)

= (b*(b+w+c-1)) / ((b+w)*(b+w+c-1)) = b/(b+w)

So, E[X_1] = E[X_0]. For the nth case, assume by induction that E[X_k] = b/(b+w) for all k<n. The numerator then becomes b + n*(c-1)*b/(b+w) = b*(b+w+n*(c-1))/(b+w), so:

E[X_n] = b*(b+w+n*(c-1)) / ((b+w)*(b+w+n*(c-1))) = b/(b+w)

Again, E[X_n] = E[X_0], which is what we intended to demonstrate. Oh yeah. You can look at other examples for inspiration in section 18.2.1 of here, in the whole presentation here, and in something a bit more hardcore yet well explained here.
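If you want to see the constant expectation empirically, a quick simulation sketch (the values of b, w, c and the number of runs are arbitrary choices):

import random

def polya_fractions(b, w, c, steps):
    # Fraction of black balls after each of the first `steps` trials.
    black, total = float(b), float(b + w)
    fracs = []
    for _ in range(steps):
        if random.random() < black / total:  # a black ball was drawn
            black += c - 1                   # replaced by c black balls
        total += c - 1
        fracs.append(black / total)
    return fracs

# Averaging X_i over many runs should stay close to b/(b+w) for every i.
runs = [polya_fractions(2, 3, 4, 50) for _ in range(10000)]
print(sum(r[-1] for r in runs) / len(runs))  # ≈ 2/5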

Enjoy! Now that you read the post, you may feel like this video.

The featured image of the title comes from here.

Useful rules of thumb for bounding random variables (Part 2)

In the previous post we looked at Chebyshev’s, Markov’s and Chernoff’s expressions for bounding (under certain conditions) the deviation of a random variable from its expectation. In particular, we saw that the Chernoff bound was a tighter bound for the expectation, as long as your random variable could be modeled as a sum of independent Poisson trials. In this post, we expand to the cases in which our random variables cannot be modeled as sums of independent random variables (or in which we do not know whether they are independent, or even their distribution). For that, this post will cover two inequalities involving martingales: Kolmogorov-Doob and Azuma.

Martingales have attracted more than a few, especially those related to gambling (for some anecdotal cases, check here). The martingales we refer to in this post are sequences of random variables: a sequence X_0, X_1,… is a martingale if for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

Consequently, in martingale sequences, E[X_i] = E[X_0] for all i≥0. Think of X_i as the capital of a gambler playing a fair game after the ith bet. In a fair game, both players have the same chance of winning, so the expected gain or loss from each bet is zero. So even if sometimes you win and sometimes you lose, gains and losses should cancel each other out and converge to the initial value. Hence the phrase “quit while you are ahead”.

When we get a sequence with the following property:

E[X_i |X_0,…, X_{i-1}] ≥ X_{i-1}

then the expected value of our capital either remains the same or grows with each bet. It doesn’t mean we will never lose; it only means that time is on our side. Such sequences are known (unsurprisingly) as super-martingales.

The concept becomes even more appealing to gamblers when we define martingales in terms of the net gains or losses from the ith bet. Letting Y_i = X_i - X_{i-1} (the difference between the capital of the gambler at bet i and at bet i-1) and X_i = X_0 + ∑Y_j (from j=1 to j=i), we get something called a sequence of differences (also named martingale differences). In martingale differences, it holds that for all i≥1:

E[Y_i | Y_1,…,Y_{i-1}] = 0

There is a way to construct martingale sequences from any random variable (also known as Doob martingales). It is also possible to convert super-martingales into martingales. But maybe that is a matter to talk about in another post. Maybe titled “Martingales are awesome”.

Now, let’s get into the matter of our expectation bounds. How would a person deciding where to invest their capital feel if they could have an idea of when to “pull out” of the investment (a.k.a. the time i of expected maximum capital during the cycle)? Or of whether they have provisioned enough to survive the roughest time of the investment?

Kolmogorov-Doob bound:

Let X_0, X_1,… be a martingale sequence. Then for any λ > 0 :

Pr[max_{0≤i≤n} X_i ≥ λ] ≤ E[|X_n|] / λ

This form of Kolmogorov-Doob tells us that for a capital X, the probability of having maximum capital ≥ λ in any bet i of the whole game (a.k.a. that moment in which you are on fire) is bounded by E[|X_n|]/λ.
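A quick empirical check with a fair ±1 betting game starting at capital X_0 = 10 (all numbers are arbitrary choices):

import numpy as np

n, runs, lam = 100, 20000, 15.0
bets = np.random.choice([-1, 1], size=(runs, n))  # fair bets: the game is a martingale
capital = 10 + np.cumsum(bets, axis=1)            # X_1, ..., X_n for each run
p_hat = np.mean(capital.max(axis=1) >= lam)       # estimated Pr[max X_i >= λ]
bound = np.mean(np.abs(capital[:, -1])) / lam     # estimated E[|X_n|]/λ
print(p_hat, bound)                               # p_hat should not exceed bound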

Notice that E[|X_n|] is the sum of all possible positive AND negative values, in absolute value, times their probabilities (shamelessly citing Wikipedia here). Now, if X_n were equally likely to take any value symmetrically around zero, then E[X_n] would be 0. Supporting myself on |E[X]| ≤ E[|X|], and on the fact that in a martingale |E[X_n]| = |X_0| is nonzero whenever X_0 is nonzero, it must hold that the probabilities of X_n taking values symmetric around zero are not all the same, unless (possibly) X_0 was 0 (interestingly, I have not found explicit restrictions on the value of X_0). If you wanted to be quick and lazy, you could use |E[X_n]|/λ = |X_0|/λ as an estimate instead: if that scenario does not look so bad and you have no other information, you may as well wade into it with care. You may optionally scream “YOLO” for courage.

Azuma’s bound:

This bound can be found in other textbooks as Hoeffding’s inequality. Let X_0, X_1,… be a martingale sequence such that for each k:

|X_k – X_{k-1}| ≤ c_k

where the bound c_k may depend on k, but not on the outcomes of the process. Then, for all t≥0 and any λ>0,

Pr[|X_t - X_0| ≥ λ] ≤ 2e^( -λ^2 / (2 * ∑ (from k=1 to k=t) (c_k)^2 ) )

This bound for martingales resembles the exponential bounds of Chernoff’s inequality. It basically tells us that the probability of X deviating more than λ from its initial value by bet t is bounded by an exponential function, given that the successive differences at each bet are less than or equal to a certain quantity c_k. The process of applying this bound to a martingale sequence is also known as the method of bounded differences.
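For instance, with unit bets (c_k = 1 for all k) over t = 100 bets, the probability of drifting more than λ = 30 away from the initial capital is at most:

import numpy as np

t, lam = 100, 30.0
c = np.ones(t)                                    # |X_k - X_{k-1}| <= 1
bound = 2 * np.exp(-lam**2 / (2 * np.sum(c**2)))
print(bound)                                      # ≈ 0.022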

To apply any of the former methods you would, of course, have to do your homework and find out as much information as you can about the system from which the martingale is generated. Sometimes it can be useful to build the martingale from a relaxation of the system, make your calculations over it, and check that your relaxation is harsher than the system’s likely behaviour. And there are many other useful bounds for martingales that are not in this post. In other words, if you can model a system’s behaviour with martingales, you can get a much better idea of the system without necessarily relying on experimental data (see Polya’s urn problem in our next post, “Martingales are Awesome”).


Most information in this post was taken from chapter 4 of the Randomized Algorithms book by R. Motwani and P. Raghavan, and from chapters 7 and 12 of the Probability and Random Processes book by Grimmett and Stirzaker.

Book Chapter Review: If your model is mis-specified, are you better off?

https://stocksnap.io/photo/HN6OJPDCXD

This post is my interpretation of Chapter 10 of the book “Advanced Data Analysis from an Elementary point of view“. It is one of the most interesting reads I have found in quite some time (together with this).

Actually, the original title for the post was “Book Chapter review: Using non-parametric models to test parametric model mis-specifications”. But shorter titles tend to attract more viewers.

The first question that one might ask is “If we are going to use a non-parametric model to test a parametric model, why not go straight to non-parametric modelling instead?”. Well, there are advantages to having a parametric model if you can build one. It could be of interest for your application to express your process in terms of known mathematical models. Also, a well specified parametric model can converge faster to the true regression function than a non-parametric model. Finally, if you only have a small number of samples, you could get better predictions by using a parametric model (even if slightly mis-specified). The reason is simply that parametric models tend to have a significantly smaller bias than non-parametric models; also, for the same number of samples, a well specified parametric model is likely to have less variance in its predictions than a non-parametric model. Now, if you have a cheap and fast way to obtain many more samples, a non-parametric model can make better predictions than a mis-specified parametric model.

This is a consequence of the bias-variance trade-off that the author explains in a way that a person without a background in statistics can understand (in chapter 1.1.4 of the same book).

Non-parametric models “keep an open mind” when building a model, while parametric models “follow a hunch”, if you will. One can find some similarities between modelling (parametric vs non-parametric) and search algorithms, more specifically between uninformed search (BFS, DFS, iterative deepening and variants) and informed search (say, A* and variants). Even with a relaxed (and admissible) version of the optimal heuristic, A* is expected to traverse a shorter path than any uninformed search algorithm. However, it will traverse a longer path if the heuristic is poorly constructed, and will most likely be outperformed by uninformed search. With this in mind, another question that may tickle you: can one translate a modelling problem into a search problem and have a machine automatically and optimally find the best parametric model for a given problem? Oh, yes (you can leave that as background music for the day if you like). You will of course need to express the problem in a way that the program terminates before our sun dies, and an admissible heuristic also helps. But yeah, you can do that. Humans often solve harder problems than they give themselves credit for.

If you were to build such a machine, the author can give you some suggestions to check for errors in your model. For instance, he suggests that if you have a good reason to think that the errors in a model can ONLY come from certain mis-specifications (say, that instead of being Y = θ1 X + ε it can only be Y = θ1 X + θ2 X^2 + ε, or maybe a couple of other forms), then it will probably be faster and less sample-hungry for you to simply check whether the estimated θ2 is significantly different from 0, or whether the residuals from the second model are significantly smaller than those from the first. However, when no good reason is available to argue for one or another source of mis-specification, you can use non-parametric regression to check for all sources by doing any of the following:

  • If the parametric model is right, it should predict as well as, or even better than, the non-parametric one. So, you can check whether the difference between the mean squared errors of the two estimators is small enough.
  • If the parametric model is right, the non-parametrically estimated regression curve should be very close to the parametric one. So, you can check that the distance between the two regression curves is approximately zero at all points.
  • If the parametric model is right, then its residuals should be patternless and independent of the input features. So, you can apply non-parametric smoothing to the parametric residuals and see whether their expectation is approximately zero everywhere.

For the last method, I can elaborate on the explanation from the author. If the residuals are Y-f(x;θ), then the expected value of the residuals given an input X is E[Y-f(x;θ)|X] (the author did not make it explicit, but I assume that the residuals must be calculated with x ⊆ X. I could be wrong on this, so please correct me if I am). Now, our typical regression model being something along the lines of Y=f(X;θ)+ε, we substitute it in the expectation and end up with E[f(x;θ)+ε-f(x;θ) | X]. In this expression, f(x;θ)+ε-f(x;θ) = ε, so you end up with E[ε|X]. Since the noise term is assumed independent of X, the expression becomes E[ε|X] = E[ε], and since the noise is assumed to have mean zero, E[ε] = 0. So no matter what input X we have, the expected value of the residuals of a well specified predictor should be approximately equal to zero.
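A sketch of that last check on a toy dataset, with a simple moving average standing in for a proper non-parametric smoother:

import numpy as np

# Toy data from the "true" model Y = 2X + ε (an assumption for illustration).
x = np.linspace(0, 10, 200)
y = 2 * x + np.random.normal(0, 1, size=x.size)

theta = np.polyfit(x, y, deg=1)          # fit the parametric model Y = θ1·X + θ0
residuals = y - np.polyval(theta, x)

# Non-parametric smoothing of the residuals (moving average, window of 20 samples).
smooth = np.convolve(residuals, np.ones(20) / 20, mode='same')
print(np.abs(smooth).max())              # should be close to zero everywhere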

So yes, under some circumstances (too few samples) you can actually be better off with a slightly mis-specified (i.e. relaxed) model than with a full non-parametric model. And yes, you can indeed check whether the assumptions behind your model are actually valid.

Have fun, and see you in the next post!