Columbia Interview Question

Introduction

I once heard two people at my university discussing a math problem. They claimed that a friend of theirs was asked, during a Skype interview for admission to Columbia's Computer Science PhD program, the following question:

"Suppose the following experiment: we draw random samples, one at a time, from a uniform distribution. The experiment ends when we draw a sample that was greater from the last. What is the expected time $n$ that the experiment will run for?"

Now, this is a very interesting question. On the surface, it looks simple. For simplicity, let's assume that the probability distribution is $\mathcal{U}(0, 1)$. Essentially, this means that at each step we are picking a random number in the range $[0, 1]$. The mean of this distribution is $\mu = 0.5$, so if we choose a sample at random, we expect it, on average, to land around $0.5$.

Figure 1: Expected placing for the first sample based on the mean $\mu$.
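As a quick sanity check (purely illustrative), we can draw a large number of samples from $\mathcal{U}(0, 1)$ and confirm that their average is indeed close to $0.5$:

import random

# The average of many U(0, 1) samples should be close to the mean 0.5.
samples = [random.random() for _ in range(1000000)]
print(sum(samples) / len(samples))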

So what about the second sample we draw? Well, since it is uniformly random, it has a $50\%$ chance of being greater than the mean $\mu$ and a $50\%$ chance of being smaller.

Figure 2: Two possible placings for the second sample.

It seems quite likely that the experiment could end at $n = 2$, right? And even if it doesn't end at the second sample, it seems very likely to end, on average, at the third. This is because, since $x_2$ is smaller than $x_1$, the values of $x_3$ that would allow the experiment to continue are even fewer.

Figure 3: Possible placings for the third sample.


To test our theory, we can write a small Python program that performs a large number of such experiments and reports the results:
import random

t = 1000000  # number of experiments to run

winning = []
for idx in range(t):
    prev = random.random()   # first sample
    nxt = random.random()    # second sample
    counter = 2
    # Keep drawing while the new sample is not greater than the previous one.
    while prev >= nxt:
        prev = nxt
        nxt = random.random()
        counter += 1
    winning.append(counter)  # record the draw at which this experiment ended

print("Average: " + str(sum(winning) / t))
In the above code, we run $t = 1000000$ iterations of the experiment and store in a list the value $n$ at which each experiment ended. Running this code, we get an average value of $2.719059$. Our intuition was correct! Many of the experiments end at the second draw, and very few go on past the third. So why is this question interesting? Because deriving the result formally, and providing an exact answer, is a lot more intricate.

Formalising the problem

Let us model the experiment using a random variable $X$ that denotes the number of draws after which the experiment ends. Since we need to draw at least two samples for the experiment to end, it follows that $n \geq 2$. The answer to the question is essentially the expected value of $X$

$$ E[X] = \sum_{n=2}^{\infty} n \cdot P[X = n] $$

Let's examine what exactly the value of $P[X = n]$ is. To do that, let's denote the samples as $x_1, ..., x_n$. Then, the sample $x_2$ needs to be smaller than $x_1$, $x_3$ needs to be smaller than $x_2$, and so on, with $x_n$ being greater than $x_{n-1}$. Formally

$$ \begin{align} P[X = n] &= P[x_n > x_{n-1}, x_{n-1} < x_{n-2}, x_{n-2} < x_{n-3}, ..., x_{2} < x_{1}]\\ &= P[x_n > x_{n-1}, x_{n-1} < x_{n-2} < x_{n-3} < ... < x_{2} < x_{1}] \end{align} $$

since we need the first $n-1$ samples to be in descending order, and the $n$-th sample to be greater than the last of them. Still, this probability seems incomputable. We are going to change that. Using the Law of Conditional Probability, we can break the above probability into two parts

$$ \begin{align} P[X = n] &= P[x_n > x_{n-1}, x_{n-1} < x_{n-2} < x_{n-3} < ... < x_{2} < x_{1}]\\ &= P[x_n > x_{n-1} | x_{n-1} < x_{n-2} < x_{n-3} < ... < x_{2} < x_{1}] \cdot P[x_{n-1} < x_{n-2} < x_{n-3} < ... < x_{2} < x_{1}] \end{align} $$

Now we have two probabilities to compute; however, they are significantly easier to compute.

The second factor is the probability that $n-1$ independent samples are drawn in one particular order. Since all $(n-1)!$ orderings of continuous i.i.d. samples are equally likely,

$$ P[x_{n-1} < x_{n-2} < ... < x_{2} < x_{1}] = \frac{1}{(n-1)!} $$

For the first factor, the conditioning event makes $x_{n-1}$ the smallest of the first $n-1$ samples, so $x_n > x_{n-1}$ simply means that $x_n$ is not the minimum of all $n$ samples. By symmetry, each of the $n$ samples is equally likely to be the minimum, hence

$$ P[x_n > x_{n-1} | x_{n-1} < x_{n-2} < ... < x_{2} < x_{1}] = 1 - \frac{1}{n} = \frac{n-1}{n} $$

Combining these two results, we can now compute the original probability

$$ P[X = n] = \frac{n-1}{n} \cdot \frac{1}{(n-1)!} = \frac{1}{n} \cdot \frac{1}{(n-2)!} $$

and then the expected value of the random variable $X$ is

$$ E[X] = \sum_{n=2}^{\infty} n \cdot P[X = n] = \sum_{n=2}^{\infty} n \cdot \frac{1}{n} \cdot \frac{1}{(n-2)!} = \sum_{n=2}^{\infty} \frac{1}{(n-2)!} $$

Let us perform a change of variable, so that the above expression is easier to recognise. Let $k = n-2$; then the above becomes

$$ E[X] = \sum_{k=0}^{\infty} \frac{1}{k!} = e $$

where $e$ is the mathematical constant.
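As a final sanity check (purely illustrative, reusing the simulation idea from earlier), we can compare the closed-form probability $P[X = n] = \frac{1}{n \cdot (n-2)!}$ with the simulated frequencies, and watch the partial sums of $n \cdot P[X = n]$ approach $e$:

import math
import random

# Closed-form probability derived above: P[X = n] = 1 / (n * (n - 2)!)
def p_end_at(n):
    return 1.0 / (n * math.factorial(n - 2))

# One run of the experiment: return the draw at which it ends.
def run_experiment():
    prev = random.random()
    nxt = random.random()
    counter = 2
    while prev >= nxt:
        prev, nxt = nxt, random.random()
        counter += 1
    return counter

t = 1000000
results = [run_experiment() for _ in range(t)]

# Compare the formula with the empirical frequencies for small n.
for n in range(2, 7):
    print(n, p_end_at(n), results.count(n) / t)

# The partial sums of n * P[X = n] converge to e.
print(sum(n * p_end_at(n) for n in range(2, 20)), math.e)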

Discussion

I find this problem very interesting for a few reasons:

1:  It is not trivial to understand or to prove. The intuition is that the $n-1$ samples we have already drawn split the interval into $n$ bins, and another sample taken from the same uniform distribution is equally likely to fall in each of those bins. This can be proven formally using mathematical induction, but it is cumbersome (a quick empirical check is sketched below).
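The check (purely illustrative): draw $n-1$ uniform samples, sort them, draw one more, and record which of the $n$ bins it lands in. Over many trials, each bin is hit roughly $\frac{1}{n}$ of the time.

import bisect
import random

n = 5           # the first n - 1 samples create n bins
trials = 1000000
hits = [0] * n

for _ in range(trials):
    existing = sorted(random.random() for _ in range(n - 1))
    new_sample = random.random()
    # The number of existing samples below new_sample tells us which bin it hit.
    hits[bisect.bisect(existing, new_sample)] += 1

print([h / trials for h in hits])   # each entry should be close to 1/n = 0.2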