Numpy programming challenge

  |   Source

In Coursera's Computational Investing Prof. Tucker Balch, who was always looking for ways to make the course more interesting, proposed the following challenge:

Write the most succinct NumPy code possible to compute a 2D array that contains all "legal" allocations to 4 stocks:

  • "Legal" means: The allocations are in 0.10 chunks, and the allocations sum to 1.00
  • Only "pure" NumPy is allowed (no external libraries)
  • Can you do it without a "for"?"

so we had to generate an array like this one:

    [ [ 0.0,  0.0,  0.0,  1.0],
      [ 0.0,  0.0,  0.1,  0.9],
      [ 0.0,  0.0,  0.2,  0.8], #<---the sum(row) should be 1
      [ 0.0,  0.0,  0.3,  0.7],

I love both challenges and Python so I immediately set about to try and solve it. Prof. Balch proposed a "brute force approach" as a starting point, because he "wanted folks to have something to beat :-).".

In [83]:
# solution by Tucker Balch (brute force approach)

import numpy as np
allocs = np.zeros((10000,4))
n = 0
for i in range(0,11):
    for j in range(0,11):
        for k in range(m
            for l in range(0,11):
                row = [i/10.0,j/10.0,k/10.0,l/10.0]
                if sum(row)==1.0:
                    n = n+1
allocs = allocs[0:n,:]
1 loops, best of 3: 284 ms per loop

As the first solutions quickly started showing up, I decided to collected them and benchmark them for comparison. In retrospective I might have twisted the original challenge a bit, as the goal changed from conciseness to performance, but everybody seemed to play along and I think in the end, it made the challenge more interesting.

My first solution was very concise but I wasn't playing by the rules (no numpy) and also it was almost as slow as the brute force approach:

In [12]:
allocations = list(ifilter(lambda x: abs(sum(x)-1)<1e-8, product(linspace(0, 1, 11), repeat=4)))
1 loops, best of 3: 282 ms per loop

After some further testing I found a quite concise and fast numpy solution I was really happy with:

In [70]:
from numpy import linspace, rollaxis, indices

x = linspace(0,1,11)[rollaxis(indices((11,) * 4), 0, 4+1).reshape(-1, 4)]
valid_allocations = x[abs(x.sum(axis=1)-1)<1e-8]
1000 loops, best of 3: 1.29 ms per loop

Being over 200 times faster than Prof. Tucker solution, only two lines long and entirely numpy based (numpy is blazing fast right?) I thought this would be quite difficult to beat. But shortly aftwerwards the following happened:

Pawel Kozak (recursive no numpy)
In [69]:

def y(last, i=0, j=0, k=0, l=0):
    if len(a)==0:
        a.append([10,0, 0, 0])
        a.append([last[0]+i,last[1]+j,last[2]+k, last[3]+l])
    last = a[-1][:]
    if last[0]>0:
        y(last, i=-1, j=1, k=0, l=0)
        if last[1] ==0:
            y(last, i=-1, j=0, k=1, l=0)
            if last[2] ==0:
                y(last, i=-1, j=0, k=0, l=1)

g = [[a[j][i]/10.0 for i in range(4)] for j in range(len(a))]
1000 loops, best of 3: 701 us per loop

A recursive pure Python solution almost 2 times faster than numpy! Certainly my numpy algorithm wasn't being optimal. This solution remained in the top spot for some hours until Sergio Antoy came up with a cryptic C-like solution:

Sergey Antoy (pure Python)
In [16]:
a = [] 
i,j,k = 0,1,2

while True:
    # print [i,j-i-1,k-j-1,12-k]
    if i < j-1:
        i += 1
    elif j < k-1:
        i,j = 0,j+1
    elif k < 12:
        i,j,k = 0,1,k+1
1000 loops, best of 3: 213 µs per loop

Sergio explained later that the algorithm was based on the "The problem of the Dutch National Flag" featured by E. Dijkstra on A Discipline of Programming (Disclaimer: no affiliate link). Now this seemed to be the end of the competition. After all what can be faster than a textbook algorithm by the great Dijkstra? Well, a simple nested for loop:

Nick Poplas (pure Python)
In [18]:
allocations = [(0.1 * a, 0.1 * b, 0.1 * c, 0.1 * (10 - a - b - c)) for a in range(11) for b in range(11 - a) for c in range(11 - a - b)]
10000 loops, best of 3: 135 µs per loop

only 135 microseconds!! This is more than 2000 times faster than the initial solution! It seems that Python loops are not so slow after all. Also the simplicity of the solution was mind blowing.

At this point someone posted (as a joke) the fastest possible solution, i.e. directly assigning the pre-calculated result to a variable name,

In [13]:
allocations = [ [ 0.0,  0.0,  0.0,  1.0],
                [ 0.0,  0.0,  0.1,  0.9],
                [ 0.0,  0.0,  0.2,  0.8],
                [ 0.0,  0.0,  0.3,  0.7],
                #          ...
                [ 0.9,  0.1,  0.0,  0.0],
                [ 1.0,  0.0,  0.0,  0.0]]
10000 loops, best of 3: 64 µs per loop

so now we knew how close we had gotten the theoretical limit. Even though I felt there would be little chance to improve such a simple solution I kept trying. I imagined that only numpy's C optimized loops could be able to beat Python loops, but how to formulate the problem in a column oriented approach? And then I found this:

Yaser Martinez (numpy)
In [20]:
#from numpy import indices, vstack

x1_3 = indices((11,11,11)).reshape(3,1331)        # first three columns
x4 = 10 - x1_3.sum(axis=0)                        # 4th column
valid_idx = x4 > -1                               # filter out negative values
allocations = 0.1*vstack((x1_3, x4)).T[valid_idx] # join columns, reshape and scale   
10000 loops, best of 3: 113 µs per loop

Pure numpy power! With only 113 microseconds this was fastest solution and also the only one (from the top ranked) that is 100% compliant with all the rules, i.e: returns a numpy array without using for loops. Some time later I found that if one replaces range by xrange in Nick Poplas solution, avoiding the creation of intermediate lists, the solution also runs (coincidence?) in 113 microseconds.

Comments powered by Disqus