Creating super small docker images

Docker containers are lighter than virtual machines, but in many cases images are far bigger than they ought to be. For example, the official Docker Python image is approximately 900 MB in size, and that is only the Python runtime with no external libraries installed.

Python itself is not small: a typical Python installation needs close to 100 MB once uncompressed on disk. Of course, one could imagine that there are many files included that aren't needed in most cases (like the turtle module). Is it possible to create a smaller Python Docker image?

Read more…

Beating the benchmark with one line of bash

According to Kaggle, R, Matlab and Python are the favorite languages among competition winners. Is it too crazy to try to win a Kaggle competition using bash? It probably is: bash doesn't even support floating point arithmetic! Still, it is possible to beat the benchmark, and in fact this might be a record for the shortest BtB ever:

Read more…

Predicting hispanic origin using names

How much can your name tell about you? I was curious to find out whether it is possible to predict the ethnicity of a person just by looking at their name. I know a lot about Hispanic names, so let's try to make a model that predicts Hispanic origin.

Read more…

Brownian simulation of correlated assets

When using Monte Carlo methods to price options dependent on a basket of underlying assets (multidimensional stochastic simulations), the correlations between assets should be considered. Here I will show an example of how this can be simulated using pandas.
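As a rough illustration of the idea, here is a minimal sketch that assumes geometric Brownian motion and uses a Cholesky factor of the correlation matrix to correlate the random shocks. The asset names, drifts, volatilities, starting prices and the 0.8 correlation are made-up values for the example, and this is not necessarily the approach taken in the full post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical inputs, chosen only for illustration
assets = ["A", "B"]
n_steps = 252                      # one year of daily steps
dt = 1.0 / n_steps
corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])      # target correlation between the shocks
mu = np.array([0.05, 0.03])        # annual drifts
sigma = np.array([0.20, 0.30])     # annual volatilities
s0 = np.array([100.0, 50.0])       # starting prices

# Lower-triangular factor L with L @ L.T == corr
chol = np.linalg.cholesky(corr)

# Independent standard normal shocks, one row per time step
z = rng.standard_normal((n_steps, len(assets)))

# Mixing the columns with L gives shocks whose covariance is corr
correlated = z @ chol.T

# Geometric Brownian motion in log space, then back to prices
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * correlated
paths = pd.DataFrame(s0 * np.exp(np.cumsum(log_returns, axis=0)), columns=assets)

print(paths.tail())
```

Repeating the simulation many times and averaging the discounted payoff would give a Monte Carlo price; the sketch only produces a single path per asset.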

Read more…

Stack Exchange tag predictions

I recently had the chance to participate in a Kaggle competition. Because it was my first one, I made a lot of mistakes which ended up costing me a lot of time. I had actually started to play with the Titanic challenge before, but my motivation was lost when I saw some competitors with close to 100% precision on the leaderboard. This ruined my competitive spirit, as there was no way to measure the achievable accuracy for the competition.

The formulation of the problem was actually quite simple: predict the tags that a user would choose for a specific Stack Exchange question given only the question text and its title. Participants were given a training dataset of approx. 6 million rows with tags for each question, and a test dataset of approx. 2 million rows without tags, for which the predictions were to be made.

Read more…

List creation performance in Python

What is the fastest way to create an initialized list in Python? I always wondered about it, so when the question was asked on Stack Overflow I saw a chance to measure it once and for all.
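A measurement of this kind could look roughly like the sketch below, which times three common ways of building a zero-filled list with the standard `timeit` module. The function names, the list size and the repetition count are assumptions made for the example, not the ones used in the full post, and the timings will vary by machine and Python version.

```python
import timeit

N = 10_000  # hypothetical list size


def by_multiplication(n=N):
    # Repeat a one-element list n times
    return [0] * n


def by_comprehension(n=N):
    # Build the list element by element in a comprehension
    return [0 for _ in range(n)]


def by_append(n=N):
    # Grow the list with explicit append calls
    result = []
    for _ in range(n):
        result.append(0)
    return result


def benchmark(number=100):
    # Total seconds for `number` calls of each candidate
    candidates = (by_multiplication, by_comprehension, by_append)
    return {f.__name__: timeit.timeit(f, number=number) for f in candidates}


for name, seconds in benchmark().items():
    print(f"{name}: {seconds:.4f} s")
```

Note that `[0] * n` is only safe for immutable elements; with a mutable element every slot would refer to the same object.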

Read more…

Web Scraping: III

This is the third post in a series dedicated to web scraping.

Some websites implement measures against scraping. A common reason is to protect the data: for example, site scraping is sometimes used to gather product and pricing data in order to undercut rivals’ prices for goods and services (the Ryanair case).

In this post I will show a technique that can be used to bypass cookie-based protections. As an example, we are going to download data from Kaggle (http://www.kaggle.com). This is useful if you want, for example, to download a big data file directly to Amazon EC2. Kaggle only allows browser-based downloads, because you have to accept the competition rules before getting access to the data.

Read more…

Web Scraping: II

This is the second post in a series dedicated to web scraping.

In some cases the data we see in the browser is generated after the page has been downloaded. For example, Gmail servers don't send the whole page source code each time a new email comes in. In this case JavaScript is responsible for properly updating the visible website (the DOM).

This is exactly what we see when we use "Inspect element" in Google Chrome or Safari. In contrast, if we use the "View Source" tool, we get the original HTML that was downloaded from the server.

Read more…

Web scraping: I

This is the first post in a series dedicated to web scraping.

There are at least three ways to analyze the HTML from a page. To scrape fairly regularly formatted data from large documents, a regular expression is the right solution and will be faster than a generic parser. But bear in mind that HTML is not always well constructed and, as the name suggests, regular expressions are designed to deal with regular structures.
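To make the regular expression case concrete, here is a small sketch using only the standard `re` module. The HTML snippet, the ticker symbols and the prices are invented for the example; the pattern only works because every row has exactly the same shape, which is the "regular structure" caveat mentioned above.

```python
import re

# Invented, regularly formatted HTML: every row has the same layout
html = """
<table>
<tr><td>AAPL</td><td>101.5</td></tr>
<tr><td>GOOG</td><td>540.25</td></tr>
</table>
"""

# One capture group per cell: a symbol, then a decimal number
ROW = re.compile(r"<tr><td>(\w+)</td><td>([\d.]+)</td></tr>")

rows = [(symbol, float(price)) for symbol, price in ROW.findall(html)]
print(rows)  # [('AAPL', 101.5), ('GOOG', 540.25)]
```

A row with an extra attribute or stray whitespace inside the tags would silently be skipped by this pattern, which is where a real HTML parser becomes the safer choice.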

Read more…

Numpy programming challenge

In Coursera's Computational Investing course, Prof. Tucker Balch, who was always looking for ways to make the course more interesting, proposed the following challenge:

Write the most succinct NumPy code possible to compute a 2D array that contains all "legal" allocations to 4 stocks:

  • "Legal" means: The allocations are in 0.10 chunks, and the allocations sum to 1.00
  • Only "pure" NumPy is allowed (no external libraries)
  • Can you do it without a "for"?
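One possible answer, given here only as a sketch and not necessarily the solution from the full post, is to enumerate every combination of tenths with `np.indices` and keep the rows that add up to ten. Working in integer tenths avoids comparing floating point sums against 1.00.

```python
import numpy as np

# All combinations of 0..10 tenths for each of the 4 stocks: shape (11**4, 4)
grid = np.indices((11, 11, 11, 11)).reshape(4, -1).T

# Keep rows whose tenths sum to 10, then convert tenths to fractions
allocations = grid[grid.sum(axis=1) == 10] / 10.0

print(allocations.shape)  # (286, 4)
```

The 286 rows match the count of non-negative integer solutions to a + b + c + d = 10, which is C(13, 3).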

Read more…