Beating the benchmark with one line of bash


According to Kaggle, R, Matlab and Python are the favorite languages among competition winners. Is it too crazy to try to win a Kaggle competition using bash? It probably is: bash doesn't even support floating point arithmetic! Still, it is possible to beat the benchmark, and in fact this might be a record for the shortest BtB ever:

{ echo "Event,WhiteElo,BlackElo";paste -d, <(seq 25001 50000) <(yes $(yes $(grep "Elo" data.pgn | cut -f2 -d'"' | sort | tail -n $(grep "WhiteElo" data.pgn | wc -l) | head -n 1) | head -n 2 | paste -sd, -) | head -n 25000); } > submission.csv

No idea what this does? Read on!

The competition

In the Finding Elo competition you are asked to predict the Elo ratings of chess players by looking at their games:

A total of 50,000 games are provided in portable game notation (pgn) format. Each game in the training set (the first 25,000 games) has both the white and black Elo rating. Each game in the test set (the latter 25,000 games) omits the Elo ratings, which you must predict. Player ids were scrubbed from the data. The goal is to predict the Elo rating based on only a single game.

A typical game looks like this:

[Event "1"]
[Site "kaggle.com"]
[Date "??"]
[Round "??"]
[White "??"]
[Black "??"]
[Result "1/2-1/2"]
[WhiteElo "2354"]
[BlackElo "2411"]
1. Nf3 Nf6 2. c4 c5 3. b3 g6 4. Bb2 Bg7 5. e3 O-O 6. Be2 b6 7. O-O Bb7 8. Nc3 Nc6 9. Qc2 Rc8 10. Rac1 d5 11. Nxd5 Nxd5 12. Bxg7 Nf4 13. exf4 Kxg7 14. Qc3+ Kg8 15. Rcd1 Qd6 16. d4 cxd4 17. Nxd4 Qxf4 18. Bf3 Qf6 19. Nb5 Qxc3
1/2-1/2

Beating the benchmark

The organizers provided an (admittedly easy to beat) benchmark based on the mean of all games played. But if we look at a histogram of the ratings, we see that the distribution is slightly skewed toward higher Elos.

This suggests that the median should represent the data better than the mean. That is good news, as the median is easier to calculate using bash.
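
To make the mean-versus-median point concrete, here is a minimal sketch on five made-up ratings (toy values, not taken from the competition data). One low rating drags the integer mean well below the bulk of the sample, while the median stays in the middle of it. The variable names are only for illustration.

```bash
# Toy sample: made-up ratings, NOT from the competition data
ratings="1200 2200 2300 2350 2400"

# Integer mean with shell arithmetic (no floating point needed)
sum=0
count=0
for r in $ratings; do
  sum=$((sum + r))
  count=$((count + 1))
done
mean=$((sum / count))

# Median: sort numerically, then print the middle line
middle=$(( (count + 1) / 2 ))
median=$(printf '%s\n' $ratings | sort -n | sed -n "${middle}p")

echo "mean=$mean median=$median"   # prints: mean=2090 median=2300
```

Four of the five values are above 2200, so 2300 describes the sample better than 2090 does.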

Cracking the code

First we need to extract the Elo ratings:

$ grep "Elo" data.pgn | cut -f2 -d'"' | head
2354
2411
2523
2460
...
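
To see why this pipeline works, here is a small sketch that runs it on a hypothetical two-game sample written to a temporary file (the sample content and the `sample` variable are assumptions for illustration, not the real data.pgn). `grep` keeps only the tag lines containing "Elo", and `cut` splits each line on double quotes and keeps the second field, which is the rating itself.

```bash
# Hypothetical two-game sample, trimmed to a few tags (not the real data.pgn)
sample=$(mktemp)
cat > "$sample" <<'EOF'
[Event "1"]
[Result "1/2-1/2"]
[WhiteElo "2354"]
[BlackElo "2411"]
[Event "2"]
[Result "1-0"]
[WhiteElo "2523"]
[BlackElo "2460"]
EOF

# Field 1 is '[WhiteElo ', field 2 is the rating, field 3 is ']'
elos=$(grep "Elo" "$sample" | cut -f2 -d'"')
echo "$elos"
# prints:
# 2354
# 2411
# 2523
# 2460

rm -f "$sample"
```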

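Now for the median we sort the list and take the value in the middle. The training set has two ratings per game, so the middle index of the combined list equals the number of games, which we get by counting the WhiteElo lines: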

$ sorted_list=$(grep "Elo" data.pgn | cut -f2 -d'"' | sort)
$ median_idx=$(grep "WhiteElo" data.pgn | wc -l)
$ echo $median_idx
25000

We can now retrieve the Elo at this index (the variable must be quoted, otherwise echo joins all the values onto a single line):

$ echo "$sorted_list" | tail -n $median_idx | head -n 1
2270

This seems to be right according to the histogram.
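
The tail/head combination can be checked on a tiny list. The sketch below uses six made-up ratings and a hypothetical `half` variable standing in for median_idx. With 2N sorted values, `tail -n N | head -n 1` returns element N+1, the upper of the two middle values. It also shows why the variable needs quotes when it is echoed. `sort -n` is used here so the order stays numeric even if ratings differ in digit count; a plain `sort` gives the same order when every rating has four digits.

```bash
# Six made-up ratings standing in for the 50,000 real ones
sorted_list=$(printf '%s\n' 2411 2354 2523 2460 2270 2300 | sort -n)
half=3   # plays the role of median_idx: half the number of values

# Quoted: newlines survive, so tail and head see one value per line
median=$(echo "$sorted_list" | tail -n "$half" | head -n 1)
echo "$median"          # prints: 2411

# Unquoted: word splitting puts every value on one single line
joined_lines=$(echo $sorted_list | wc -l)
echo "$joined_lines"    # prints: 1 (some systems pad it with spaces)
```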

The final step is creating the submission file. It must have the following format:

Event,WhiteElo,BlackElo
25001,0,0
25002,0,0
25003,0,0
...
49999,0,0
50000,0,0

So we need to repeat the value 2270 twice per row (once for WhiteElo and once for BlackElo) and then repeat that row 25000 times. For this we use a nice trick:

$ yes 2270 | head -n 3
2270
2270
2270
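
The same trick builds a whole row. As a short sketch (with `median` hard-coded to the value found earlier and only three rows instead of 25000): `paste -s` joins its input lines into one line, and `-d,` makes the comma the separator.

```bash
median=2270   # value found earlier, hard-coded for this sketch

# Two copies of the median, joined with a comma into one row
row=$(yes "$median" | head -n 2 | paste -sd, -)
echo "$row"             # prints: 2270,2270

# Repeat the row; 3 rows here instead of 25000
rows=$(yes "$row" | head -n 3)
echo "$rows"
```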

The rest is some yes-paste magic and some process substitution to put it all together. This is the same code in a more readable way:

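# All ratings (white and black) from the training games
elos=$(grep "Elo" data.pgn | cut -f2 -d'"')
# Quote variables when echoing so that newlines are preserved
sorted_list=$(echo "$elos" | sort)
median_idx=$(grep "WhiteElo" data.pgn | wc -l)
median=$(echo "$sorted_list" | tail -n $median_idx | head -n 1)
submission_ids=$(seq 25001 50000)
# One row: the median twice, joined by a comma
row=$(yes "$median" | head -n 2 | paste -sd, -)
columns=$(yes "$row" | head -n 25000)
submission=$(paste -d, <(echo "$submission_ids") <(echo "$columns"))
# A command group needs "; " before the closing brace
{ echo "Event,WhiteElo,BlackElo"; echo "$submission"; } > submission.csv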
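
Before uploading, a quick sanity check of the file shape is cheap. The sketch below is an assumption about what is worth checking, not part of the original recipe: it builds a five-row stand-in file in a temporary location and verifies the header, the line count, and that every line has three comma-separated fields. The stand-in is generated with awk rather than process substitution so that the snippet also runs under plain sh. Pointing `out` at the real submission.csv (and expecting 25001 lines) would apply the same checks to the real file.

```bash
# Small-scale stand-in: 5 rows instead of 25000, in a temp file
out=$(mktemp)
median=2270
{
  echo "Event,WhiteElo,BlackElo"
  seq 25001 25005 | awk -v m="$median" '{ print $0 "," m "," m }'
} > "$out"

header=$(head -n 1 "$out")
lines=$(wc -l < "$out")
# Count lines that do NOT have exactly three comma-separated fields
bad_rows=$(awk -F, 'NF != 3' "$out" | wc -l)

echo "header=$header"
echo "lines=$lines bad_rows=$bad_rows"
rm -f "$out"
```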