Book: Practical Feature Engineering (WIP)
The process of feature engineering is as much an art as a science, and it is probably the least well-studied part of building prediction models. In this book I plan to fill that gap in the literature by compiling many examples of features I have found useful in real applications and Kaggle competitions.
A small Python library that extracts country and city mentions from text. See the code.
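The core of such a library is gazetteer lookup: match known place names against the text and report what kind of place each one is. A minimal sketch, with an invented four-entry gazetteer standing in for a real one (e.g. derived from GeoNames):

```python
import re

# Hypothetical tiny gazetteer; a real library would ship a much larger one.
GAZETTEER = {
    "Paris": ("city", "France"),
    "Berlin": ("city", "Germany"),
    "France": ("country", None),
    "Germany": ("country", None),
}

# One alternation pattern, longest names first, so in a bigger gazetteer
# "New York City" would win over "New York".
_pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, GAZETTEER), key=len, reverse=True)) + r")\b"
)

def extract_places(text):
    """Return (name, kind, country) tuples for every gazetteer hit in text."""
    return [(m.group(1), *GAZETTEER[m.group(1)]) for m in _pattern.finditer(text)]
```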
Minimal Python docker images
Realtime topic modeling of Wikipedia edits
This was a fun weekend project. A Python IRC bot connects to the Wikipedia edits channel and listens in real time to the changes being broadcast. The messages are parsed and the contents of the corresponding target URLs are downloaded. The raw text is analyzed with an LDA model pretrained on the complete English Wikipedia dump, and the author's location is geolocated and saved to MongoDB. Aggregated statistics are then sent to the front end in real time, all thanks to the magic of Meteor. You can see it here (UI only; currently no updates, as the server is stopped).
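The parsing step can be sketched with a regular expression. The real feed wraps its messages in mIRC color codes; this sketch assumes those have already been stripped, leaving the simplified format shown in the comment:

```python
import re

# One line per edit, assumed (after stripping color codes) to look like:
#   [[Page title]] flags url * user * (+123) comment
EDIT_RE = re.compile(
    r"\[\[(?P<title>[^\]]+)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S+)"
    r"\s+\*\s+(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s*(?P<comment>.*)"
)

def parse_edit(line):
    """Parse one edit message into a dict of fields, or None if it doesn't match."""
    m = EDIT_RE.match(line)
    return m.groupdict() if m else None
```

Each parsed dict then carries the URL to download and the user to geolocate.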
Technologies used: Meteor, MongoDB, gensim.
Thomson Reuters Eikon Text Tagging Challenge
In this challenge, Thomson Reuters was looking for an algorithm to accurately tag incoming news items by relevance to the companies or organizations mentioned in them. I built a system capable of recognizing alternative company names (using DBpedia data), identifying companies by stock ticker (Bloomberg Open Symbology data), and discriminating by country mentions in the news text. The system has the following structure:
- Lookup tagger: Performs authority-driven mention detection, i.e. extracts possible mentions of company names with high recall.
- Candidate generation: For each possible company mention, several candidate companies are suggested.
- Feature generation: For each mention-candidate pair, features are generated.
- Classifier: This component finds the correct candidates using the features.
One of the greatest challenges was finding data sources, with licenses acceptable to the competition, to augment the information about the listed companies.
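The four stages above can be compressed into a small sketch. The company data and the scoring rule here are invented for illustration; the real system used DBpedia aliases, Bloomberg symbology data, and a trained classifier:

```python
# Hypothetical authority data: canonical IDs with alias sets.
COMPANIES = {
    "ACME": {"aliases": {"ACME", "Acme Corp", "Acme Corporation"}},
    "GLOBEX": {"aliases": {"Globex", "Globex Corp"}},
}

def lookup_tagger(text):
    """High-recall mention detection: every known alias found in the text."""
    return [a for c in COMPANIES.values() for a in c["aliases"] if a in text]

def candidates(mention):
    """All companies whose alias set contains the mention."""
    return [cid for cid, c in COMPANIES.items() if mention in c["aliases"]]

def features(mention, cid):
    """Toy feature vector for one mention-candidate pair."""
    return {"exact_id_match": float(mention == cid),
            "alias_length": len(mention)}

def classify(feats):
    """Stand-in for the trained classifier: a hand-set linear score."""
    return feats["exact_id_match"] + 0.1 * feats["alias_length"] > 0.5

def tag(text):
    """Run the full pipeline and return the accepted company IDs."""
    hits = set()
    for mention in lookup_tagger(text):
        for cid in candidates(mention):
            if classify(features(mention, cid)):
                hits.add(cid)
    return sorted(hits)
```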
In this challenge the task was to predict users' tags for each question in a dataset of 8 million text questions. See my solution.
Because of the dataset's size, memory was the main bottleneck in this competition, so the data had to be processed in a streaming fashion and, ideally, in parallel. My submission placed 14th out of 367 on the final leaderboard.
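A common way to keep memory bounded in this setting, sketched below under the assumption of one question per line, is to stream the file and use the hashing trick, so feature vectors live in a fixed-size space and no vocabulary is ever stored:

```python
import hashlib

N_BUCKETS = 2 ** 20  # fixed feature space, independent of vocabulary size

def hash_features(tokens, n_buckets=N_BUCKETS):
    """Map a token list to a sparse {bucket_index: count} vector."""
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n_buckets
        vec[h] = vec.get(h, 0) + 1
    return vec

def stream_questions(path):
    """Yield one hashed feature vector per line; the file is never fully loaded."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield hash_features(line.lower().split())
```

Because `stream_questions` is a generator, independent chunks of the file can also be handed to separate worker processes.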
A command-line utility to create different plot types from CSV files, wrapped as an addition to csvkit. See the docs.
The Nuclear Mass Table Toolkit
The Nuclear Mass Table Toolkit provides utilities to work with nuclear mass tables. Project page
Realtime Wikipedia edits visualization
This is a realtime Three.js-based 3D visualization of the Wikipedia edits being made right now. I look for geotags in the corresponding article and place them on a Three.js earth. I still want to make some improvements, such as rotating the camera towards the last edit and adding some text information. This uses PubNub's data stream. See the demo (you may need to wait about a minute for the markers to start showing).
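Placing a marker on the globe means converting an article's latitude and longitude into a point on a sphere. This is the standard conversion such a scene would use (shown here in Python for illustration), with the y axis running through the poles:

```python
import math

def latlon_to_xyz(lat_deg, lon_deg, radius=1.0):
    """Convert latitude/longitude in degrees to Cartesian coordinates
    on a sphere of the given radius (y axis through the poles)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.sin(lat)
    z = -radius * math.cos(lat) * math.sin(lon)
    return x, y, z
```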
War distribution visualization
Solution to the Twitter Sentiment assignment in Introduction to Data Science (Coursera). See the code.
Some experiments on parsing dates from text: demo
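A minimal version of that experiment tries a list of known formats in order and returns the first one that parses. The format list here is illustrative; the actual demo handled far messier inputs:

```python
from datetime import datetime

# Candidate formats, tried in order (illustrative, not exhaustive).
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y"]

def parse_date(text):
    """Return a datetime for the first matching format, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None
```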