Stack Exchange tag predictions


I recently had the chance to participate in a Kaggle competition. Because it was my first one, I made a lot of mistakes that ended up costing me a lot of time. I had actually started to play with the Titanic challenge before, but my motivation evaporated when I saw competitors with scores close to 100% on the leaderboard. This ruined my competitive spirit, as there was no way to gauge the achievable accuracy for that competition.

The formulation of the problem was actually quite simple: predict the tags that a user would choose for a specific Stack Exchange question, given only the question text and its title. Participants were given a training dataset of approx. 6 million rows with the tags for each question, and a test dataset of approx. 2 million rows without tags, on which the predictions were to be made.

The first thing I did when I decided to enter the competition was to look for previous research. There was in fact a lot of information, but particularly interesting was Task 1 (content-based recommendation) of the ECML PKDD Discovery Challenge 2009. In the proceedings one can find many different approaches to the tag recommendation problem based only on the content of the post, i.e. no information about users, resources or tags is available in the training data. Unfortunately, I was intimidated by the size of the dataset in Kaggle's competition and discarded many of the most attractive approaches I found there. In retrospect this was a big mistake: in the end the data was not really that big (less than 10 GB), and I could also have tried extracting a representative sample of the data small enough to fit in RAM.

It was clear that in order to obtain predictions for a given post, some set of tags had to be extracted and ranked. Then some sort of threshold could be established to select the final tags of the submission. The set of all tags in the training set (42,000 in total) looked like a good candidate set to rank, but that idea comes with an attached performance trade-off.

A second problem was tokenizing and encoding the text information. My first approach was to extract meta-features like has_code, is_html_tag, is_email, etc. I used this "improved" representation of the documents (plus all the usual bag-of-words features) to make similarity queries for each document in the test set (bow ⇒ tf-idf ⇒ lsa). From the most similar documents in the training set I extracted the tags and combined them to make my prediction. This made more sense to me than blindly clustering, as it would be difficult to determine the right number of clusters. Nevertheless, I tried both methods on a small validation set and didn't get the results I was expecting (an F1 score under 0.70). I still think this approach has potential and needs to be explored further.
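To illustrate the nearest-neighbour idea, here is a minimal, self-contained sketch of the similarity step using plain tf-idf and cosine similarity (the meta-features and the LSA projection are left out, and the tiny corpus and tags are invented for the example):

```python
import math
from collections import Counter

# hypothetical miniature training set: (text, tags)
train = [
    ("error invalid type list for variable", {'r'}),
    ("fetch xml feed windows phone app",     {'xml', 'c#'}),
    ("parse xml feed in python",             {'xml', 'python'}),
]

n_docs = len(train)
# document frequency of each word
df = Counter(w for text, _ in train for w in set(text.split()))

def tfidf(text):
    """Bag of words weighted by smoothed inverse document frequency."""
    tf = Counter(text.split())
    return {w: c * math.log((1 + n_docs) / (1 + df[w])) for w, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vectors = [tfidf(text) for text, _ in train]

def knn_tags(query, k=2):
    """Combine the tags of the k most similar training posts, weighted by similarity."""
    q = tfidf(query)
    neighbours = sorted(((cosine(q, v), tags) for v, (_, tags) in zip(vectors, train)),
                        key=lambda pair: pair[0], reverse=True)[:k]
    scores = Counter()
    for sim, tags in neighbours:
        for tag in tags:
            scores[tag] += sim
    return scores

knn_tags("parse xml feed")
# 'xml' comes out on top, supported by two similar posts
```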

The solution

Memory was the main bottleneck in this competition. To work around this limitation, most of the data structures had to be calculated in a streaming fashion and immediately afterwards serialized to disk and deleted from memory. I used two helper functions: a generator function to stream lines one at a time (because the whole dataset wouldn't fit in my 8 GB of memory), and a tokenizer that generates valid Stack Exchange-style tags, i.e. all characters in the set [a-zA-Z0-9_+-.#], excluding trailing dots:

In [ ]:
import csv
import re
from itertools import islice

STOPWORDS = set()  # e.g. a standard English stopword list

def line_stream(filename, stop=None):
    """Streams `filename` one CSV row at a time, optionally stopping at row `stop`"""
    with open(filename) as csvfile:
        next(csvfile, None)  # skip header
        for line in islice(csv.reader(csvfile, delimiter=',', quotechar='"'), stop):
            yield line

def title_tokenize(s, stopwords=STOPWORDS):
    """Extract valid SO tag style tokens from string `s`, excluding tokens in `stopwords`"""
    return [token for token in re.findall(r'\b\w[\w#+.-]*(?<!\.)', s.lower())
            if token not in stopwords]

I adapted the winning solution of the ECML PKDD Discovery Challenge 2009 by Lipczak et al. and added some small improvements from the other solutions presented there. The algorithm goes like this: three basic tag recommenders get combined to form the final submission, namely a basic recommender, a title ⇒ tag recommender and a tag ⇒ tag recommender.

Basic recommender:

Words from the title are scored according to their "usefulness", i.e. how many times, out of its total appearances, each word leads to a correct tag (co-occurrence). The title is a natural summary of the post's content, which means it plays a similar role to the tags. This is a very precise recommender but has low recall, so the set of tags has to be extended by the title ⇒ tag and the tag ⇒ tag recommenders.

In [ ]:
from collections import Counter

def basic_recommender(title, usefulness=usefulness):
    """Score the words of `title` by their usefulness"""
    return Counter({word: usefulness.get(word, 0)
                    for word in title_tokenize(title)})

title = "R Error Invalid type (list) for variable"
# Counter({'r': 0.5847612568561589, 'list': 0.1565180102915952,'invalid': 0.01734960767218832, ...

The usefulness is just a dictionary that can be built with:

In [ ]:
from collections import defaultdict

total, common = defaultdict(int), defaultdict(int)
usefulness = {}

for line in line_stream(data_dir + 'Train.csv'):
    ID, title, body, tags = line
    title = title_tokenize(title)
    tags = set(tags.split())
    for word in title:
        total[word] += 1
        if word in tags:
            common[word] += 1

for word in total:
    usefulness[word] = common[word] / total[word]
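To make the usefulness score concrete, here is a toy run on two hypothetical rows (with the titles already tokenized):

```python
from collections import defaultdict

# two hypothetical rows: (title tokens, tag set)
rows = [
    (['r', 'invalid', 'list'], {'r', 'list'}),
    (['r', 'plot', 'error'],   {'r', 'ggplot2'}),
]

total, common = defaultdict(int), defaultdict(int)
for title, tags in rows:
    for word in title:
        total[word] += 1
        if word in tags:          # the title word is also a tag of the post
            common[word] += 1

usefulness = {word: common[word] / total[word] for word in total}
# 'r' appears twice and is a tag both times  -> usefulness 1.0
# 'list' appears once and is a tag that time -> usefulness 1.0
# 'plot' appears once and is never a tag     -> usefulness 0.0
```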

Tag ⇒ Tag recommender:

This exploits the fact that some tags are more likely to be found together, so I built a tag ⇒ tag graph to expand the tags from the basic recommender. First we must construct the graph:

In [ ]:
import networkx as nx
from itertools import combinations

G = nx.Graph()

for line in line_stream(data_dir + 'Train.csv'):
    ID, title, body, tags = line
    tags = tags.split()
    # add tag nodes, counting occurrences of each tag
    for tag in tags:
        if tag:
            if not G.has_node(tag):
                G.add_node(tag, tag_count=1)
            else:
                G.node[tag]['tag_count'] += 1
    # add edges, counting co-occurrences of each tag pair
    for ni, nj in combinations(tags, 2):
        if not G.has_edge(ni, nj):
            G.add_edge(ni, nj, weight=1)
        else:
            G.edge[ni][nj]['weight'] += 1

Now we can calculate the scores via:

$$score(t_1, t_2) = \frac{occurrences({t_1\ and\ t_2})} {occurrences({t_1})}$$

where $occurrences({t_1\ and\ t_2})$ is the number of times that tags $t_1$ and $t_2$ appear together. The recommender also ranks the tags by score and selects the top n_recs, a parameter that needs to be tuned later:

In [ ]:
def tag2tagrecommender(tags, n_recs):
    """Given a Counter with `tag: score` pairs, generate and score associated tags"""
    total_scores = Counter()
    # eliminate tokens with 0 usefulness, i.e. not in the training set
    tags -= Counter()
    for tag in tags:
        tag_scores = Counter({nj: tags[tag] * G.edge[tag][nj]['weight'] / G.node[tag]['tag_count']
                              for _, nj in G.edges(tag)})
        tag_scores = Counter(dict(tag_scores.most_common(n_recs)))  # keep best n_recs
        total_scores = combine_tags(total_scores, tag_scores)
    return total_scores

title = "Creating a repetitive node jquery html from a hash array with simplexml_load_string, a cycle and variables"
tags = basic_recommender(title)
tag2tagrecommender(tags, 3)
# Counter({'javascript': 0.485, 'jquery': 0.222, 'css': 0.183, 'html': 0.100, 'ajax': 0.086 ...

Because each tag in the title generates its own set of associated tags, the sets have to be combined probabilistically,

$$\mathrm{score}_{combined} = 1-\prod_i{(1-\mathrm{score}_i)}$$

and this is precisely what combine_tags does:

In [ ]:
def combine_tags(c1, c2):
    """Probabilistic union of two tag sets.
    c1, c2: Counters with tags as keys, probabilities as values

        c1 = Counter({'php': 0.2, 'xml': 0.14})
        c2 = Counter({'php': 0.1, 'jquery': 0.1})
    """
    for w in c1:
        if w in c2:
            c1[w] = 1 - (1 - c1[w])*(1 - c2[w])
    return Counter(dict(c2, **c1))             # add missing items from c2
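Running it on the Counters from its docstring (the function is repeated here so the snippet stands alone):

```python
from collections import Counter

def combine_tags(c1, c2):
    """Probabilistic union of two tag sets (as defined above)."""
    for w in c1:
        if w in c2:
            c1[w] = 1 - (1 - c1[w]) * (1 - c2[w])
    return Counter(dict(c2, **c1))  # add missing items from c2

combined = combine_tags(Counter({'php': 0.2, 'xml': 0.14}),
                        Counter({'php': 0.1, 'jquery': 0.1}))
# 'php' combines to 1 - (1 - 0.2)*(1 - 0.1) = 0.28;
# 'xml' and 'jquery' pass through unchanged
```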

Title ⇒ Tag recommender:

For each word in the title, a set of tags is recommended by analyzing the corresponding nodes in a prebuilt title ⇒ tag graph whose edge weights are proportional to word-tag co-occurrence. To build the graph:

In [ ]:
Y = nx.Graph()

for line in line_stream(data_dir + 'Train.csv'):
    ID, title, body, tags = line
    tags = tags.split()
    title = title_tokenize(title)
    for word in title:
        if not word:
            continue
        if not Y.has_node(word):                 # add a new word node
            Y.add_node(word, is_word=1, count=1)
        elif Y.node[word].get('is_word') == 1:   # the word has been seen before,
            Y.node[word]['count'] += 1           # so increment its count
        else:                                    # the node existed only as a tag:
            Y.node[word]['is_word'] = 1          # mark it as a word too
            Y.node[word]['count'] = 1
        for tag in tags:
            if not tag:
                continue
            if not Y.has_node(tag):              # add a new tag node
                Y.add_node(tag)
            Y.node[tag]['is_tag'] = 1            # a node may be both word and tag
            # add or update the word -> tag edge
            if not Y.has_edge(word, tag):
                Y.add_edge(word, tag, weight=1)
            else:
                Y.edge[word][tag]['weight'] += 1

And the scores are calculated similarly to the Tag⇒Tag recommender with:

In [ ]:
from collections import Counter

def title2tagrecommender(tags, n_recs=10):
    """Given a Counter of scored title words, generate and score associated tags"""
    total_scores = Counter()
    for word in tags:
        if Y.has_node(word) and Y.node[word].get('count'):
            tag_scores = Counter({nj: Y.edge[word][nj]['weight'] / Y.node[word]['count']
                                  for _, nj in Y.edges(word)
                                  if Y.node[nj].get('is_tag') == 1})
            tag_scores = Counter(dict(tag_scores.most_common(n_recs)))  # keep best n_recs
            total_scores = combine_tags(total_scores, tag_scores)
    return total_scores - Counter()  # drop zero scores

title = "Creating a repetitive node from a hash array with simplexml_load_string, a cycle and variables"
tags = basic_recommender(title)
title2tagrecommender(tags), tags
# Counter({'php': 0.970, 'xml': 0.730, 'jquery': 0.638, 'simplexml-load-string': 0.530, 'javascript': 0.527 ...

Normalize and combine recommenders:

Finally, the recommendations from each recommender must be normalized and combined probabilistically. Some recommenders are more precise than others, so they should be weighted accordingly. I tuned the weights using ridge regression on a subset of the data (for speed), but didn't save the optimal parameters before I shut down my EC2 instance:

In [ ]:
def normalize(rec, weight=1):
    """Normalizes a recommendation proportionally to the score of its leading tag"""
    if not rec:
        return Counter()    # empty recommendation
    max_score = rec[max(rec, key=rec.get)]
    for tag in rec:
        rec[tag] = weight * rec[tag] / max_score
    return rec

def meta_recommender(recommendations, weights):
    """Probabilistically combine `recommendations` with corresponding `weights`"""
    total = Counter()
    for rec, weight in zip(recommendations, weights):
        rec = normalize(rec, weight)
        total = combine_tags(total, rec)    # accumulate tag scores
    return total

Selecting tags for submission:

I just used a manually tuned threshold on the probability. In my opinion this is the point where the solution could be improved the most.

In [ ]:
def select_tags(rec, threshold=0.3):
    """Select at most 5 tags with score greater than `threshold`"""
    return [tag for tag, score in rec.most_common(5) if score > threshold]
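For instance, with some made-up scores (repeating the one-liner so the snippet stands alone):

```python
from collections import Counter

def select_tags(rec, threshold=0.3):
    """As defined above: at most 5 tags scoring above `threshold`."""
    return [tag for tag, score in rec.most_common(5) if score > threshold]

rec = Counter({'php': 0.97, 'xml': 0.73, 'c#': 0.31, 'jquery': 0.12})
select_tags(rec)
# ['php', 'xml', 'c#'], since 'jquery' falls below the threshold
```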

Now to generate a submission we can do:

In [ ]:
title = "How to fetch an XML feed using <p>I've decided to convert a Windows Phone 7 app that fetches an XML feed and then parses it to an web app, using Visual Web Developer Express."

basic = basic_recommender(title)
tag2tag = tag2tagrecommender(basic, 10)
title2tag = title2tagrecommender(basic, 10)

recommendations = meta_recommender([basic, tag2tag, title2tag], [0.60, 0.33, 0.33])
select_tags(recommendations)
# ['', 'xml', 'c#', 'windows', 'php']

Submission result

I was unpleasantly surprised when I found out that, even though the whole training part of the code could run in less than an hour, the prediction part would need around 19 hours to finish. This wouldn't normally be a problem, were it not for the fact that I had only 19 hours left to submit my solution.

Fortunately, the algorithm was relatively easy to parallelize and thanks to Amazon EC2 it took around 3 hours to finish all predictions in a cc2.8xlarge instance (wonderful machines by the way).
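The parallelization amounted to splitting the test set into chunks and mapping a per-row prediction over a process pool. A minimal sketch of that pattern (predict_one is a hypothetical stand-in for the full three-recommender pipeline):

```python
from multiprocessing import Pool

def predict_one(row):
    """Hypothetical stand-in for the real pipeline: run the three
    recommenders on one row and select the final tags."""
    ID, title = row
    return ID, title.split()[:5]  # placeholder prediction

def chunked(rows, size):
    """Split `rows` into lists of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

if __name__ == '__main__':
    rows = [('1', 'r error invalid type'), ('2', 'fetch xml feed')]
    predictions = []
    with Pool(processes=4) as pool:
        # each chunk is mapped over the workers; map preserves row order
        for chunk in chunked(rows, 1000):
            predictions.extend(pool.map(predict_one, chunk))
```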

In the end I placed 16th (0.78561) on the public leaderboard and 14th (0.78562) on the private leaderboard. There was almost no score change on my part, which suggests the four-position improvement can be attributed to slightly overfit solutions from other competitors. All in all it was a great experience, and I fully intend to participate again.
