Predicting Hispanic origin using names


How much can your name tell about you? I was curious to find out whether it is possible to predict a person's ethnicity just by looking at their name. I know a lot about Hispanic names, so let's try to build a model that predicts Hispanic origin.

Getting the data

After a quick literature search I found that the state of the art on surname ethnicity is an article published by the U.S. Census Bureau: Demographic Aspects of Surnames from Census 2000 by D.L. Word, C.D. Coleman, R. Nunziata and R. Kominski. The digital version of the article comes with two accompanying data files, but the link on the census.gov website is broken and, to my surprise, I couldn't find this csv anywhere. (Update: Found it)

Fortunately Mongabay has an HTML copy. While the days of writing one-off scraping scripts may be numbered (Parsehub, Kimono), in this case it is easy enough to get the data ourselves.

In [1]:
base_url = 'http://names.mongabay.com/data'
pages = ['{}/{}.html'.format(base_url, i) for i in range(1000, 50000, 1000)]
pages[:3]
Out[1]:
['http://names.mongabay.com/data/1000.html',
 'http://names.mongabay.com/data/2000.html',
 'http://names.mongabay.com/data/3000.html']

I find pyquery simpler to use, but Beautiful Soup is a fine alternative.

In [2]:
from pyquery import PyQuery as pq

names, percents = [], []
for url in pages:
    d = pq(url)
    names.extend([e.text() for e in d.items('.boldtable td:nth-child(1)')][1:])
    percents.extend([e.text() for e in d.items('td:nth-child(11)')][1:])

len(percents), len(names)
Out[2]:
(49057, 49057)
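
For reference, a rough Beautiful Soup equivalent of the cell above; a sketch assuming requests and bs4 are installed and that every data row of the Mongabay table has the same 11 columns:

from bs4 import BeautifulSoup
import requests

bs_names, bs_percents = [], []
for url in pages:
    soup = BeautifulSoup(requests.get(url).text)
    for row in soup.select('.boldtable tr')[1:]:       # skip the header row
        cells = [td.get_text() for td in row.find_all('td')]
        bs_names.append(cells[0])                      # surname column
        bs_percents.append(cells[10])                  # hispanic percent column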

Data cleaning

There are close to 49k names, and each name has an associated "hispanic percentage": the percentage of people with that surname who reported Hispanic origin. Some values are missing from the percent column ("fields suppressed for confidentiality"), so we drop them:

In [3]:
import pandas as pd

df = pd.DataFrame({'name':names, 'percent': percents})
df = df[df.percent != '(S)']             # eliminate missing values
df.percent = df.percent.astype(float)
df.head()
Out[3]:
name percent
0 SMITH 1.56
1 JOHNSON 1.50
2 WILLIAMS 1.60
3 BROWN 1.64
4 JONES 1.44

In order to create our training set we have to select positive (Hispanic) and negative (non-Hispanic) examples. Let's take a quick look at the data distribution:

In [167]:
df.percent.hist(bins=100)

From the histogram and a visual inspection of the names, percent > 75 and percent < 10 look like good criteria for separating the two groups. We lower the threshold for negative examples from 10 to 0.85 in order to get a better balanced dataset.

In [4]:
def is_hispanic(x):
    return 1 if x.percent > 75 else 0 if x.percent < 0.85 else None

df['class'] = df.apply(is_hispanic, axis=1)
df.dropna(inplace=True)                   # drop examples that fall in neither class

print df['class'].value_counts()          # number of examples in each class
df.tail()
0    2845
1    2603
dtype: int64
Out[4]:
name percent class
48938 PICASO 83.04 1
48981 DEKAM 0.00 0
49017 MATKOVICH 0.00 0
49033 RAUDALES 94.42 1
49036 RUZ 77.92 1

Looks good: Picaso and Ruz sound pretty Hispanic to me.

Another approach would be to regress the percent column directly. This has the advantage that the percent information is automatically encoded in the loss.
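
A minimal sketch of that alternative, assuming a character n-gram feature matrix X like the one built in the next section (SVR swapped in for the classifier; not pursued further here):

from sklearn.svm import SVR
from sklearn.cross_validation import train_test_split

# Regress the raw percentages instead of a binary class
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X, df.percent.values, test_size=0.33, random_state=0)
svr = SVR(kernel='rbf').fit(Xr_train, yr_train)
svr.score(Xr_test, yr_test)               # R^2 on the held-out names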

Feature engineering

Given the data, the most important feature would be the percent column itself. For names not in the table we would have to impute the missing values; I think assigning 0 percent would be a reasonable estimate. For these cases to be correctly classified, some feature ideas come to mind (sketched in code after this list):

  • first and last character
  • length of the name
  • ratio of consonants to vowels, number of consonants, etc
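
A minimal sketch of how such handcrafted features could be computed (hand_features is a hypothetical helper; we don't actually use it below):

def hand_features(name):
    """Hypothetical handcrafted features; for illustration only."""
    vowels = sum(name.count(v) for v in 'AEIOU')
    return {'first_char': name[0],
            'last_char': name[-1],
            'length': len(name),
            'consonant_ratio': (len(name) - vowels) / float(len(name))}

hand_features('MATKOVICH')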

But the best features will probably be found among character n-grams (including single characters). For example, you would never find "ich" in a Spanish word, which means that names such as "MATKOVICH" can automatically be classified as non-Hispanic. Some letters like k or w are also really rare. If we wanted to get really fancy we could even add n-gram/position combinations like ich-at_end_of_name, etc. For this quick experiment we will only use n-grams. This is how our feature matrix looks:

In [43]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 4), max_features=30)
X = vectorizer.fit_transform(df.name)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names()).head()
Out[43]:
a al an ar b c d e en er ... p r ra re s t u v y z
0 2 0 0 1 0 1 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 1 0 0 ... 0 2 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 1 0 0 ... 0 1 0 0 0 1 0 0 0 1
3 1 0 1 0 0 0 1 2 0 1 ... 0 1 0 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 1 0 0 ... 1 0 0 0 0 0 0 0 0 1

5 rows × 30 columns
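
To see exactly which n-grams the vectorizer extracts from a single name, we can call its analyzer directly (a quick check; note that CountVectorizer lowercases its input by default):

analyze = vectorizer.build_analyzer()
analyze('RUZ')                            # ['r', 'u', 'z', 'ru', 'uz', 'ruz']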

It is always good to have an idea of how much memory our data requires. In this case we have nothing to worry about: for example, using the top 10000 features only takes about 1MB of memory.

In [42]:
def sparse_size_mb(X):
    """Return the size in MB of a sparse matrix
    """
    return (X.data.nbytes + X.indptr.nbytes + X.indices.nbytes)  / 1024.0**2

sparse_size_mb(X)
Out[42]:
1.2333450317382812
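
The 30-column preview above is obviously tiny; the ~1.2MB figure presumably corresponds to a matrix vectorized with the top 10000 n-grams, something like this sketch:

X10k = CountVectorizer(analyzer='char', ngram_range=(1, 4),
                       max_features=10000).fit_transform(df.name)
sparse_size_mb(X10k)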

Model training

We chose an RBF-kernel Support Vector Classifier for our model. Let's take a quick look at its accuracy:

In [3]:
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

y = df['class'].values                    # labels built during data cleaning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = SVC().fit(X_train, y_train)
model.score(X_test, y_test)               # mean accuracy on the test set
Out[3]:
0.88876529477196886
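
Beyond the single accuracy number, a confusion matrix shows how the errors split between the two classes (a quick sketch on the same test split):

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(X_test))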

Almost 90 percent accuracy, not bad for an unoptimized model. There is a very promising project for hyperparameter optimization with sklearn integration called hyperopt-sklearn, but at this point it is still in a very alpha stage. So, in order to find the best hyperparameters for the model, we run a randomized cross-validation search over 10000 parameter samples. I also use a Pipeline so that the number of features in the char vectorizer can be optimized along with everything else:

In [25]:
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import Pipeline
import numpy as np
import scipy.stats

pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('svc', SVC()),
])

param_dist = {'vect__max_features': (None, 100, 1000, 10000),
              'svc__C': scipy.stats.expon(scale=100),
              'tfidf__use_idf': (True, False),
              'svc__gamma': scipy.stats.expon(scale=.1)}

np.random.seed(0)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_jobs=-1, n_iter=10000)
X_train, X_test, y_train, y_test = train_test_split(df.name, df['class'], test_size=0.33, random_state=0)
random_search.fit(X_train, y_train);

Now lets take a look at the best models:

In [44]:
from operator import itemgetter
import numpy as np

top = sorted(random_search.grid_scores_, key=itemgetter(1), reverse=True)[:5]
rows = [s.parameters.values() + [s.mean_validation_score, np.std(s.cv_validation_scores)]
        for s in top]
pd.DataFrame(rows, columns=top[0].parameters.keys() + ['mean_validation_score', 'std_validation_scores'])
Out[44]:
svc__gamma tfidf__use_idf svc__C vect__max_features mean_validation_score std_validation_scores
0 0.091623 False 48.995986 10000 0.924658 0.007397
1 0.072100 False 54.169315 NaN 0.924658 0.006064
2 0.141401 False 29.835048 10000 0.924384 0.007447
3 0.061467 False 77.135588 NaN 0.924384 0.006686
4 0.065278 False 64.438253 10000 0.924384 0.006376

The results seem quite consistent. The optimal region has the regularization parameter C on the order of $10^1$ and gamma on the order of $10^{-1}$; it is better to disable inverse document frequency reweighting, and the number of features should be large, 10000 or even the whole vocabulary. Let's check the score on our test set:

In [46]:
random_search.best_estimator_.score(X_test, y_test)
Out[46]:
0.92436040044493883

No surprises here.
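
If we wanted a standalone model, the winning region could also be hard-coded into a fresh pipeline (a sketch with parameter values rounded from the table above, not the exact best estimator):

final_model = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 4), max_features=10000)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('svc', SVC(C=50, gamma=0.09)),
]).fit(X_train, y_train)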

Prediction

Let's try a couple of names:

In [9]:
test_names = ['SMITH', 'MARTINEZ', 'BUFFET', 'PICASSO']
model = random_search.best_estimator_
pd.DataFrame([model.predict(test_names)], columns=test_names)
Out[9]:
SMITH MARTINEZ BUFFET PICASSO
0 0 1 0 1
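
To power a demo like the one below, the fitted pipeline can be persisted and reloaded in the serving app (a sketch using sklearn's bundled joblib; the filename is arbitrary):

from sklearn.externals import joblib

joblib.dump(model, 'hispanic_names_model.pkl')    # save the fitted pipeline
model = joblib.load('hispanic_names_model.pkl')   # reload it in the demo app
model.predict(['GARCIA'])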

Want to try yours? I have set up a quick demo here.
