In this post, I will introduce you to Named Entity Recognition (NER). NER is a subtask of natural language processing (NLP) and information retrieval (IR). The task in NER is to assign an entity type to each word in a text. Entities can, for example, be locations, time expressions or names. If you want to run the tutorial yourself, you can find the dataset here. Now we load it and peek at a few examples.
import pandas as pd
import numpy as np
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas
data.tail(10)
 | Sentence # | Word | POS | Tag |
---|---|---|---|---|
1048565 | Sentence: 47958 | impact | NN | O |
1048566 | Sentence: 47958 | . | . | O |
1048567 | Sentence: 47959 | Indian | JJ | B-gpe |
1048568 | Sentence: 47959 | forces | NNS | O |
1048569 | Sentence: 47959 | said | VBD | O |
1048570 | Sentence: 47959 | they | PRP | O |
1048571 | Sentence: 47959 | responded | VBD | O |
1048572 | Sentence: 47959 | to | TO | O |
1048573 | Sentence: 47959 | the | DT | O |
1048574 | Sentence: 47959 | attack | NN | O |
words = list(set(data["Word"].values))
n_words = len(words); n_words
35178
So we have 47959 sentences containing 35178 different words.
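The forward fill above is presumably needed because the CSV carries the sentence label only on the first token of each sentence. Here is a minimal plain-Python sketch of that idea on made-up rows (the toy data and the helper `forward_fill` are my own illustration, not part of the dataset or of pandas):

```python
# Toy rows in the assumed layout of the CSV: the sentence label appears only
# on the first token of a sentence, the remaining labels are missing (None).
rows = [
    ("Sentence: 1", "They"),
    (None, "marched"),
    (None, "."),
    ("Sentence: 2", "They"),
    (None, "responded"),
    (None, "."),
]

def forward_fill(labels):
    """Replace each missing label with the last label seen before it."""
    filled, last = [], None
    for label in labels:
        if label is not None:
            last = label
        filled.append(last)
    return filled

sentence_ids = forward_fill([label for label, _ in rows])
toy_words = [word for _, word in rows]

# Count sentences and distinct words, as done for the real data above.
n_toy_sentences = len(set(sentence_ids))
n_toy_words = len(set(toy_words))
print(sentence_ids)
print(n_toy_sentences, n_toy_words)
```

After filling, every token knows which sentence it belongs to, so counting distinct labels gives the number of sentences.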
We start by writing a small class to retrieve a sentence from the dataset.
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False

    def get_next(self):
        # Select all rows that belong to the current sentence.
        s = self.data[self.data["Sentence #"] == "Sentence: {}".format(self.n_sent)]
        if s.empty:
            # Filtering never raises, so we detect the end of the data explicitly.
            self.empty = True
            return None, None, None
        self.n_sent += 1
        return s["Word"].values.tolist(), s["POS"].values.tolist(), s["Tag"].values.tolist()
getter = SentenceGetter(data)
sent, pos, tag = getter.get_next()
This is how a sentence looks.
print(sent); print(pos); print(tag)
['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament', 'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']
['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', 'TO', 'DT', 'NN', 'IN', 'NNP', 'NNP', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']
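The tags look like the common scheme where `B-` marks the first token of an entity, `I-` a continuation and `O` everything else. Assuming that reading is right, a small sketch like the following groups tokens back into entities (the helper `extract_entities` is hypothetical, I only use it here for illustration):

```python
def extract_entities(words, tags):
    """Group tokens into (text, type) spans, assuming B-/I-/O style tags."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity (and closes an open one).
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_words.append(word)
        else:
            # 'O' closes the open entity; a stray I- tag is simply ignored.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

example_sent = ['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament',
                'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']
example_tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
                'B-geo', 'I-geo', 'O']
print(extract_entities(example_sent, example_tags))  # [('Hyde Park', 'geo')]
```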
A first simple baseline is to remember the most common named entity tag for every word and predict that. If we don’t know a word, we just predict ‘O’. The following class does that. I implement it by inheriting from scikit-learn base classes, so that the class works with the built-in cross-validation.
from sklearn.base import BaseEstimator, TransformerMixin

class MemoryTagger(BaseEstimator, TransformerMixin):

    def fit(self, X, y):
        '''
        Expects a list of words as X and a list of tags as y.
        '''
        voc = {}
        self.tags = []
        for x, t in zip(X, y):
            if t not in self.tags:
                self.tags.append(t)
            # Count how often each tag occurs for each word.
            if x in voc:
                if t in voc[x]:
                    voc[x][t] += 1
                else:
                    voc[x][t] = 1
            else:
                voc[x] = {t: 1}
        # Remember only the most frequent tag per word.
        self.memory = {}
        for k, d in voc.items():
            self.memory[k] = max(d, key=d.get)
        return self

    def predict(self, X, y=None):
        '''
        Predict the tag from memory. If word is unknown, predict 'O'.
        '''
        return [self.memory.get(x, 'O') for x in X]
tagger = MemoryTagger()
tagger.fit(sent, tag)
print(tagger.predict(sent))
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
tagger.tags
['O', 'B-geo', 'B-gpe']
Okay, it looks like it basically works. Now we do a 5-fold cross-validation.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
words = data["Word"].values.tolist()
tags = data["Tag"].values.tolist()
pred = cross_val_predict(estimator=MemoryTagger(), X=words, y=tags, cv=5)
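To make clear what the cross-validated predictions mean, here is a rough plain-Python sketch of the idea: every token is predicted by a tagger that was fit on the other folds only. The contiguous fold boundaries and the helpers `fit_memory` and `out_of_fold_predict` are my own simplifications, not what scikit-learn does internally.

```python
from collections import Counter, defaultdict

def fit_memory(words, tags):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def out_of_fold_predict(words, tags, n_folds):
    """Predict every token with a memory that never saw that token's fold."""
    n = len(words)
    # Contiguous, near-equal folds without shuffling (a simplifying assumption).
    bounds = [round(i * n / n_folds) for i in range(n_folds + 1)]
    predictions = [None] * n
    for start, stop in zip(bounds[:-1], bounds[1:]):
        train_words = words[:start] + words[stop:]
        train_tags = tags[:start] + tags[stop:]
        memory = fit_memory(train_words, train_tags)
        for i in range(start, stop):
            # Unknown words fall back to 'O', like in the MemoryTagger.
            predictions[i] = memory.get(words[i], "O")
    return predictions

toy_words = ["London", "is", "big", "London", "is", "old", "Paris", "is", "big"]
toy_tags = ["B-geo", "O", "O", "B-geo", "O", "O", "B-geo", "O", "O"]
toy_pred = out_of_fold_predict(toy_words, toy_tags, n_folds=3)
print(toy_pred)
```

In this toy run "Paris" only occurs in the last fold, so the memory built from the other folds has never seen it and predicts 'O' for it.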
We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the F1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
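As a quick refresher, this is how the three numbers are computed for a single tag. This is a hand-rolled sketch on made-up tag lists, not the scikit-learn implementation; the function name `precision_recall_f1` is just for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-label precision, recall and F1 from two equal-length tag lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
    fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = ["B-geo", "O", "B-geo", "O", "B-geo", "O"]
y_pred = ["B-geo", "O", "O", "O", "B-geo", "O"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "B-geo")
print(precision, recall, f1)
```

Here every predicted `B-geo` is correct (precision 1.0), but one of the three true `B-geo` tokens is missed (recall 2/3), which gives an F1 of 0.8.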
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support
      B-art       0.23      0.06      0.10       402
      B-eve       0.50      0.25      0.33       308
      B-geo       0.78      0.84      0.81     37644
      B-gpe       0.94      0.93      0.94     15870
      B-nat       0.41      0.28      0.33       201
      B-org       0.66      0.49      0.56     20143
      B-per       0.79      0.64      0.71     16990
      B-tim       0.87      0.77      0.82     20333
      I-art       0.04      0.01      0.01       297
      I-eve       0.36      0.12      0.18       253
      I-geo       0.73      0.58      0.65      7414
      I-gpe       0.61      0.45      0.52       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.69      0.53      0.60     16784
      I-per       0.73      0.66      0.69     17251
      I-tim       0.56      0.13      0.21      6528
          O       0.97      0.99      0.98    887908
avg / total       0.94      0.95      0.94   1048575
This doesn’t look so bad! The precision is quite reasonable, but as you might have guessed, the recall is pretty weak. This is because we cannot predict tags for words we don’t know. To overcome this issue, we will now introduce a simple machine learning model to predict the named entities. To do so, we convert every word into a simple feature vector and then use a random forest to classify the words.
from sklearn.ensemble import RandomForestClassifier
The simplest feature map only contains information about the word itself.
def feature_map(word):
    '''Simple feature map.'''
    return np.array([word.istitle(), word.islower(), word.isupper(), len(word), word.isdigit(), word.isalpha()])
words = [feature_map(w) for w in data["Word"].values.tolist()]
pred = cross_val_predict(RandomForestClassifier(n_estimators=20),
                         X=words, y=tags, cv=5)
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support
      B-art       0.00      0.00      0.00       402
      B-eve       0.00      0.00      0.00       308
      B-geo       0.26      0.80      0.40     37644
      B-gpe       0.25      0.04      0.07     15870
      B-nat       0.00      0.00      0.00       201
      B-org       0.65      0.17      0.27     20143
      B-per       0.97      0.20      0.33     16990
      B-tim       0.29      0.32      0.30     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.00      0.00      0.00       253
      I-geo       0.00      0.00      0.00      7414
      I-gpe       0.00      0.00      0.00       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.36      0.03      0.06     16784
      I-per       0.47      0.02      0.04     17251
      I-tim       0.50      0.06      0.11      6528
          O       0.97      0.98      0.97    887908
avg / total       0.88      0.87      0.86   1048575
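One way to read this report: the six word-shape features cannot tell apart tokens that merely look alike. A small sketch with hand-picked tokens (my own examples, not taken from the dataset) shows the collision; it returns a tuple instead of a NumPy array only so that the vectors can be put into a set.

```python
def shape_features(word):
    """The same six word-shape features as in feature_map, as a plain tuple."""
    return (word.istitle(), word.islower(), word.isupper(),
            len(word), word.isdigit(), word.isalpha())

# Hand-picked tokens: a city, a weekday, a company and an ordinary word.
tokens = ["London", "Monday", "Google", "Thanks"]
features = {token: shape_features(token) for token in tokens}
distinct = set(features.values())

for token, vector in features.items():
    print(token, vector)
print(len(distinct))
```

All four tokens map to the same vector `(True, False, False, 6, False, True)`, so no classifier working on these features alone can assign them different tags.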
Wow, that looks really bad. This is expected, since the features lack a lot of the information necessary for the decision. So now we enhance our simple features, on the one hand with the prediction of the memory tagger and on the other hand with context information from the neighboring words.
from sklearn.preprocessing import LabelEncoder

class FeatureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.memory_tagger = MemoryTagger()
        self.tag_encoder = LabelEncoder()
        self.pos_encoder = LabelEncoder()

    def fit(self, X, y):
        words = X["Word"].values.tolist()
        self.pos = X["POS"].values.tolist()
        tags = X["Tag"].values.tolist()
        self.memory_tagger.fit(words, tags)
        self.tag_encoder.fit(tags)
        self.pos_encoder.fit(self.pos)
        return self

    def transform(self, X, y=None):
        def pos_default(p):
            # Encode a POS tag; tags not seen during fit map to -1.
            if p in self.pos:
                return self.pos_encoder.transform([p])[0]
            else:
                return -1

        pos = X["POS"].values.tolist()
        words = X["Word"].values.tolist()
        out = []
        for i in range(len(words)):
            w = words[i]
            p = pos[i]
            # Right context: memory tag and POS of the next word.
            if i < len(words) - 1:
                wp = self.tag_encoder.transform(self.memory_tagger.predict([words[i+1]]))[0]
                posp = pos_default(pos[i+1])
            else:
                wp = self.tag_encoder.transform(['O'])[0]
                posp = pos_default(".")
            # Left context: memory tag and POS of the previous word,
            # unless the previous token is a full stop or we are at the start.
            if i > 0:
                if words[i-1] != ".":
                    wm = self.tag_encoder.transform(self.memory_tagger.predict([words[i-1]]))[0]
                    posm = pos_default(pos[i-1])
                else:
                    wm = self.tag_encoder.transform(['O'])[0]
                    posm = pos_default(".")
            else:
                posm = pos_default(".")
                wm = self.tag_encoder.transform(['O'])[0]
            # Word-shape features, memory tag and POS of the word, then the context.
            out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w), w.isdigit(), w.isalpha(),
                                 self.tag_encoder.transform(self.memory_tagger.predict([w]))[0],
                                 pos_default(p), wp, wm, posp, posm]))
        return out
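Stripped of the encoders and the POS tags, the context part of this transformer boils down to looking up the memory tag of the previous and the next word. A minimal sketch of just that part, with a made-up lookup table `toy_memory` and a hypothetical helper `context_tags`:

```python
def context_tags(words, memory, default="O"):
    """For each token, return the (previous, current, next) memory tags."""
    rows = []
    for i, word in enumerate(words):
        current = memory.get(word, default)
        # No left context at the start or right after a full stop.
        if i > 0 and words[i - 1] != ".":
            previous = memory.get(words[i - 1], default)
        else:
            previous = default
        # No right context after the last token.
        if i < len(words) - 1:
            following = memory.get(words[i + 1], default)
        else:
            following = default
        rows.append((previous, current, following))
    return rows

toy_memory = {"Hyde": "B-geo", "Park": "I-geo"}  # made-up lookup table
toy_sentence = ["in", "Hyde", "Park", "."]
toy_rows = context_tags(toy_sentence, toy_memory)
for word, row in zip(toy_sentence, toy_rows):
    print(word, row)
```

For "Park" this yields `('B-geo', 'I-geo', 'O')`, so the classifier can see that the previous word was already remembered as the start of a location.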
from sklearn.pipeline import Pipeline
pred = cross_val_predict(Pipeline([("feature_map", FeatureTransformer()),
                                   ("clf", RandomForestClassifier(n_estimators=20, n_jobs=3))]),
                         X=data, y=tags, cv=5)
report = classification_report(y_pred=pred, y_true=tags)
print(report)
             precision    recall  f1-score   support
      B-art       0.17      0.08      0.11       402
      B-eve       0.40      0.28      0.33       308
      B-geo       0.83      0.85      0.84     37644
      B-gpe       0.98      0.93      0.95     15870
      B-nat       0.20      0.23      0.22       201
      B-org       0.73      0.64      0.68     20143
      B-per       0.82      0.75      0.78     16990
      B-tim       0.89      0.80      0.84     20333
      I-art       0.03      0.01      0.01       297
      I-eve       0.28      0.13      0.18       253
      I-geo       0.76      0.67      0.71      7414
      I-gpe       0.78      0.47      0.59       198
      I-nat       0.38      0.22      0.28        51
      I-org       0.73      0.67      0.70     16784
      I-per       0.85      0.74      0.79     17251
      I-tim       0.81      0.53      0.64      6528
          O       0.98      0.99      0.99    887908
avg / total       0.96      0.96      0.96   1048575
This improved the results a bit, but they are still not very convincing. In the next post, I will show how to do better with more sophisticated algorithms.