Introduction to named entity recognition in Python

In this post, I will introduce you to Named Entity Recognition (NER). NER is a part of natural language processing (NLP) and information retrieval (IR). The task in NER is to assign an entity type to each word in a text. Entities can be, for example, locations, time expressions or names. If you want to run the tutorial yourself, you can find the dataset here. Now we load it and peek at a few examples.

import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")

# Forward-fill missing values; fillna(method="ffill") is deprecated in recent pandas versions.
data = data.ffill()

data.tail(10)

	Sentence #	Word	POS	Tag
1048565	Sentence: 47958	impact	NN	O
1048566	Sentence: 47958	.	.	O
1048567	Sentence: 47959	Indian	JJ	B-gpe
1048568	Sentence: 47959	forces	NNS	O
1048569	Sentence: 47959	said	VBD	O
1048570	Sentence: 47959	they	PRP	O
1048571	Sentence: 47959	responded	VBD	O
1048572	Sentence: 47959	to	TO	O
1048573	Sentence: 47959	the	DT	O
1048574	Sentence: 47959	attack	NN	O

words = list(set(data["Word"].values))

n_words = len(words); n_words

So we have 47959 sentences containing 35178 different words.

We start by writing a small class to retrieve a sentence from the dataset.

class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
    
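    def get_next(self):
        # Select all rows that belong to the current sentence.
        s = self.data[self.data["Sentence #"] == "Sentence: {}".format(self.n_sent)]
        if s.empty:
            # Filtering never raises, so an exhausted dataset must be detected explicitly.
            self.empty = True
            return None, None, None
        self.n_sent += 1
        return s["Word"].values.tolist(), s["POS"].values.tolist(), s["Tag"].values.tolist()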
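
The class above filters the whole table on every call. As a rough alternative sketch, the rows can be grouped by sentence in a single pass. The example uses plain tuples and a hypothetical helper `group_sentences` instead of the DataFrame, and it assumes that the rows of one sentence are adjacent, as they are in the excerpt shown earlier.

```python
from itertools import groupby
from operator import itemgetter

# Toy rows in the same column order as the dataset: (sentence id, word, POS, tag).
rows = [
    ("Sentence: 1", "Indian", "JJ", "B-gpe"),
    ("Sentence: 1", "forces", "NNS", "O"),
    ("Sentence: 2", "They", "PRP", "O"),
    ("Sentence: 2", "responded", "VBD", "O"),
    ("Sentence: 2", ".", ".", "O"),
]


def group_sentences(rows):
    """Yield (words, pos_tags, ner_tags) per sentence; assumes rows are ordered by sentence."""
    for _, group in groupby(rows, key=itemgetter(0)):
        # zip(*group) turns the rows of one sentence into four columns.
        _, words, pos, tags = zip(*group)
        yield list(words), list(pos), list(tags)


sentences = list(group_sentences(rows))
print(sentences[0])
```

Grouping once avoids rescanning about a million rows for each of the 47959 sentences, at the cost of holding all sentences in memory if the generator is turned into a list.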

getter = SentenceGetter(data)

sent, pos, tag = getter.get_next()

This is how a sentence looks.

print(sent); print(pos); print(tag)

['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament', 'to', 'a', 'rally', 'in', 'Hyde', 'Park', '.']
['PRP', 'VBD', 'IN', 'DT', 'NNS', 'IN', 'NN', 'TO', 'DT', 'NN', 'IN', 'NNP', 'NNP', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']
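
The tags in this output look like the common BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. Assuming that reading is right, a small hypothetical helper (`extract_entities`, not part of the original code) can collect the tagged spans:

```python
def extract_entities(words, tags):
    """Collect (entity_text, entity_type) pairs from BIO-style tags."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; store the one in progress, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the current entity.
            current_words.append(word)
        else:
            # 'O' or a stray I- tag ends the current entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


sent = ['They', 'marched', 'from', 'the', 'Houses', 'of', 'Parliament', 'to',
        'a', 'rally', 'in', 'Hyde', 'Park', '.']
tag = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'I-geo', 'O']
print(extract_entities(sent, tag))  # [('Hyde Park', 'geo')]
```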

A simple first idea, and a useful baseline, is to remember the most common named-entity tag for every word and predict that. If we don’t know a word, we just predict ‘O’. The following class does that. It inherits from scikit-learn base classes so that we can use it with the built-in cross-validation.

from sklearn.base import BaseEstimator, TransformerMixin


class MemoryTagger(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y):
        '''
        Expects a list of words as X and a list of tags as y.
        '''
        voc = {}
        self.tags = []
        for x, t in zip(X, y):
            if t not in self.tags:
                self.tags.append(t)
            if x in voc:
                if t in voc[x]:
                    voc[x][t] += 1
                else:
                    voc[x][t] = 1
            else:
                voc[x] = {t: 1}
        self.memory = {}
        for k, d in voc.items():
            self.memory[k] = max(d, key=d.get)
        # Return self so the estimator follows the scikit-learn fit convention.
        return self
    
    def predict(self, X, y=None):
        '''
        Predict the tag from memory. If word is unknown, predict 'O'.
        '''
        return [self.memory.get(x, 'O') for x in X]
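
For comparison, the same "most frequent tag per word" idea can be written more compactly with the standard library. This is only a sketch of the counting logic on made-up toy data. The names `fit_memory` and `predict_memory` are hypothetical, and the functions do not plug into scikit-learn.

```python
from collections import Counter, defaultdict


def fit_memory(words, tags):
    """Map every word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def predict_memory(memory, words, default="O"):
    """Look up each word; unknown words get the default tag."""
    return [memory.get(word, default) for word in words]


train_words = ["London", "is", "big", "London", "Eye", "London"]
train_tags = ["B-geo", "O", "O", "B-org", "I-org", "B-geo"]
memory = fit_memory(train_words, train_tags)
print(predict_memory(memory, ["London", "is", "foggy"]))  # ['B-geo', 'O', 'O']
```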

tagger = MemoryTagger()

tagger.fit(sent, tag)

print(tagger.predict(sent))

['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

tagger.tags

['O', 'B-geo', 'B-gpe']

Okay, it looks like it basically works. Now we do a 5-fold cross-validation.

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

words = data["Word"].values.tolist()
tags = data["Tag"].values.tolist()

pred = cross_val_predict(estimator=MemoryTagger(), X=words, y=tags, cv=5)
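
Roughly, `cross_val_predict` splits the data into folds, fits a fresh copy of the estimator on all folds but one, and predicts the held-out fold, so that every sample gets a prediction from a model that never saw it. The pure-Python sketch below illustrates that idea with contiguous, unshuffled folds. It is a simplification, not scikit-learn's implementation, and `k_fold_predict`, `make_model` and `MajorityTagger` are hypothetical names.

```python
def k_fold_predict(make_model, X, y, k=5):
    """Return one out-of-fold prediction per sample, using k contiguous folds."""
    n = len(X)
    # Fold sizes: the first n % k folds get one extra sample.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    predictions = [None] * n
    start = 0
    for size in sizes:
        stop = start + size
        # Train on everything outside the held-out slice [start, stop).
        X_train, y_train = X[:start] + X[stop:], y[:start] + y[stop:]
        model = make_model()  # fresh model for every fold
        model.fit(X_train, y_train)
        predictions[start:stop] = model.predict(X[start:stop])
        start = stop
    return predictions


class MajorityTagger:
    """Toy model: always predicts the most frequent training tag."""

    def fit(self, X, y):
        self.tag = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.tag for _ in X]


toy_words = ["a", "b", "c", "d", "e", "f"]
toy_tags = ["O", "O", "B-geo", "O", "O", "O"]
oof = k_fold_predict(MajorityTagger, toy_words, toy_tags, k=3)
print(oof)
```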

We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the f1-score. These metrics are common in NLP tasks. If you are not familiar with them, check out the Wikipedia articles.
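
As a quick refresher: for a single tag, precision is the share of predictions of that tag that are correct, recall is the share of true occurrences that were found, and the f1-score is the harmonic mean of the two. Here is a minimal sketch on made-up toy data (`prf_for_tag` is a hypothetical helper, not a scikit-learn function):

```python
def prf_for_tag(y_true, y_pred, tag):
    """Precision, recall and f1 for one tag, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == tag and p == tag for t, p in pairs)  # correctly predicted
    fp = sum(t != tag and p == tag for t, p in pairs)  # predicted, but wrong
    fn = sum(t == tag and p != tag for t, p in pairs)  # missed occurrences
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["B-geo", "O", "B-geo", "O", "B-geo", "O"]
y_pred = ["B-geo", "O", "O", "O", "O", "B-geo"]
precision, recall, f1 = prf_for_tag(y_true, y_pred, "B-geo")
print(precision, recall, f1)
```

In this toy example one of two `B-geo` predictions is correct (precision 0.5), and one of three true `B-geo` tokens is found (recall about 0.33).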

report = classification_report(y_pred=pred, y_true=tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.23      0.06      0.10       402
      B-eve       0.50      0.25      0.33       308
      B-geo       0.78      0.84      0.81     37644
      B-gpe       0.94      0.93      0.94     15870
      B-nat       0.41      0.28      0.33       201
      B-org       0.66      0.49      0.56     20143
      B-per       0.79      0.64      0.71     16990
      B-tim       0.87      0.77      0.82     20333
      I-art       0.04      0.01      0.01       297
      I-eve       0.36      0.12      0.18       253
      I-geo       0.73      0.58      0.65      7414
      I-gpe       0.61      0.45      0.52       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.69      0.53      0.60     16784
      I-per       0.73      0.66      0.69     17251
      I-tim       0.56      0.13      0.21      6528
          O       0.97      0.99      0.98    887908

avg / total       0.94      0.95      0.94   1048575

This does not look too bad! The precision is quite reasonable, but as you might have guessed, the recall is pretty weak. This is because every word we have not seen during training is tagged ‘O’, so entities made of unknown words are missed. To overcome this issue, we will now introduce a simple machine learning model to predict the named entities. To achieve this, we convert the data to a simple feature vector for every word and then use a random forest to classify the words.

from sklearn.ensemble import RandomForestClassifier

The simplest feature map only contains information about the word itself.

def feature_map(word):
    '''Simple feature map.'''
    return np.array([word.istitle(), word.islower(), word.isupper(), len(word), word.isdigit(), word.isalpha()])

words = [feature_map(w) for w in data["Word"].values.tolist()]
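
To get a feeling for what these features capture, the sketch below recomputes them as plain tuples and groups a few made-up example words by their feature vector. `shape_features` is a hypothetical stand-in for `feature_map` that runs without NumPy. Words with the same shape become indistinguishable to the classifier.

```python
from collections import defaultdict


def shape_features(word):
    """Same six word-shape features as feature_map, as a plain tuple."""
    return (word.istitle(), word.islower(), word.isupper(),
            len(word), word.isdigit(), word.isalpha())


examples = ["London", "Friday", "Obama", "Paris", "attack", "2017"]
by_features = defaultdict(list)
for word in examples:
    by_features[shape_features(word)].append(word)

# Each line shows one feature vector and all example words that share it.
for features, group in by_features.items():
    print(features, group)
```

Here "London" and "Friday" end up with identical features, as do "Obama" and "Paris", although they would typically carry different entity types.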

pred = cross_val_predict(RandomForestClassifier(n_estimators=20),
                         X=words, y=tags, cv=5)

report = classification_report(y_pred=pred, y_true=tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00       402
      B-eve       0.00      0.00      0.00       308
      B-geo       0.26      0.80      0.40     37644
      B-gpe       0.25      0.04      0.07     15870
      B-nat       0.00      0.00      0.00       201
      B-org       0.65      0.17      0.27     20143
      B-per       0.97      0.20      0.33     16990
      B-tim       0.29      0.32      0.30     20333
      I-art       0.00      0.00      0.00       297
      I-eve       0.00      0.00      0.00       253
      I-geo       0.00      0.00      0.00      7414
      I-gpe       0.00      0.00      0.00       198
      I-nat       0.00      0.00      0.00        51
      I-org       0.36      0.03      0.06     16784
      I-per       0.47      0.02      0.04     17251
      I-tim       0.50      0.06      0.11      6528
          O       0.97      0.98      0.97    887908

avg / total       0.88      0.87      0.86   1048575

Wow, that looks really bad. This is expected, since the features lack a lot of the information necessary for the decision. So now we enhance our simple features, on the one hand with the memory tagger and on the other hand with context information.

from sklearn.preprocessing import LabelEncoder

class FeatureTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.memory_tagger = MemoryTagger()
        self.tag_encoder = LabelEncoder()
        self.pos_encoder = LabelEncoder()
        
    def fit(self, X, y):
        words = X["Word"].values.tolist()
        pos = X["POS"].values.tolist()
        tags = X["Tag"].values.tolist()
        self.memory_tagger.fit(words, tags)
        self.tag_encoder.fit(tags)
        self.pos_encoder.fit(pos)
        # Keep the known POS tags in a set so the membership test in transform is fast.
        self.pos = set(pos)
        return self
    
    def transform(self, X, y=None):
        def pos_default(p):
            if p in self.pos:
                return self.pos_encoder.transform([p])[0]
            else:
                return -1
        
        pos = X["POS"].values.tolist()
        words = X["Word"].values.tolist()
        out = []
        for i in range(len(words)):
            w = words[i]
            p = pos[i]
            if i < len(words) - 1:
                wp = self.tag_encoder.transform(self.memory_tagger.predict([words[i+1]]))[0]
                posp = pos_default(pos[i+1])
            else:
                wp = self.tag_encoder.transform(['O'])[0]
                posp = pos_default(".")
            if i > 0:
                if words[i-1] != ".":
                    wm = self.tag_encoder.transform(self.memory_tagger.predict([words[i-1]]))[0]
                    posm = pos_default(pos[i-1])
                else:
                    wm = self.tag_encoder.transform(['O'])[0]
                    posm = pos_default(".")
            else:
                posm = pos_default(".")
                wm = self.tag_encoder.transform(['O'])[0]
            out.append(np.array([w.istitle(), w.islower(), w.isupper(), len(w), w.isdigit(), w.isalpha(),
                                 self.tag_encoder.transform(self.memory_tagger.predict([w]))[0],
                                 pos_default(p), wp, wm, posp, posm]))
        return out
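
Stripped of the encoders, the context part of the transformer boils down to looking up the memorized tag of the previous and the next word and padding with 'O' at the boundaries. Below is a simplified sketch of just that step. A hand-written `memory` dictionary stands in for a fitted `MemoryTagger`, the names are illustrative only, and the special case for a preceding "." is left out.

```python
def context_tags(words, memory, default="O"):
    """For each word, return (previous tag, own tag, next tag) from memory."""
    guessed = [memory.get(w, default) for w in words]
    rows = []
    for i, own in enumerate(guessed):
        # Pad with the default tag at the start and end of the sequence.
        prev_tag = guessed[i - 1] if i > 0 else default
        next_tag = guessed[i + 1] if i < len(words) - 1 else default
        rows.append((prev_tag, own, next_tag))
    return rows


memory = {"Hyde": "B-geo", "Park": "I-geo"}  # hypothetical memorized tags
tokens = ["in", "Hyde", "Park", "."]
context = context_tags(tokens, memory)
for word, row in zip(tokens, context):
    print(word, row)
```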

from sklearn.pipeline import Pipeline

pred = cross_val_predict(Pipeline([("feature_map", FeatureTransformer()), 
                                   ("clf", RandomForestClassifier(n_estimators=20, n_jobs=3))]),
                         X=data, y=tags, cv=5)

report = classification_report(y_pred=pred, y_true=tags)
print(report)

             precision    recall  f1-score   support

      B-art       0.17      0.08      0.11       402
      B-eve       0.40      0.28      0.33       308
      B-geo       0.83      0.85      0.84     37644
      B-gpe       0.98      0.93      0.95     15870
      B-nat       0.20      0.23      0.22       201
      B-org       0.73      0.64      0.68     20143
      B-per       0.82      0.75      0.78     16990
      B-tim       0.89      0.80      0.84     20333
      I-art       0.03      0.01      0.01       297
      I-eve       0.28      0.13      0.18       253
      I-geo       0.76      0.67      0.71      7414
      I-gpe       0.78      0.47      0.59       198
      I-nat       0.38      0.22      0.28        51
      I-org       0.73      0.67      0.70     16784
      I-per       0.85      0.74      0.79     17251
      I-tim       0.81      0.53      0.64      6528
          O       0.98      0.99      0.99    887908

avg / total       0.96      0.96      0.96   1048575

This improved the results a bit, but they are still not very convincing. In the next post, I will show how to do better with more sophisticated algorithms.

2017-08-26 | Tobias Sterbak
