This is the second post in my series about named entity recognition. If you haven’t seen the first one, have a look now. Last time we started by memorizing entities for words and then used a simple classification model to improve the results a bit. This model also used context properties and the structure of the word in question. But the results were not overwhelmingly good, so now we’re going to look into a more sophisticated algorithm, a so-called conditional random field (CRF).
We denote by $x = (x_1,\dots, x_m)$ the input sequence, i.e. the words of a sentence, and by $s = (s_1,\dots, s_m)$ the sequence of output states, i.e. the named entity tags. In conditional random fields we model the conditional probability
$$p(s_1,\dots,s_m|x_1,\dots,x_m).$$
We do this by defining a feature map
$$\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d$$
that maps an entire input sequence $x$ paired with an entire state sequence $s$ to some $d$-dimensional feature vector. Then we can model the probability as a log-linear model with the parameter vector $w\in\mathbb{R}^d$
$$p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},$$
where $s^\prime$ ranges over all possible output sequences. For the estimation of $w$, we assume that we have a set of $n$ labeled examples $\{(x^i, s^i)\}_{i=1}^n$. Now we define the regularized log-likelihood function $L$
$$L(w) = \sum_{i=1}^n \log p(s^i|x^i; w) - \frac{\lambda_2}{2}|w|_2^2 - \lambda_1 |w|_1.$$
The terms $\frac{\lambda_2}{2}|w|_2^2$ and $\lambda_1 |w|_1$ force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as regularization. The parameters $\lambda_2$ and $\lambda_1$ let us control how much regularization is applied. The parameter vector $w^*$ is then estimated as
$$w^* = \text{arg max}_{w\in \mathbb{R}^d} L(w)$$
Once we have estimated the vector $w^*$, we can find the most likely tag sequence $s^*$ for a sentence $x$ by
$$s^* = \text{arg max}_{s} p(s|x; w^*).$$
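In practice one uses a linear-chain CRF, where the global feature map decomposes into a sum of local feature maps that only depend on the current tag, the previous tag and the whole input sequence (this is the setting described in Collins’ notes and also what the CRFsuite-based implementation used below assumes):

$$\Phi(x, s) = \sum_{i=1}^m \phi(x, i, s_{i-1}, s_i).$$

This structure is what makes the model tractable: the normalization sum over all $s^\prime$ and the arg max over tag sequences can then be computed with dynamic programming (the forward algorithm and Viterbi) instead of enumerating exponentially many sequences.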
For more details we refer to the notes by M. Collins [http://www.cs.columbia.edu/~mcollins/crf.pdf].
Load the dataset
If you want to run the tutorial yourself, you can find the dataset here.
Now we want to apply this model. Let’s start by loading the data.
import pandas as pd
import numpy as np
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
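The last ten rows of the dataframe then look like this (rendered here via data.tail(10); the exact call is my assumption, any look at the dataframe shows the same structure):

data.tail(10)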
|  | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 1048565 | Sentence: 47958 | impact | NN | O |
| 1048566 | Sentence: 47958 | . | . | O |
| 1048567 | Sentence: 47959 | Indian | JJ | B-gpe |
| 1048568 | Sentence: 47959 | forces | NNS | O |
| 1048569 | Sentence: 47959 | said | VBD | O |
| 1048570 | Sentence: 47959 | they | PRP | O |
| 1048571 | Sentence: 47959 | responded | VBD | O |
| 1048572 | Sentence: 47959 | to | TO | O |
| 1048573 | Sentence: 47959 | the | DT | O |
| 1048574 | Sentence: 47959 | attack | NN | O |
words = list(set(data["Word"].values))
n_words = len(words); n_words
35178
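The number of sentences is stated in the text below but not computed in the snippet above; one way to obtain it (an assumed, but standard, pandas call) is:

n_sentences = data["Sentence #"].nunique()  # 47959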
So we have 47959 sentences containing 35178 different words. We change the SentenceGetter class from the last post a little and use it to retrieve sentences with their labels.
class SentenceGetter(object):
    """Groups the dataframe by sentence and returns one sentence at a time."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
getter = SentenceGetter(data)
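We can grab a single sentence with the get_next method (my guess at the call that produced the output below):

sent = getter.get_next()
print(sent)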
This is how a sentence looks now.
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
Okay, that looks as expected; now we get all sentences.
sentences = getter.sentences
Craft features
Now we craft a set of features and prepare the dataset.
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # features of the current word
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # features of the previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent)-1:
        # features of the next word
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
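As a quick sanity check (not part of the original walkthrough), we can print the feature dictionary generated for the first token of the first sentence:

print(sent2features(sentences[0])[0])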
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
Fit the CRF
Now we can initialize the algorithm. We use the conditional random field (CRF) implementation provided by sklearn-crfsuite.
from sklearn_crfsuite import CRF
crf = CRF(algorithm='lbfgs',
c1=0.1,
c2=0.1,
max_iterations=100,
all_possible_transitions=False)
Okay, let’s see if it works. Like last time, we perform a 5-fold cross-validation.
Evaluate the model
from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation in older scikit-learn versions
from sklearn_crfsuite.metrics import flat_classification_report
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the F1 score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
precision recall f1-score support
B-art 0.37 0.11 0.17 402
B-eve 0.52 0.35 0.42 308
B-geo 0.85 0.90 0.88 37644
B-gpe 0.97 0.94 0.95 15870
B-nat 0.66 0.37 0.47 201
B-org 0.78 0.72 0.75 20143
B-per 0.84 0.81 0.82 16990
B-tim 0.93 0.88 0.90 20333
I-art 0.11 0.03 0.04 297
I-eve 0.34 0.21 0.26 253
I-geo 0.82 0.79 0.80 7414
I-gpe 0.92 0.55 0.69 198
I-nat 0.61 0.27 0.38 51
I-org 0.81 0.79 0.80 16784
I-per 0.84 0.89 0.87 17251
I-tim 0.83 0.76 0.80 6528
O 0.99 0.99 0.99 887908
avg / total 0.97 0.97 0.97 1048575
This looks like a good start. We easily beat the results from the last post.
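Note that cross_val_predict fits internal copies of the estimator, so crf itself is still unfitted at this point. To be able to inspect the model with eli5 below, we fit it once on the full dataset; the CRF object printed next is simply the return value of this call:

crf.fit(X, y)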
CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)
Inspect the model
The nice thing about CRFs is that we can look into the algorithm and visualize the learned transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform the investigation.
import eli5

eli5.show_weights(crf, top=30)
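show_weights renders an HTML table with the learned transition weights between tags and the top positive and negative features per tag, which only displays inside a Jupyter notebook. Outside a notebook, a plain-text view can be obtained with eli5’s explain/format functions (assuming the standard eli5 API):

expl = eli5.explain_weights(crf, top=30)
print(eli5.format_as_text(expl))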
Improve the model with regularization
Puh, it looks like the CRF is mostly memorizing a lot of words. For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. To overcome this issue we can tune the parameters, especially the regularization parameters of the CRF algorithm. The $c_1$ and $c_2$ parameters of the CRF algorithm are the regularization parameters $\lambda_1$ and $\lambda_2$: $c_1$ weights the $l_1$ regularization and $c_2$ weights the $l_2$ regularization. We now limit the number of features used by enforcing sparsity on the parameter vector $w$. To do this we increase the $l_1$-regularization parameter $c_1$.
crf = CRF(algorithm='lbfgs',
c1=10,
c2=0.1,
max_iterations=100,
all_possible_transitions=False)
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
precision recall f1-score support
B-art 0.00 0.00 0.00 402
B-eve 0.80 0.27 0.40 308
B-geo 0.82 0.90 0.86 37644
B-gpe 0.95 0.92 0.94 15870
B-nat 0.69 0.09 0.16 201
B-org 0.78 0.67 0.72 20143
B-per 0.80 0.76 0.78 16990
B-tim 0.93 0.83 0.88 20333
I-art 0.00 0.00 0.00 297
I-eve 0.64 0.12 0.20 253
I-geo 0.81 0.73 0.77 7414
I-gpe 0.93 0.37 0.53 198
I-nat 0.00 0.00 0.00 51
I-org 0.75 0.76 0.75 16784
I-per 0.80 0.90 0.85 17251
I-tim 0.84 0.67 0.74 6528
O 0.99 0.99 0.99 887908
avg / total 0.96 0.97 0.96 1048575
This looks quite nice.
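As before, we refit the re-regularized model on the full dataset so that we can inspect it with eli5; the CRF object printed below is the return value of this call:

crf.fit(X, y)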
CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=10, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)
Now we look again at the features.
eli5.show_weights(crf, top=30)
As expected, we see that the model relies less on individual words and more on the context, since context features generalize better across training instances. This is an effect of the $l_1$ regularization.
This is it for this time, but stay tuned for the next post, where we will look at named entity recognition with recurrent neural networks.