This is the second post in my series about named entity recognition. If you haven’t seen the first one, have a look now. Last time we started by memorizing entities for words and then used a simple classification model to improve the results a bit. This model also used context properties and the structure of the word in question. But the results were not overwhelmingly good, so now we’re going to look into a more sophisticated algorithm, a so-called conditional random field (CRF).
We denote by $x = (x_1,\dots, x_m)$ the input sequence, i.e. the words of a sentence, and by $s = (s_1,\dots, s_m)$ the sequence of output states, i.e. the named entity tags. In conditional random fields we model the conditional probability
$$p(s_1,\dots,s_m|x_1,\dots,x_m).$$
We do this by defining a feature map
$$\Phi(x_1,\dots,x_m,s_1,\dots,s_m)\in\mathbb{R}^d$$
that maps an entire input sequence $x$ paired with an entire state sequence $s$ to some $d$-dimensional feature vector. Then we can model the probability as a log-linear model with the parameter vector $w\in\mathbb{R}^d$
$$p(s|x; w) = \frac{\exp(w\cdot\Phi(x, s))}{\sum_{s^\prime} \exp(w\cdot\Phi(x, s^\prime))},$$
where $s^\prime$ ranges over all possible output sequences. For the estimation of $w$, we assume that we have a set of $n$ labeled examples $\{(x^i, s^i)\}_{i=1}^n$. Now we define the regularized log-likelihood function $L$
$$L(w) = \sum_{i=1}^n \log p(s^i|x^i; w) - \frac{\lambda_2}{2}|w|_2^2 - \lambda_1 |w|_1.$$
The terms $\frac{\lambda_2}{2}|w|_2^2$ and $\lambda_1 |w|_1$ force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as regularization. The parameters $\lambda_2$ and $\lambda_1$ let us control how much regularization is applied. The parameter vector $w^*$ is then estimated as
$$w^* = \text{arg max}_{w\in \mathbb{R}^d} L(w)$$
Once we have estimated the vector $w^*$, we can find the most likely tag sequence $s^*$ for a sentence $x$ by
$$s^* = \text{arg max}_{s} p(s|x; w^*).$$
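In practice one uses a linear-chain CRF, where the global feature map decomposes into a sum of local feature maps that only depend on the current tag, the previous tag and the whole input sequence (this is the setting described in Collins’ notes and also what the CRFsuite-based implementation used below assumes):

$$\Phi(x, s) = \sum_{i=1}^m \phi(x, i, s_{i-1}, s_i).$$

This structure is what makes the model tractable: the normalization sum over all $s^\prime$ and the arg max over tag sequences can then be computed with dynamic programming (the forward algorithm and Viterbi) instead of enumerating exponentially many sequences.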
For more details we refer to the notes by M. Collins [http://www.cs.columbia.edu/~mcollins/crf.pdf].
Load the dataset
If you want to run the tutorial yourself, you can find the dataset here.
Now we want to apply this model. Let’s start by loading the data.
import pandas as pd
import numpy as np
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
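The last ten rows of the dataframe then look like this (rendered here via data.tail(10); the exact call is my assumption, any look at the dataframe shows the same structure):

data.tail(10)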
|  | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 1048565 | Sentence: 47958 | impact | NN | O |
| 1048566 | Sentence: 47958 | . | . | O |
| 1048567 | Sentence: 47959 | Indian | JJ | B-gpe |
| 1048568 | Sentence: 47959 | forces | NNS | O |
| 1048569 | Sentence: 47959 | said | VBD | O |
| 1048570 | Sentence: 47959 | they | PRP | O |
| 1048571 | Sentence: 47959 | responded | VBD | O |
| 1048572 | Sentence: 47959 | to | TO | O |
| 1048573 | Sentence: 47959 | the | DT | O |
| 1048574 | Sentence: 47959 | attack | NN | O |
words = list(set(data["Word"].values))
n_words = len(words); n_words
35178
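The number of sentences is stated in the text below but not computed in the snippet above; one way to obtain it (an assumed, but standard, pandas call) is:

n_sentences = data["Sentence #"].nunique()  # 47959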
So we have 47959 sentences containing 35178 different words. We change the SentenceGetter class from the last post a little and use it to retrieve sentences with their labels.
class SentenceGetter(object):
    """Groups the dataframe by sentence and returns one sentence at a time."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
getter = SentenceGetter(data)
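We can grab a single sentence with the get_next method (my guess at the call that produced the output below):

sent = getter.get_next()
print(sent)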
This is how a sentence looks now.
[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
Okay, that looks as expected; now we get all sentences.
sentences = getter.sentences
Craft features
Now we craft a set of features and prepare the dataset.
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # features of the current word
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # features of the previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent)-1:
        # features of the next word
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
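As a quick sanity check (not part of the original walkthrough), we can print the feature dictionary generated for the first token of the first sentence:

print(sent2features(sentences[0])[0])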
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
Fit the CRF
Now we can initialize the algorithm. We use the conditional random field (CRF) implementation provided by sklearn-crfsuite.
from sklearn_crfsuite import CRF
crf = CRF(algorithm='lbfgs',
c1=0.1,
c2=0.1,
max_iterations=100,
all_possible_transitions=False)
Okay, let’s see if it works. Like last time, we perform a 5-fold cross-validation.
Evaluate the model
from sklearn.model_selection import cross_val_predict  # sklearn.cross_validation in older scikit-learn versions
from sklearn_crfsuite.metrics import flat_classification_report
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the F1 score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
precision recall f1-score support
B-art 0.37 0.11 0.17 402
B-eve 0.52 0.35 0.42 308
B-geo 0.85 0.90 0.88 37644
B-gpe 0.97 0.94 0.95 15870
B-nat 0.66 0.37 0.47 201
B-org 0.78 0.72 0.75 20143
B-per 0.84 0.81 0.82 16990
B-tim 0.93 0.88 0.90 20333
I-art 0.11 0.03 0.04 297
I-eve 0.34 0.21 0.26 253
I-geo 0.82 0.79 0.80 7414
I-gpe 0.92 0.55 0.69 198
I-nat 0.61 0.27 0.38 51
I-org 0.81 0.79 0.80 16784
I-per 0.84 0.89 0.87 17251
I-tim 0.83 0.76 0.80 6528
O 0.99 0.99 0.99 887908
avg / total 0.97 0.97 0.97 1048575
This looks like a good start. We easily beat the results from the last post.
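Note that cross_val_predict fits internal copies of the estimator, so crf itself is still unfitted at this point. To be able to inspect the model with eli5 below, we fit it once on the full dataset; the CRF object printed next is simply the return value of this call:

crf.fit(X, y)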
CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)
Inspect the model
The nice thing about CRFs is that we can look into the algorithm and visualize the learned transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform the investigation.
import eli5

eli5.show_weights(crf, top=30)
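show_weights renders an HTML table with the learned transition weights between tags and the top positive and negative features per tag, which only displays inside a Jupyter notebook. Outside a notebook, a plain-text view can be obtained with eli5’s explain/format functions (assuming the standard eli5 API):

expl = eli5.explain_weights(crf, top=30)
print(eli5.format_as_text(expl))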
Improve the model with regularization
Puh, it looks like the CRF is mostly memorizing a lot of words. For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. To overcome this issue we can tune the parameters, especially the regularization parameters of the CRF algorithm. The $c_1$ and $c_2$ parameters of the CRF algorithm are the regularization parameters $\lambda_1$ and $\lambda_2$: $c_1$ weights the $l_1$ regularization and $c_2$ weights the $l_2$ regularization. We now limit the number of features used by enforcing sparsity on the parameter vector $w$. To do this we increase the $l_1$-regularization parameter $c_1$.
crf = CRF(algorithm='lbfgs',
c1=10,
c2=0.1,
max_iterations=100,
all_possible_transitions=False)
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)
precision recall f1-score support
B-art 0.00 0.00 0.00 402
B-eve 0.80 0.27 0.40 308
B-geo 0.82 0.90 0.86 37644
B-gpe 0.95 0.92 0.94 15870
B-nat 0.69 0.09 0.16 201
B-org 0.78 0.67 0.72 20143
B-per 0.80 0.76 0.78 16990
B-tim 0.93 0.83 0.88 20333
I-art 0.00 0.00 0.00 297
I-eve 0.64 0.12 0.20 253
I-geo 0.81 0.73 0.77 7414
I-gpe 0.93 0.37 0.53 198
I-nat 0.00 0.00 0.00 51
I-org 0.75 0.76 0.75 16784
I-per 0.80 0.90 0.85 17251
I-tim 0.84 0.67 0.74 6528
O 0.99 0.99 0.99 887908
avg / total 0.96 0.97 0.96 1048575
This looks quite nice.
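As before, we refit the re-regularized model on the full dataset so that we can inspect it with eli5; the CRF object printed below is the return value of this call:

crf.fit(X, y)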
CRF(algorithm='lbfgs', all_possible_states=None,
all_possible_transitions=False, averaging=None, c=None, c1=10, c2=0.1,
calibration_candidates=None, calibration_eta=None,
calibration_max_trials=None, calibration_rate=None,
calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
max_linesearch=None, min_freq=None, model_filename=None,
num_memories=None, pa_type=None, period=None, trainer_cls=None,
variance=None, verbose=False)
Now we look again at the features.
eli5.show_weights(crf, top=30)
As expected, we see that the model relies less on individual words and more on the context, since context features generalize better across training instances. This is an effect of the $l_1$ regularization.
This is it for this time, but stay tuned for the next post, where we will look at named entity recognition with recurrent neural networks.