Named entity recognition with conditional random fields in Python
This is the second post in my series about named entity recognition. If you haven’t seen the first one, have a look now. Last time we started by memorizing entities for words and then used a simple classification model to improve the results a bit. This model also used context properties and the structure of the word in question. But the results were not overwhelmingly good, so now we’re going to look into a more sophisticated algorithm, a so-called conditional random field (CRF).
We denote by $x = (x_1, \dots, x_m)$ the input sequence, i.e. the words of a sentence, and by $s = (s_1, \dots, s_m)$ the sequence of output states, i.e. the named entity tags. In conditional random fields we model the conditional probability
$$p(s_1, \dots, s_m \mid x_1, \dots, x_m).$$
We do this by defining a feature map
$$\Phi(x_1, \dots, x_m, s_1, \dots, s_m) \in \mathbb{R}^d$$
that maps an entire input sequence $x$ paired with an entire state sequence $s$ to some $d$-dimensional feature vector. Then we can model the probability as a log-linear model with the parameter vector $w \in \mathbb{R}^d$:
$$p(s \mid x; w) = \frac{\exp(w \cdot \Phi(x, s))}{\sum_{s'} \exp(w \cdot \Phi(x, s'))},$$
where $s'$ ranges over all possible output sequences. For the estimation of $w$, we assume that we have a set of $n$ labeled examples $\{(x^i, s^i)\}_{i=1}^{n}$. Now we define the regularized log-likelihood function
$$L(w) = \sum_{i=1}^{n} \log p(s^i \mid x^i; w) - \frac{\lambda_2}{2} \|w\|_2^2 - \lambda_1 \|w\|_1.$$
The terms $\frac{\lambda_2}{2} \|w\|_2^2$ and $\lambda_1 \|w\|_1$ force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as regularization. The parameters $\lambda_2$ and $\lambda_1$ control how strongly each penalty is enforced. The parameter vector $w^*$ is then estimated as
$$w^* = \arg\max_{w \in \mathbb{R}^d} L(w).$$
Once we have estimated the vector $w^*$, we can find the most likely tag sequence $s^*$ for a sentence $x$ by
$$s^* = \arg\max_{s} p(s \mid x; w^*).$$
For more details we refer to the notes by M. Collins [http://www.cs.columbia.edu/~mcollins/crf.pdf].
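To make the model a bit more concrete, here is a minimal sketch that evaluates p(s∣x;w) by brute force on a three-word sentence. The feature map, the tag set and the weights are made up for illustration (they are not the features used later in this post), and enumerating every tag sequence only works for tiny inputs; real CRF implementations use dynamic programming instead.

```python
import itertools
import math

TAGS = ("O", "B-per", "I-per")  # toy tag set (assumption, not the full dataset tag set)


def feature_map(x, s):
    """Toy feature map Phi(x, s): counts of (word, tag) and (previous tag, tag) pairs."""
    feats = {}
    prev = "<START>"
    for word, tag in zip(x, s):
        for key in (("emit", word.lower(), tag), ("trans", prev, tag)):
            feats[key] = feats.get(key, 0.0) + 1.0
        prev = tag
    return feats


def score(w, x, s):
    """Dot product w . Phi(x, s) for a sparse weight dict w."""
    return sum(w.get(key, 0.0) * value for key, value in feature_map(x, s).items())


def all_sequences(x):
    """Every possible tag sequence s' of the same length as x."""
    return itertools.product(TAGS, repeat=len(x))


def prob(w, x, s):
    """p(s | x; w): exp(score) normalised over all tag sequences."""
    z = sum(math.exp(score(w, x, s2)) for s2 in all_sequences(x))
    return math.exp(score(w, x, s)) / z


def decode(w, x):
    """argmax over s of p(s | x; w); the normaliser does not change the argmax."""
    return max(all_sequences(x), key=lambda s: score(w, x, s))


# Hypothetical weights, chosen by hand for this example only.
w = {
    ("emit", "barack", "B-per"): 2.0,
    ("emit", "obama", "I-per"): 2.0,
    ("trans", "B-per", "I-per"): 1.5,
    ("emit", "spoke", "O"): 1.0,
}
x = ["Barack", "Obama", "spoke"]
best = decode(w, x)
print(best, prob(w, x, best))
```

Because the denominator is the same for every candidate s, decoding only needs the scores w⋅Φ(x,s); the normalisation matters for training and for reading the output as a probability.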
Load the dataset
If you want to run the tutorial yourself, you can find the dataset here.
Now we want to apply this model. Let’s start by loading the data.
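import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
# Forward-fill missing values so that every row carries a sentence id
# (equivalent to fillna(method="ffill"), which newer pandas versions deprecate).
data = data.ffill()
data.tail(10)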
              Sentence #       Word  POS    Tag
1048565  Sentence: 47958     impact   NN      O
1048566  Sentence: 47958          .    .      O
1048567  Sentence: 47959     Indian   JJ  B-gpe
1048568  Sentence: 47959     forces  NNS      O
1048569  Sentence: 47959       said  VBD      O
1048570  Sentence: 47959       they  PRP      O
1048571  Sentence: 47959  responded  VBD      O
1048572  Sentence: 47959         to   TO      O
1048573  Sentence: 47959        the   DT      O
1048574  Sentence: 47959     attack   NN      O
words = list(set(data["Word"].values))
n_words = len(words); n_words
35178
So we have 47959 sentences containing 35178 different words. We change the SentenceGetter class from the last post a little and use it to retrieve sentences with their labels.
class SentenceGetter(object):

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Turn each sentence into a list of (word, POS, tag) triples.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once the sentence ids are exhausted.
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
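The cross-validation call that follows needs feature dictionaries for every token. As a minimal sketch of what such a feature function could look like, the code below uses the feature names that show up in the weight tables later in this post (word.lower(), word[-3:], postag, -1:word.lower() and so on). The BOS/EOS markers and the exact set of neighbour features are assumptions on my side, not necessarily what was used to produce the results shown here.

```python
def word2features(sent, i):
    """Feature dict for token i of a sentence given as (word, POS, tag) triples."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "postag": postag,
    }
    if i > 0:
        # Features of the previous token, prefixed with "-1:".
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features.update({
            "-1:word.lower()": prev_word.lower(),
            "-1:word.istitle()": prev_word.istitle(),
            "-1:postag": prev_postag,
        })
    else:
        features["BOS"] = True  # beginning-of-sentence marker (assumption)
    if i < len(sent) - 1:
        # Features of the next token, prefixed with "+1:".
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features.update({
            "+1:word.lower()": next_word.lower(),
            "+1:word.istitle()": next_word.istitle(),
            "+1:postag": next_postag,
        })
    else:
        features["EOS"] = True  # end-of-sentence marker (assumption)
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [tag for _, _, tag in sent]


# First three tokens of sentence 47959 from the table above.
sentence = [("Indian", "JJ", "B-gpe"), ("forces", "NNS", "O"), ("said", "VBD", "O")]
X_example = sent2features(sentence)
y_example = sent2labels(sentence)
print(X_example[1])
```

With a getter built from the class above (say, a hypothetical `getter = SentenceGetter(data)`), X and y would be the lists of `sent2features(s)` and `sent2labels(s)` over `getter.sentences`. The estimator `crf` is presumably a CRF from a library such as sklearn-crfsuite, which exposes the c1 and c2 parameters discussed further down, but its exact settings are not shown in this section.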
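from sklearn.model_selection import cross_val_predict

# X (per-sentence feature dicts), y (per-sentence tag lists) and crf (the CRF
# estimator) have to be defined before this call.
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)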
We will use the scikit-learn classification report to evaluate the tagger, because we are mainly interested in precision, recall and the F1-score. These metrics are common in NLP tasks; if you are not familiar with them, check out the Wikipedia articles.
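If you want to see what such a report computes, here is a small sketch that derives token-level precision, recall and F1 per tag from true and predicted tag sequences. The helper name `per_tag_scores` and the two toy sentences are made up for this example; the real report additionally shows support counts and averages.

```python
from collections import Counter


def per_tag_scores(y_true, y_pred):
    """Token-level (precision, recall, f1) per tag from lists of tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1  # p was predicted but is wrong
                fn[t] += 1  # t was missed
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Toy data: the second sentence misses the I-per token.
y_true = [["B-gpe", "O", "O"], ["B-per", "I-per", "O"]]
y_pred = [["B-gpe", "O", "O"], ["B-per", "O", "O"]]
scores = per_tag_scores(y_true, y_pred)
for tag in sorted(scores):
    print(tag, scores[tag])
```

In this toy case the missed I-per token lowers the precision of O (one of four O predictions is wrong) and gives I-per a recall of zero, which is the kind of per-tag detail that a single accuracy number would hide.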
The nice thing about CRFs is that we can look into the algorithm and visualize the learned transition weights from one tag to another. We can also see which features are important for predicting a certain tag. We use the eli5 library to perform the investigation.
import eli5
eli5.show_weights(crf, top=30)
From \ To       O   B-art   I-art   B-eve   I-eve   B-geo   I-geo   B-gpe   I-gpe   B-nat   I-nat   B-org   I-org   B-per   I-per   B-tim   I-tim
O            4.29   0.879     0.0   1.575     0.0   2.092     0.0   1.387     0.0   1.605     0.0   2.497     0.0    4.17     0.0   2.986     0.0
B-art      -0.014     0.0   8.442     0.0     0.0  -0.398     0.0     0.0     0.0     0.0     0.0   0.516     0.0  -0.844     0.0   0.336     0.0
I-art      -0.651     0.0    8.04     0.0     0.0  -0.702     0.0     0.0     0.0     0.0     0.0     0.0     0.0   0.016     0.0  -0.684     0.0
B-eve      -0.753     0.0     0.0     0.0   7.956     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   0.572     0.0
I-eve      -0.324     0.0     0.0     0.0   7.341     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  -0.621     0.0
B-geo       0.677   0.752     0.0   0.545     0.0     0.0   8.752   0.579     0.0     0.0     0.0   1.155     0.0   1.143     0.0   2.344     0.0
I-geo      -0.469   0.822     0.0     0.0     0.0     0.0   7.424  -1.366     0.0     0.0     0.0  -0.074     0.0   1.331     0.0   1.033     0.0
B-gpe       0.679  -1.609     0.0   -0.32     0.0   0.681     0.0     0.0   7.485     0.0     0.0    2.05     0.0   1.459     0.0   0.767     0.0
I-gpe      -0.298     0.0     0.0     0.0     0.0  -1.087     0.0     0.0   6.337     0.0     0.0     0.0     0.0   0.148     0.0     0.0     0.0
B-nat      -1.108     0.0     0.0     0.0     0.0   0.625     0.0     0.0     0.0     0.0   7.067     0.0     0.0  -0.305     0.0  -0.413     0.0
I-nat      -1.979     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   5.197     0.0     0.0   1.188     0.0     0.0     0.0
B-org       0.051    1.32     0.0     0.0     0.0  -0.331     0.0   0.447     0.0     0.0     0.0     0.0   7.109   1.054     0.0   0.255     0.0
I-org      -0.242     0.0     0.0     0.0     0.0  -1.562     0.0   0.573     0.0     0.0     0.0     0.0   7.236   1.639     0.0   0.421     0.0
B-per       0.364     0.0     0.0     0.0     0.0   0.723     0.0   0.734     0.0   2.176     0.0   2.405     0.0     0.0   7.146   1.165     0.0
I-per        0.18     0.0     0.0     0.0     0.0  -2.072     0.0  -1.568     0.0     0.0     0.0  -0.341     0.0     0.0   6.299   1.055     0.0
B-tim       0.286  -1.079     0.0   0.249     0.0  -0.083     0.0  -1.338     0.0   0.061     0.0  -0.148     0.0   1.338     0.0     0.0   7.245
I-tim      -0.263     0.0     0.0   0.072     0.0   -0.11     0.0  -1.437     0.0     0.0     0.0  -0.374     0.0   1.854     0.0     0.0   7.069
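To get a feeling for how these transition weights enter a prediction, here is a small Viterbi sketch. The transition weights for O, B-per and I-per are copied from the table above; the per-token state scores are hypothetical numbers chosen for this example, and the sketch ignores details such as start and end handling that a real CRF implementation may include.

```python
def viterbi(state_scores, trans, tags):
    """Best tag path given per-position state scores and transition weights.

    state_scores: list of dicts tag -> score, one per token (missing tags count as 0).
    trans: dict (from_tag, to_tag) -> weight.
    """
    best = [{t: state_scores[0].get(t, 0.0) for t in tags}]
    back = []
    for scores in state_scores[1:]:
        cur, ptr = {}, {}
        for t in tags:
            # Best previous tag for ending in t at this position.
            prev = max(tags, key=lambda p: best[-1][p] + trans[(p, t)])
            cur[t] = best[-1][prev] + trans[(prev, t)] + scores.get(t, 0.0)
            ptr[t] = prev
        best.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


tags = ["O", "B-per", "I-per"]
# Transition weights taken from the table above (rows: from, columns: to).
trans = {
    ("O", "O"): 4.29, ("O", "B-per"): 4.17, ("O", "I-per"): 0.0,
    ("B-per", "O"): 0.364, ("B-per", "B-per"): 0.0, ("B-per", "I-per"): 7.146,
    ("I-per", "O"): 0.18, ("I-per", "B-per"): 0.0, ("I-per", "I-per"): 6.299,
}
# Hypothetical state scores for a three-token sentence; the middle token is
# equally likely to be B-per or I-per when looked at in isolation.
state_scores = [
    {"B-per": 2.0},
    {"B-per": 1.0, "I-per": 1.0},
    {"O": 3.0, "I-per": -4.0},
]
path = viterbi(state_scores, trans, tags)
print(path)
```

The middle token is ambiguous on its own, but the large B-per → I-per weight (7.146) compared with B-per → B-per (0.0) makes the decoder continue the person entity instead of starting a new one.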
Top features per tag, in this order: O, B-art, I-art, B-eve, I-eve, B-geo, I-geo, B-gpe, I-gpe, B-nat, I-nat, B-org, I-org, B-per, I-per, B-tim, I-tim.
y=O top features
Weight?  Feature
+8.012   word.lower():last
+7.999   word.lower():month
+5.813   word.lower():chairman
+5.612   word.lower():columbia
+5.555   word.lower():year
+5.232   word.lower():week
+5.146   word.lower():months
+5.067   word.lower():internet
+4.833   word.lower():weeks
+4.726   word.lower():after
+4.684   word.lower():republicans
+4.558   word[-3:]:And
+4.436   word.lower():ambassador
+4.406   word.lower():chief
+4.383   word.lower():trade
+4.344   word.lower():early
+4.272   word.lower():years
+4.216   +1:word.lower():americans
+4.140   word.lower():tourism
+4.127   +1:word.lower():american
+4.079   word.lower():christian
+4.075   word.lower():spokesman
+4.060   word[-3:]:De
+4.060   word[-2:]:De
… 9204 more positive …
… 5208 more negative …
-4.091   word[-2:]:0s
-4.158   word.lower():afternoon
-4.447   word.lower():palestinian
-4.515   word.lower():summer
-4.607   word.lower():morning
-4.801   word.lower():multi-party
y=B-art top features
Weight?  Feature
+5.369   word.lower():twitter
+4.858   word.lower():spaceshipone
+4.294   word.lower():nevirapine
+4.271   +1:word.lower():enkhbayar
+4.263   +1:word.lower():boots
+3.893   word.lower():english
+3.802   -1:word.lower():engine
+3.655   word[-3:]:One
+3.588   -1:word.lower():film
+3.540   word.lower():russian
+3.499   word.lower():canal
+3.397   +1:word.lower():al-arabiya
+3.345   -1:word.lower():adumim
+3.237   word.lower():sopranos
+3.186   -1:word.lower():to
+3.150   word.lower():spanish
+3.130   -1:word.lower():shown
+3.014   word.lower():economics
+3.006   -1:word.lower():tamilnet
+2.997   word.lower():frankenstadion
+2.973   word.lower():settlement
+2.936   word[-2:]:00
+2.919   word.lower():dollar
+2.889   -1:word.lower():republic
+2.889   +1:word.lower():helicopters
+2.877   +1:word.lower():search
+2.875   -1:word.lower():program
+2.831   word.lower():endeavor
+2.711   word[-3:]:vor
+2.685   word.lower():sidnaya
… 957 more positive …
… 81 more negative …
y=I-art top features
Weight?  Feature
+3.025   -1:word.lower():boeing
+2.553   +1:word.lower():gained
+2.473   +1:word.lower():came
+2.418   -1:word.lower():cajun
+2.297   word.lower():notice
+2.260   word.lower():constitution
+2.112   word.lower():flowers
+2.109   +1:word.lower():times
+2.072   +1:word.lower():marks
+2.056   word.lower():a
+2.048   +1:word.lower():teshome
+1.980   +1:word.lower():treaty
+1.876   +1:word.lower():expands
+1.875   +1:word.lower():reports
+1.866   -1:word.lower():dignity
+1.859   word.lower():dome
+1.852   +1:word.lower():early
+1.844   +1:word.lower():roses
+1.805   -1:word.lower():jerusalem
+1.800   -1:word.lower():balad
+1.793   +1:word.lower():outside
+1.779   word.lower():monument
+1.774   -1:word.lower():baghdad
+1.765   -1:word.lower():beijing
+1.757   +1:word.lower():rival
+1.747   -1:word.lower():hitler
+1.668   word[-3:]:One
+1.667   word.lower():lies
+1.660   word.lower():declaration
+1.645   word.lower():mustard
… 882 more positive …
… 81 more negative …
y=B-eve top features
Weight?  Feature
+4.333   word.lower():games
+4.263   word.lower():ramadan
+4.160   -1:word.lower():falklands
+3.501   -1:word.lower():typhoon
+3.484   word[-3:]:mes
+3.050   +1:word.lower():dean
+3.046   +1:word.lower():men
+3.028   -1:word.lower():wars
+2.942   -1:word.lower():happy
+2.938   -1:word.lower():solemn
+2.915   word.lower():hopman
+2.899   word.lower():katrina
+2.846   word.lower():olympic
+2.843   word[-3:]:pic
+2.758   -1:word.lower():war
+2.745   word.lower():parma
+2.714   -1:word.lower():midnight
+2.596   word.lower():australian
+2.570   -1:word.lower():2002
+2.547   +1:word.lower():security
+2.518   +1:word.lower():sabbath
+2.454   +1:word.lower():open
+2.446   +1:word.lower():event
+2.442   word.lower():passover
+2.433   -1:word.lower():nazi
+2.409   +1:word.lower():ends
+2.390   word.lower():holocaust
+2.350   -1:word.lower():reigning
+2.262   word[-3:]:mme
+2.262   word.lower():somme
… 437 more positive …
… 49 more negative …
y=I-eve top features
Weight?  Feature
+4.329   +1:word.lower():mascots
+3.603   word.lower():games
+3.022   +1:word.lower():era
+2.756   word.lower():series
+2.577   word.lower():dean
+2.509   +1:word.lower():rally
+2.508   +1:word.lower():caused
+2.504   +1:word.lower():disaster
+2.441   word.lower():sabbath
+2.426   +1:word.lower():tore
+2.420   +1:word.lower():without
+2.230   -1:word.lower():jewish
+2.220   +1:word.lower():now
+2.216   +1:word.lower():project
+2.164   +1:word.lower():suicide
+2.112   -1:word.lower():awareness
+1.940   +1:word.lower():holiday
+1.916   +1:word.lower():peace
+1.880   word[-3:]:ean
+1.861   -1:word.lower():hurricane
+1.831   +1:word.lower():even
+1.828   +1:word.lower():finals
+1.762   word.lower():conference
+1.760   -1:word.lower():typhoon
+1.753   -1:word.lower():may
+1.743   +1:word.lower():tennis
+1.712   -1:word.lower():rights
+1.702   word.lower():year
+1.699   +1:word.lower():olympics
+1.696   word.lower():awareness
… 393 more positive …
… 64 more negative …
y=B-geo top features
Weight?  Feature
+6.238   word.lower():mid-march
+6.002   word.lower():caribbean
+5.503   word.lower():martian
+5.446   word.lower():beijing
+5.086   word.lower():persian
+4.737   -1:word.lower():hamas
+4.521   -1:word.lower():mr.
+4.509   word.lower():balkans
+4.362   -1:word.lower():serb
+4.310   word.lower():quake-zone
+4.224   word.lower():philippines
+4.192   word.lower():burma
+4.169   +1:word.lower():phoned
+4.167   word.lower():washington
+4.152   word.lower():france
+4.137   word.lower():paris
+4.131   -1:word.lower():taleban
+4.016   -1:word.lower():bordeaux
+3.943   word.lower():mars
+3.900   +1:word.lower():moqtada
+3.886   -1:word.lower():cypriot
+3.870   word.lower():mid-june
+3.837   word.lower():wheeler
+3.788   word.lower():pearl
+3.744   -1:word.lower():malaysian
+3.698   word.lower():athens
+3.616   word.lower():séances
+3.616   word.lower():port-au-prince
+3.589   word.lower():christians
… 5949 more positive …
… 1365 more negative …
-4.659   word[-3:]:The
y=I-geo top features
Weight?  Feature
+4.211   word.lower():led-invasion
+4.151   word.lower():holiday
+4.065   word.lower():caribbean
+3.651   +1:word.lower():possessions
+3.446   +1:word.lower():regional
+3.430   +1:word.lower():french
+3.374   -1:word.lower():nahr
+3.296   -1:word.lower():tokugawa
+3.296   word.lower():shogunate
+3.232   word.lower():restaurant
+3.127   word.lower():island
+3.063   word.lower():autonomy
+3.059   +1:word.lower():produced
+3.054   -1:word.lower():kennedy
+2.992   -1:word.lower():christmas
+2.890   word.lower():ocean
+2.885   word.lower():east
+2.852   +1:word.lower():block
+2.826   -1:word.lower():sumatran
+2.745   -1:word.lower():surma
+2.721   -1:word.lower():john
+2.675   word.lower():subway
+2.645   +1:word.lower():crude
+2.635   +1:word.lower():service
+2.623   +1:word.lower():holidays
+2.593   word.lower():lions
+2.482   +1:word.lower():islamic
+2.409   +1:word.lower():crisis
… 2989 more positive …
… 525 more negative …
-2.367   word[-3:]:ost
-2.493   word[-3:]:day
y=B-gpe top features
Weight?  Feature
+6.735   word.lower():afghan
+6.602   word.lower():niger
+6.219   word.lower():nepal
+5.432   word.lower():spaniard
+5.391   word.lower():azerbaijan
+5.138   word.lower():iranian
+5.127   word.lower():mexican
+5.080   word.lower():argentine
+4.926   word.lower():gibraltar
+4.829   word.lower():iraqi
+4.706   word.lower():spaniards
+4.662   word.lower():croats
+4.638   word.lower():venezuelan
+4.599   word.lower():cuban
+4.526   word.lower():korean
+4.526   word.lower():polish
+4.480   word.lower():aussies
+4.313   word.lower():bahamas
+4.301   word.lower():syrian
+4.280   word.lower():andorra
+4.278   word.lower():jordan
+4.271   word.lower():turkish
+4.234   word.lower():madagonia
+4.226   word.lower():chechen
+4.224   word.lower():chilean
+4.215   word.lower():kenyan
+4.209   word.lower():irish
+4.206   word.lower():egyptian
+4.191   word.lower():palestinians
+4.147   word.istitle()
… 1434 more positive …
… 505 more negative …
y=I-gpe top features
Weight?  Feature
+5.622   +1:word.lower():mayor
+4.073   -1:word.lower():democratic
+3.844   -1:word.lower():bosnian
+3.602   +1:word.lower():developed
+3.543   word.lower():korean
+3.308   word[-3:]:can
+3.226   -1:word.lower():soviet
+3.217   word.lower():city
+3.179   +1:word.lower():health
+3.172   word.lower():cypriots
+3.000   word.lower():britons
+2.857   +1:word.lower():under
+2.841   +1:word.lower():iraq
+2.737   +1:word.lower():invaded
+2.619   +1:word.lower():man
+2.601   +1:word.lower():returned
+2.547   -1:word.lower():islamic
+2.532   +1:word.lower():did
+2.471   +1:word.lower():also
+2.449   word[-2:]:bs
+2.327   word.lower():indians
+2.307   word.lower():cypriot
+2.294   word[-3:]:iot
+2.220   word[-3:]:ots
+2.188   -1:word.lower():panama
+2.159   +1:word.lower():began
+2.109   word[-3:]:ovy
+2.109   word.lower():muscovy
+2.095   +1:word.lower():countries
+2.067   word[-2:]:ot
… 207 more positive …
… 40 more negative …
y=B-nat top features
Weight?  Feature
+6.149   word.lower():katrina
+5.371   word.lower():marburg
+4.334   word.lower():rita
+3.535   +1:word.lower():shot
+2.959   word[-3:]:ita
+2.791   word.lower():leukemia
+2.769   word[-3:]:urg
+2.759   word[-3:]:mia
+2.665   word.lower():paul
+2.647   +1:word.lower():strain
+2.595   word[-2:]:N1
+2.552   word.lower():ebola
+2.505   word[-3:]:5N1
+2.505   word.lower():h5n1
+2.505   +1:word.lower():immunization
+2.454   word[-3:]:aul
+2.444   word.lower():danielle
+2.379   +1:word.lower():lives
+2.349   word.lower():acc
+2.349   word[-3:]:ACC
+2.337   -1:word.lower():often-deadly
+2.322   -1:word.lower():7,000
+2.280   word[-2:]:TB
+2.222   +1:word.lower():epidemics
+2.174   word[-2:]:rg
+2.158   word.isupper()
+2.147   +1:word.lower():should
+2.140   -1:word.lower():case
+2.133   word.lower():amur
+2.121   +1:word.lower():correctly
… 242 more positive …
… 39 more negative …
y=I-nat top features
Weight?  Feature
+2.681   word.lower():rita
+2.327   word[-3:]:ita
+2.315   +1:word.lower():outbreak
+1.944   -1:word.lower():hurricanes
+1.909   word[-2:]:ta
+1.747   word.lower():flu
+1.670   word[-2:]:lu
+1.654   -1:word.lower():type
+1.624   +1:word.lower():relief
+1.613   -1:postag:NN
+1.572   -1:word.istitle()
+1.570   -1:word.lower():heart
+1.471   +1:word.lower():last
+1.422   +1:word.lower():slammed
+1.421   -1:word.lower():jing
+1.421   word.lower():jing
+1.400   word.lower():katrina
+1.280   +1:word.lower():says
+1.171   word.lower():disease
+1.170   -1:word.lower():hurricane
+1.153   -1:word.lower():avian
+1.137   word.lower():circumpolar
+1.121   word[-3:]:Flu
+1.092   -1:word.lower():antarctic
+1.068   -1:word.lower():circumpolar
+1.066   word[-3:]:ase
+1.051   +1:word.lower():current
+1.050   word[-3:]:ina
+1.045   word.lower():current
+1.036   word[-2:]:ba
… 91 more positive …
… 20 more negative …
y=B-org top features
Weight?  Feature
+7.344   word.lower():philippine
+6.075   word.lower():mid-march
+5.812   word.lower():hamas
+5.779   -1:word.lower():rice
+5.629   word.lower():al-qaida
+5.071   word.lower():taleban
+4.756   word.lower():taliban
+4.729   -1:word.lower():senator
+4.723   word.lower():reuters
+4.662   word.lower():hezbollah
+4.618   word.lower():university
+4.565   word.lower():conocophillips
+4.295   word.lower():boeing
+4.269   word.lower():senate
+4.244   word.lower():constantinople
+4.240   word.lower():kindhearts
+4.141   word.lower():boers
+4.092   -1:word.lower():singh
+4.061   word.lower():exxonmobil
+4.054   -1:word.lower():nepal
+4.002   word.lower():yukos
+3.997   word.lower():munich
+3.969   -1:word.lower():niger
+3.943   word.lower():congress
+3.920   word.lower():xinhua
+3.909   word.lower():mcdonald
+3.907   word.lower():daimlerchrysler
+3.845   word.lower():convergence
+3.845   -1:word.lower():israel
+3.824   -1:word.lower():semi-autonomous
… 6796 more positive …
… 1476 more negative …
y=I-org top features
Weight?  Feature
+3.981   +1:word.lower():attained
+3.785   +1:word.lower():reporter
+3.486   -1:word.lower():associated
+3.463   word.lower():singapore
+3.400   word.lower():member-countries
+3.365   -1:word.lower():decathlon
+3.360   +1:word.lower():ohlmert
+3.343   word.lower():times
+3.335   word.lower():member-states
+3.282   +1:word.lower():separating
+3.264   -1:word.lower():&
+3.156   +1:word.lower():mulgueta
+3.127   word.lower():nations
+3.126   word.lower():holiday
+3.099   word.lower():decathlon
+3.067   +1:word.lower():ms.
+3.063   +1:word.lower():1947
+3.041   word.lower():airlines
+3.029   word.lower():washington
+2.900   +1:word.lower():post
+2.884   word.lower():relief
+2.880   word.lower():protests
+2.877   +1:word.lower():mil
+2.855   word.lower():ohlmert
… 6749 more positive …
… 1545 more negative …
-3.007   -1:word.lower():hamas
-3.224   -1:word.lower():minister
-3.233   word[-2:]:hn
-3.909   word[-2:]:lf
-4.079   word.lower():city
-4.283   word.lower():secretary
y=B-per top features
Weight?  Feature
+7.301   word.lower():president
+6.125   word.lower():obama
+5.647   word.lower():senator
+5.367   word.lower():greenspan
+5.325   word.lower():vice
+4.824   word.lower():western
+4.721   word.lower():hall
+4.600   word.lower():prime
+4.541   word.lower():clinton
+4.510   word.lower():frank
+4.383   word.lower():cobain
+4.318   word.lower():milosevic
+4.177   word.lower():brent
+4.002   word[-2:]:r.
+3.953   word.lower():johnston
+3.919   word.lower():spears
+3.823   word.lower():zidane
+3.811   word.lower():al-zarqawi
+3.796   word.lower():mccain
+3.771   word.lower():toure
+3.722   word.lower():barghouti
+3.670   word.lower():rice
+3.660   +1:word.lower():extra
+3.641   word.lower():friedan
+3.614   word.lower():whittington
+3.596   -1:word.lower():spain
+3.589   word.lower():larose
+3.559   word.lower():preval
+3.555   word.lower():enkhbayar
… 6345 more positive …
… 1308 more negative …
-3.533   word.lower():venezuela
y=I-per top features
Weight?  Feature
+4.163   word.lower():obama
+3.625   +1:word.lower():advisor
+3.517   word.lower():pressewednesday
+3.464   +1:word.lower():timothy
+3.230   +1:word.lower():gao
+3.191   +1:word.lower():fighters
+3.102   -1:word.lower():michael
+3.079   word.lower():gates
+2.944   -1:word.lower():david
+2.912   -1:word.lower():davis
+2.906   word.lower():ahmed
+2.879   -1:word.lower():condoleezza
+2.850   word.lower():laden
+2.782   +1:word.lower():hui
+2.743   -1:word.lower():bashar
+2.710   +1:word.lower():atal
+2.675   -1:word.lower():viktor
+2.572   -1:word.lower():paul
+2.569   word.lower():christians
+2.561   +1:word.lower():convoy
+2.559   word.lower():rice
+2.541   +1:word.lower():legally
+2.525   -1:word.lower():donald
+2.493   word.lower():milosevic
+2.477   word.lower():gration
+2.459   +1:word.lower():saeb
+2.450   word.lower():mcalpine
+2.441   +1:word.lower():udi
… 5553 more positive …
… 1380 more negative …
-3.158   -1:word.lower():sri
-4.190   word[-3:]:day
y=B-tim top features
Weight?  Feature
+7.226   word.lower():multi-candidate
+6.381   word.lower():february
+6.335   word.lower():january
+6.181   word.lower():2000
+6.126   word.lower():one-year
+5.950   word.lower():weekend
+5.557   +1:word.lower():week
+5.225   word.lower():august
+5.199   word.lower():december
+4.961   word.lower():september
+4.783   word.lower():april
+4.752   word.lower():june
+4.652   word.lower():1980s
+4.591   word[-3:]:Day
+4.549   word.lower():october
+4.548   word.lower():november
+4.519   word.lower():eucharist
+4.388   -1:word.lower():week
+4.344   word.lower():titan
+4.286   word.lower():half-hour
+4.273   word.lower():mid-afternoon
+4.251   +1:word.lower():year
+4.237   word.lower():midnight
+4.117   word.lower():one-fourth
+4.097   +1:word.lower():working-age
+4.016   word.lower():quarter-century
+3.977   word.lower():pre-season
+3.910   word.lower():non-residents
+3.880   word.lower():mid-week
+3.853   -1:word.lower():cannes
… 3173 more positive …
… 856 more negative …
y=I-tim top features
Weight?  Feature
+4.467   +1:word.lower():stocky
+4.098   +1:word.lower():old
+4.080   word.lower():working-age
+3.831   word.lower():2000
+3.821   word.lower():april
+3.654   +1:word.lower():jose
+3.597   -1:word.lower():this
+3.468   +1:word.lower():reflected
+3.407   +1:word.lower():month
+3.403   -1:word.lower():past
+3.230   word.lower():weekend
+3.164   word.lower():evening
+3.053   +1:word.lower():katrina
+3.035   word.lower():january
+3.020   +1:word.lower():population
+3.009   +1:word.lower():ago
+2.806   -1:word.lower():nov.
+2.757   word[-3:]:.m.
+2.757   word[-2:]:m.
+2.754   +1:word.lower():removed
+2.748   -1:word.lower():earlier
+2.734   +1:word.lower():ukrainian
+2.730   +1:word.lower():early
+2.727   -1:word.lower():ecuador
+2.726   -1:word.lower():uganda
+2.700   word.lower():august
+2.695   +1:word.lower():year
+2.650   -1:word.lower():second
… 2169 more positive …
… 459 more negative …
-2.776   word[-3:]:way
-2.900   +1:word.lower():3
Improve the model with regularization
Phew, it looks like the CRF is just memorizing a lot of words. For example, for the tag ‘B-per’ the algorithm remembers ‘president’ and ‘obama’. To overcome this issue we can tune the parameters of the CRF algorithm, especially the regularization parameters. The c1 and c2 parameters of the CRF algorithm are the regularization parameters $\lambda_1$ and $\lambda_2$: c1 weights the l1 regularization, while c2 weights the l2 regularization. We now limit the number of features used by enforcing sparsity on the parameter vector $w$. To do this we increase the l1-regularization parameter c1.
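Why does a larger c1 produce a sparse weight vector? A rough way to see it is to compare the shrinkage step that belongs to each penalty. The sketch below is only an illustration of that effect and not what the CRF optimizer does internally; the first two weights are the ‘president’ and ‘obama’ weights from the B-per table, the small ones are made up.

```python
def l1_shrink(weights, lam):
    """Soft-thresholding: the proximal step for the penalty lam * |w|.

    Every weight moves towards zero by lam; weights with |w| <= lam become exactly 0.
    """
    shrunk = []
    for v in weights:
        if v > lam:
            shrunk.append(v - lam)
        elif v < -lam:
            shrunk.append(v + lam)
        else:
            shrunk.append(0.0)
    return shrunk


def l2_shrink(weights, lam):
    """Proximal step for the penalty (lam / 2) * w**2: a pure rescaling."""
    return [v / (1.0 + lam) for v in weights]


# Two large weights from the B-per table plus three small hypothetical ones.
weights = [7.301, 6.125, 0.4, -0.3, 0.05]
sparse = l1_shrink(weights, 0.5)
dense = l2_shrink(weights, 0.5)
print(sparse)
print(dense)
```

The l1 step removes the three small weights completely and keeps the large ones almost unchanged, while the l2 step shrinks everything proportionally and leaves no exact zeros. In a CRF library that exposes c1 and c2, raising c1 pushes the model in the first direction.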
As expected, the model stops relying on individual words and uses the context more, since context generalizes better and is useful across multiple training instances. This is an effect of the l1-regularization.
This is it for this time, but stay tuned for the next post, where we will look at named entity recognition with recurrent neural networks.