Today we want to build a model that can identify ingredients in cooking recipes. I use the “German Recipes Dataset”, which I recently published on Kaggle. It contains more than 12,000 German recipes together with their ingredient lists. First we generate a label for every word in the recipe instructions, marking whether it is an ingredient or not. Then we train a sequence-tagging neural network on these noisy labels. Finally, we pseudo-label the training set with the model's predictions and update the model with the new labels.
import numpy as np
import pandas as pd
Load the recipes
df = pd.read_json("../input/recipes.json")
df.Instructions[2]
'Die Kirschen abtropfen lassen, dabei den Saft auffangen. Das Puddingpulver mit dem Vanillezucker mischen und mit 6 EL Saft glatt rühren. Den übrigen Kirschsaft aufkochen und vom Herd nehmen. Das angerührte Puddingpulver einrühren und unter Rühren ca. eine Minute köcheln. Die Kirschen unter den angedickten Saft geben. Milch, 40 g Zucker, Vanillemark und Butter aufkochen. Den Topf vom Herd ziehen und den Grieß unter Rühren einstreuen. Unter Rühren einmal aufkochen lassen und zugedeckt ca. fünf Minuten quellen lassen.In der Zeit das Ei trennen. Das Eiweiß mit einer Prise Salz steif schlagen und dabei die restlichen 20 g Zucker einrieseln lassen. Das Eigelb unter den Brei rühren und dann das Eiweiß unterheben.Den Grießbrei mit dem Kompott servieren.'
We put some recipes aside for later evaluation.
eval_df = df[11000:]
eval_df.shape
(1190, 8)
df = df[:11000]
df.shape
(11000, 8)
Tokenize the texts with spaCy
!python -m spacy download de_core_news_sm
Collecting de_core_news_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz#egg=de_core_news_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz (38.2MB)
Installing collected packages: de-core-news-sm
  Running setup.py install for de-core-news-sm ... done
Successfully installed de-core-news-sm-2.0.0
Linking successful
    /opt/conda/lib/python3.6/site-packages/de_core_news_sm -->
    /opt/conda/lib/python3.6/site-packages/spacy/data/de_core_news_sm
You can now load the model via spacy.load('de_core_news_sm')
import spacy
nlp = spacy.load('de_core_news_sm', disable=['parser', 'tagger', 'ner'])
We run the spaCy tokenizer on all instructions.
tokenized = [nlp(t) for t in df.Instructions.values]
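A note on speed: calling nlp() once per recipe works, but spaCy can also stream the texts in batches with nlp.pipe, which is usually much faster on thousands of documents. A minimal alternative (the batch_size value here is an arbitrary choice, not something I tuned):
# Faster alternative: stream all instructions through spaCy in batches.
tokenized = list(nlp.pipe(df.Instructions.values, batch_size=256))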
And now we build a vocabulary of known tokens.
vocab = {"<UNK>": 1, "<PAD>": 0}
for txt in tokenized:
    for token in txt:
        if token.text not in vocab:
            vocab[token.text] = len(vocab)
print("Number of unique tokens: {}".format(len(vocab)))
Number of unique tokens: 17687
Create the labels
What is still missing are the labels: we need to know where in the text the ingredients are mentioned. We will try to bootstrap this information from the provided ingredient lists.
ingredients = df.Ingredients
ingredients[0]
['600 g Hackfleisch, halb und halb',
'800 g Sauerkraut',
'200 g Wurst, geräucherte (Csabai Kolbász)',
'150 g Speck, durchwachsener, geräucherter',
'100 g Reis',
'1 m.-große Zwiebel(n)',
'1 Zehe/n Knoblauch',
'2 Becher Schmand',
'1/2TL Kümmel, ganzer',
'2 Lorbeerblätter',
'Salz und Pfeffer',
'4 Ei(er) (bei Bedarf)',
'Paprikapulver',
'etwas Wasser',
'Öl']
We first clean the ingredient lists by removing stopwords, numbers and other noise.
def _filter(token):
    # Keep only tokens that can plausibly be ingredient nouns: at least two
    # characters, no stopwords, capitalized (German nouns are capitalized),
    # and no numbers or number-like tokens.
    if len(token) < 2:
        return False
    if token.is_stop:
        return False
    if token.text[0].islower():
        return False
    if token.is_digit:
        return False
    if token.like_num:
        return False
    return True

def _clean(text):
    # Drop opening parentheses and variant suffixes such as "Zehe/n".
    text = text.replace("(", "")
    text = text.split("/")[0]
    return text
clean = [_clean(t.text) for i in ingredients[214] for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
clean
['Rosenkohl',
'Schalotten',
'Hühnerbrühe',
'Milch',
'EL',
'Crème',
'Speck',
'Kartoffelgnocchi']
def get_labels(ingredients, tokenized_instructions):
    labels = []
    for ing, ti in zip(ingredients, tokenized_instructions):
        l_i = []
        # Cleaned ingredient tokens for this recipe.
        ci = [_clean(t.text) for i in ing for t in nlp(i) if _filter(t) and len(_clean(t.text)) >= 2]
        for token in ti:
            # Match exactly, or with one trailing character stripped on either
            # side, to catch simple singular/plural variants.
            l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
        labels.append(l_i)
    return labels
labels = get_labels(ingredients, tokenized)
set([t.text for t, l in zip(tokenized[214], labels[214]) if l])
{'Crème', 'Hühnerbrühe', 'Milch', 'Rosenkohl', 'Schalotten', 'Speck'}
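Before modelling, it helps to know how imbalanced these labels are, since most tokens in a recipe are not ingredients; the accuracy numbers later should be read against this baseline. A quick check:
n_pos = sum(sum(l) for l in labels)
n_tot = sum(len(l) for l in labels)
# If only ~9% of tokens are positives, a model that always predicts
# "not an ingredient" already reaches ~91% token accuracy.
print("Positive label ratio: {:.3f}".format(n_pos / n_tot))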
Modelling with an LSTM network
First we look at the distribution of recipe lengths to decide on the length to which we pad the network inputs.
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist([len(tokens) for tokens in tokenized], bins=20);
Based on the histogram, we pick a maximum length of 400 tokens.
MAX_LEN = 400
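A quick check of how many recipes actually fit into this limit (recipes longer than MAX_LEN get truncated by the padding step below):
lengths = np.array([len(tokens) for tokens in tokenized])
# Fraction of recipes with at most MAX_LEN tokens.
print("Recipes within MAX_LEN: {:.1%}".format((lengths <= MAX_LEN).mean()))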
Prepare the sequences by padding
Now we map the words to integers and pad the sequences.
from keras.preprocessing.sequence import pad_sequences

def prepare_sequences(texts, max_len, vocab={"<UNK>": 1, "<PAD>": 0}):
    # Map each token to its vocabulary index (unknown words map to <UNK>)
    # and pad (or truncate) every sequence to max_len.
    X = [[vocab.get(w.text, vocab["<UNK>"]) for w in s] for s in texts]
    return pad_sequences(maxlen=max_len, sequences=X, padding="post", value=vocab["<PAD>"])
Using TensorFlow backend.
X_seq = prepare_sequences(tokenized, max_len=MAX_LEN, vocab=vocab)
X_seq[1]
array([192, 193, 194, 183, 195, 196, 128, 197, 9, 198, 199, 200, 201,
202, 203, 60, 204, 205, 9, 13, 206, 15, 23, 98, 207, 208,
51, 209, 68, 202, 203, 25, 6, 195, 125, 202, 210, 211, 212,
33, 45, 213, 214, 100, 196, 13, 215, 216, 217, 33, 9, 68,
218, 219, 213, 169, 35, 82, 100, 220, 221, 202, 6, 222, 45,
223, 48, 224, 33, 67, 225, 100, 226, 6, 227, 228, 229, 130,
45, 92, 85, 230, 211, 231, 6, 232, 233, 234, 235, 145, 157,
236, 9, 237, 238, 104, 239, 210, 240, 157, 241, 54, 6, 109,
242, 243, 244, 245, 246, 187, 247, 6, 248, 183, 249, 250, 33,
129, 13, 251, 252, 101, 253, 33, 254, 9, 31, 255, 40, 172,
6, 2, 256, 257, 177, 258, 259, 260, 33, 42, 261, 262, 263,
131, 264, 265, 266, 33, 267, 74, 268, 269, 68, 270, 6, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
y_seq = []
for l in labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except IndexError:
            # Positions beyond the recipe length get the negative class.
            y_i.append(0.0)
    y_seq.append(np.array(y_i))
y_seq = np.array(y_seq)
y_seq = y_seq.reshape(y_seq.shape[0], y_seq.shape[1], 1)
Setup the network
Now we can start setting up the model.
import tensorflow as tf
from tensorflow.keras import layers
print(tf.VERSION)
print(tf.keras.__version__)
1.12.0
2.1.6-tf
We build a simple 2-layer LSTM-based sequence tagger with tensorflow.keras.
model = tf.keras.Sequential()
model.add(layers.Embedding(input_dim=len(vocab), mask_zero=True, output_dim=50))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.SpatialDropout1D(0.2))
model.add(layers.Bidirectional(layers.LSTM(units=64, return_sequences=True)))
model.add(layers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 50)          884350
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, None, 50)          0
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         58880
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, None, 128)         0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 128)         98816
_________________________________________________________________
time_distributed (TimeDistri (None, None, 1)           129
=================================================================
Total params: 1,042,175
Trainable params: 1,042,175
Non-trainable params: 0
_________________________________________________________________
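As a sanity check on these parameter counts: the embedding layer has 17,687 × 50 = 884,350 weights, the first bidirectional LSTM has 2 × 4 × (64 × (50 + 64) + 64) = 58,880, the second has 2 × 4 × (64 × (128 + 64) + 64) = 98,816, and the final dense layer has 128 × 1 + 1 = 129, which together give the 1,042,175 total shown above.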
And now we fit it.
history = model.fit(X_seq, y_seq, epochs=10, batch_size=256, validation_split=0.1)
Train on 9900 samples, validate on 1100 samples
Epoch 1/10
9900/9900 [==============================] - 131s 13ms/step - loss: 0.3855 - acc: 0.9019 - val_loss: 0.2982 - val_acc: 0.9108
Epoch 2/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.2846 - acc: 0.9105 - val_loss: 0.2620 - val_acc: 0.9108
Epoch 3/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.2379 - acc: 0.9112 - val_loss: 0.1950 - val_acc: 0.9160
Epoch 4/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.1214 - acc: 0.9528 - val_loss: 0.0729 - val_acc: 0.9746
Epoch 5/10
9900/9900 [==============================] - 130s 13ms/step - loss: 0.0663 - acc: 0.9757 - val_loss: 0.0668 - val_acc: 0.9763
Epoch 6/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.0616 - acc: 0.9774 - val_loss: 0.0645 - val_acc: 0.9774
Epoch 7/10
9900/9900 [==============================] - 128s 13ms/step - loss: 0.0590 - acc: 0.9785 - val_loss: 0.0625 - val_acc: 0.9782
Epoch 8/10
9900/9900 [==============================] - 126s 13ms/step - loss: 0.0570 - acc: 0.9794 - val_loss: 0.0608 - val_acc: 0.9789
Epoch 9/10
9900/9900 [==============================] - 127s 13ms/step - loss: 0.0545 - acc: 0.9806 - val_loss: 0.0595 - val_acc: 0.9794
Epoch 10/10
9900/9900 [==============================] - 126s 13ms/step - loss: 0.0521 - acc: 0.9815 - val_loss: 0.0564 - val_acc: 0.9808
plt.plot(history.history["loss"], label="trn_loss");
plt.plot(history.history["val_loss"], label="val_loss");
plt.legend();
plt.title("Loss");
plt.plot(history.history["acc"], label="trn_acc");
plt.plot(history.history["val_acc"], label="val_acc");
plt.legend();
plt.title("Accuracy");
Analyse the predictions of the model
Now that the model is trained, we can look at some predictions on the training set. Note that we use a low decision threshold of 0.05, which trades precision for recall: we would rather flag too many tokens as ingredients than miss some.
y_pred = model.predict(X_seq, verbose=1, batch_size=1024)
11000/11000 [==============================] - 10s 945us/step
i = 3343
pred_i = y_pred[i] > 0.05
tokenized[i]
Kohlrabi schälen, waschen und in Stifte schneiden. Brühe und Milch ankochen, Kohlrabi dazugeben, aufkochen lassen und 10 Minuten kochen. Dann herausnehmen und abtropfen lassen, die Brühe aufheben.Butter erhitzen, das Mehl darin anschwitzen, mit Kohlrabibrühe ablöschen und aufkochen lassen. Mit den Gewürzen abschmecken. Kohlrabi wieder dazugeben.Hähnchenbrust schnetzeln, kräftig anbraten und würzen. Das Fleisch in eine Auflaufform geben, die Speckwürfel darüber verteilen. Mit Käse bestreuen. Nun das Gemüse darüber schichten und alles bei 180 °C ca. 25 Minuten überbacken.Tipp:Man kann auch gut gekochte, in Würfel geschnittene Kartoffeln unter die Kohlrabi mischen. Ebenso kann man auch Kohlrabi und Möhren für den Auflauf nehmen. Schmeckt auch sehr lecker!
ingreds = [t.text for t, p in zip(tokenized[i], pred_i) if p]
print(set(ingreds))
{'Milch', 'Kartoffeln', 'Möhren', 'Kohlrabi', 'Butter', 'Gemüse', 'Brühe', 'Käse', 'Speckwürfel', 'Mehl'}
ingreds = [t.text for t, p in zip(tokenized[i], y_seq[i]) if p]
set(ingreds)
{'Butter', 'Kohlrabi', 'Käse', 'Mehl', 'Milch'}
ingredients[i]
['500 g Kohlrabi',
'1/4Liter Hühnerbrühe',
'1/4Liter Milch',
'1 EL Butter',
'30 g Mehl',
'300 g Hähnchenbrustfilet(s)',
'Salz und Pfeffer',
'Muskat',
'50 g Käse, gerieben',
'50 g Speck, gewürfelt']
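To make the comparison explicit for this example, we can diff the model's predictions against the bootstrapped labels:
pred_set = {t.text for t, p in zip(tokenized[i], pred_i) if p}
label_set = {t.text for t, l in zip(tokenized[i], y_seq[i]) if l}
# Ingredient tokens the model found that the bootstrap labelling missed.
print(pred_set - label_set)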
This looks very good! The model picks up ingredients such as the Brühe and the Speckwürfel that the bootstrapped labels missed, so it seems to identify ingredients better than our training labels do. We therefore use the predicted labels to fine-tune the network.
new_labels = []
for pred_i, ti in zip(y_pred, tokenized):
    l_i = []
    # Tokens the model flags as ingredients in this recipe.
    ci = [t.text for t, p in zip(ti, pred_i > 0.05) if p]
    for token in ti:
        l_i.append(any((c == token.text or c == token.text[:-1] or c[:-1] == token.text) for c in ci))
    new_labels.append(l_i)
y_seq_new = []
for l in new_labels:
    y_i = []
    for i in range(MAX_LEN):
        try:
            y_i.append(float(l[i]))
        except IndexError:
            y_i.append(0.0)
    y_seq_new.append(np.array(y_i))
y_seq_new = np.array(y_seq_new)
y_seq_new = y_seq_new.reshape(y_seq_new.shape[0], y_seq_new.shape[1], 1)
We fit the network again for one epoch with the new labels.
history = model.fit(X_seq, y_seq_new, epochs=1, batch_size=256, validation_split=0.1)
Train on 9900 samples, validate on 1100 samples
Epoch 1/1
9900/9900 [==============================] - 127s 13ms/step - loss: 0.0479 - acc: 0.9834 - val_loss: 0.0533 - val_acc: 0.9824
Look at the test data
Now we can look at the test data we set aside at the beginning.
eval_ingredients = eval_df.Ingredients.values
eval_tokenized = [nlp(t) for t in eval_df.Instructions.values]
X_seq_test = prepare_sequences(eval_tokenized, max_len=MAX_LEN, vocab=vocab)
y_pred_test = model.predict(X_seq_test, verbose=1, batch_size=1024)
1190/1190 [==============================] - 2s 2ms/step
i = 893
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
Den Quark durch ein Sieb in eine tiefe Schüssel streichen.Das Mehl, den Zucker, Salz, Vanillezucker und das rohe Ei/er gut verrühren.Diese Masse auf einem mit Mehl bestreuten Backbrett zu einer dicken Wurst rollen und in 10 gleichgroße Scheiben schneiden. In heißer Butter von beiden Seiten goldbraun braten.Die fertigen Tworoshniki werden mit Puderzucker bestreut oder warm mit saurer Sahne oder Obstsirup zu Tisch gebracht.
['500 g Quark, sehr trockenen', '80 g Mehl', '2 EL Zucker', '1 Pck. Vanillezucker', 'Salz', '1 Ei(er), evt. 2', '4 EL Butter oder Margarine', 'Puderzucker', '125 ml Sirup (Obstsirup) oder saure Sahne', 'Mehl für die Arbeitsfläche']
{'Quark', 'Obstsirup', 'Butter', 'Vanillezucker', 'Puderzucker', 'Sahne', 'Zucker', 'Mehl', 'Salz'}
i = 26
pred_i = y_pred_test[i] > 0.05
print(eval_tokenized[i])
print()
print(eval_ingredients[i])
print()
ingreds = [t.text for t, p in zip(eval_tokenized[i], pred_i) if p]
print(set(ingreds))
Spargel putzen und bissfest garen. Herausnehmen, abschrecken und warm stellen.Fisch mit Salz und Pfeffer würzen. Öl in einer Pfanne erhitzen und den Lachs darin 3-4 Min. je Seite braten. Butter schmelzen, Mandeln hinzufügen und leicht bräunen. Schale der Limette mit einem Zestenreißer abziehen, den Saft auspressen, beides in die Butter geben. Mit Salz und Pfeffer würzen.Spargel abtropfen lassen, mit Lachs anrichten und mit Mandelbutter beträufeln.Dazu passen Salzkartoffeln.
['500 g Spargel, weißer', '500 g Spargel, grüner', 'Salz und Pfeffer', '4 Scheibe/n Lachsfilet(s) (à ca. 200g)', '2 EL Öl', '100 g Butter', '30 g Mandel(n) in Blättchen', '1 Limette(n), unbehandelt']
{'Pfeffer', 'Öl', 'Fisch', 'Saft', 'Butter', 'Limette', 'Lachs', 'Spargel', 'Mandeln', 'Salz'}
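Finally, to make the model easier to reuse, the whole pipeline can be wrapped in one small helper. A minimal sketch using the same vocab, MAX_LEN and 0.05 threshold as above (the helper name extract_ingredients is just for illustration):
def extract_ingredients(text, threshold=0.05):
    # Tokenize and encode a single recipe, then return the set of
    # tokens the model scores above the threshold.
    tokens = nlp(text)
    x = prepare_sequences([tokens], max_len=MAX_LEN, vocab=vocab)
    probs = model.predict(x)[0]
    return {t.text for t, p in zip(tokens, probs > threshold) if p}

print(extract_ingredients(eval_df.Instructions.values[0]))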
This looks quite good! We have built a fairly strong model for identifying ingredients in recipes. I hope you learned something and had some fun. You could improve the model further by labeling some data manually or by adding labels from a dictionary of known ingredients.