Language models have gained popularity in NLP in recent years. Models trained on large corpora of text are often adapted to a custom dataset by resuming training on the new data. Sometimes you might have enough data and want to train a language model like BERT or RoBERTa from scratch. Python libraries like huggingface transformers make it quite easy to do this. While there are many tutorials about tokenization and about how to train the model, there is not much information about how to load the data into the model. This guide aims to close this gap.
Best practices from research
To understand what we want to do, we have a look at two popular research papers on language modelling with transformer models.
BERT1
In the paper, the authors first define how they structure their text.
Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence.
To generate each training input sequence, they sample a span of text from the corpus, which they refer to as “sentences” even though they are typically much longer than single sentences. They also recommend “to use a document-level corpus rather than a shuffled sentence-level corpus (such as the Billion Word Benchmark [Chelba et al., 2013]) in order to extract long contiguous sequences”.
RoBERTa2
First, they notice:
We find that using individual sentences hurts performance on downstream tasks.
They run extensive experiments with different approaches; the best one is described as follows.
Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents.
These findings suggest that
- full 512-token segments should be used (a packing sketch follows below).
- sampling from only one document at a time does not make much of a difference compared to just sampling contiguous text across documents.
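To make the packing idea concrete, here is a minimal sketch of it, assuming a huggingface tokenizer. `pack_documents` is a hypothetical helper of my own, not how RoBERTa's actual pipeline is implemented:

```python
from transformers import RobertaTokenizerFast

def pack_documents(documents, tokenizer, block_size=512):
    """Concatenate tokenized documents and cut them into block_size chunks.

    An extra separator token is inserted between documents, so a chunk
    may cross document boundaries (the FULL-SENTENCES idea from RoBERTa).
    """
    token_ids = []
    for doc in documents:
        token_ids.extend(tokenizer.encode(doc, add_special_tokens=False))
        token_ids.append(tokenizer.sep_token_id)  # separator between documents

    # keep only full blocks; the trailing remainder is dropped
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# usage sketch
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
chunks = pack_documents(["First document ...", "Second document ..."], tokenizer)
```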
How to achieve this with the popular libraries?
Let's find out how to achieve this with the current huggingface transformers library by having a look at its source code.
The library offers two configurations. The default uses the `TextDataset` class for data ingestion. The second option is `--line-by-line`, which uses the `LineByLineTextDataset` class.
`TextDataset`: reads the full input text, tokenizes it and cuts it into chunks of `block_size` tokens. It then adds special tokens (here just `<s>`, or `[SEP]`/`[CLS]`).
```python
for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
    self.examples.append(
        tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size])
    )
```
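For orientation, constructing such a dataset looks roughly like this. The model name and file path are placeholders, and in more recent library versions `TextDataset` is deprecated in favour of the `datasets` library:

```python
from transformers import RobertaTokenizerFast, TextDataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# reads the whole file, tokenizes it and cuts it into 512-token blocks
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="corpus.txt",  # placeholder path
    block_size=512,
)
```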
`LineByLineTextDataset`: reads each line separately, tokenizes it and truncates the lines to `block_size`. It also adds special tokens.
```python
with open(file_path, encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
```
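The usage mirrors the previous sketch; again, the model name and file path are placeholders:

```python
from transformers import LineByLineTextDataset, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# every non-empty line of the file becomes one example,
# truncated to 512 tokens; everything after that is discarded
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="corpus.txt",  # placeholder path
    block_size=512,
)
```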
Conclusions
- don’t think in terms of sentences.
- use `TextDataset`, because `--line-by-line` will throw away a lot of data if not used correctly.
- if you use `--line-by-line`, you need to be aware of what it does and structure your data yourself (a preprocessing sketch follows below).
- things are constantly changing and other libraries might implement different approaches.
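As a rough illustration of the third point, the sketch below pre-packs documents into lines of roughly `block_size` tokens so that the line-by-line truncation discards almost nothing. The function name and the document list are my own placeholders, and converting tokens back to a string is not perfectly lossless for every tokenizer:

```python
def write_packed_lines(documents, tokenizer, output_path, block_size=512):
    """Write one roughly block_size-token chunk of text per line."""
    # leave room for the special tokens that LineByLineTextDataset will add
    budget = block_size - tokenizer.num_special_tokens_to_add()
    buffer = []
    with open(output_path, "w", encoding="utf-8") as f:
        for doc in documents:
            buffer.extend(tokenizer.tokenize(doc))
            while len(buffer) >= budget:
                chunk, buffer = buffer[:budget], buffer[budget:]
                f.write(tokenizer.convert_tokens_to_string(chunk) + "\n")
```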
That’s it. Let me know what you think and if this guide matches your experience. Cheers.