As an NLP data scientist, I constantly read papers on topics ranging from word vectors and RNNs to transformers. Reading papers is fun, and gives me the illusion that I have mastered a variety of techniques. But when it comes to reproducing them, difficulties emerge. As far as I know, many NLP learners have run into the same situation as me. So I decided to start a series of posts focusing on implementing classical NLP papers. I also created a GitHub repository for this effort.
This post is the first of the series, and it reproduces the GloVe model based on the original paper. As stated before, the focus is purely on the implementation. For more details about the underlying theory, please refer to the original paper.
According to the paper, the GloVe model was trained on a single machine. The released code was written in C, which can be somewhat unfamiliar to NLP learners. So I wrote a complete Python implementation of the model, which stays true to the goal of training a huge vocabulary on a single machine. The following sections walk through the implementation details step by step. The full code is here.
For this project, I use the Text8 dataset as the training data. To get it, we can use the gensim downloader:
```python
import gensim.downloader as api

dataset = api.load("text8")
```
The dataset is a list of lists, where each sublist is a list of words representing a sentence. We only want a flat list of all the words, so flatten it with itertools:
```python
import itertools

corpus = list(itertools.chain.from_iterable(dataset))
```
Okay, now we have the training corpus.
When working on a machine learning model, there is always a number of parameters to configure, such as the data file path, batch size, word embedding dimension, and so on. These parameters can incur a lot of overhead if you don't manage them well. Based on my experience, the best way is to store all of them in a single yaml file named config.yaml. In the code, also add a loading function that reads the configuration from the yaml file, like this:
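The original loading function is not reproduced here, so below is a minimal sketch of what such a helper could look like. It assumes PyYAML is installed; the function name `load_config` and the `SimpleNamespace` wrapper are my own choices, not necessarily what the repository uses:

```python
import yaml  # PyYAML, assumed installed
from types import SimpleNamespace

def load_config(path="config.yaml"):
    # Read all training parameters from the yaml file and expose them
    # as attributes, so callers can write config.batch_size etc.
    with open(path) as f:
        params = yaml.safe_load(f)
    return SimpleNamespace(**params)
```

A `SimpleNamespace` keeps the helper dependency-free; a dataclass or a small config class would work just as well.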
Then in the rest of the code, we can refer to the parameters as config.batch_size or config.learning_rate instead of hard-coded values, which also makes the code cleaner.
And that's all the preparation work needed. Let's proceed with the actual two-step training of the GloVe model!
For counting cooccurring pairs, we first need to determine the vocabulary. Here are some requirements for the vocabulary:
- It is a set of tokens appearing in the corpus.
- Each token is mapped to an integer.
- If a token doesn't belong to the corpus, it should be represented as an unknown token, or "unk".
- For counting cooccurring pairs, only a subset of the tokens is needed, such as the top k most frequent tokens.
To fulfill these requirements in a structured manner, a Vocabulary class is created. This class has four fields:
- token2index: A dict that maps a token to an index. The index starts from 0, and increments by 1 each time a previously unseen token is added.
- index2token: A dict that maps an index back to a token.
- token_counts: A list where the ith value is the count of the token with index i.
- _unk_token: An integer used as the index for unknown tokens. The default value is -1.
It also defines the following methods:
- add(token): Add a new token to the vocabulary. If it is previously unseen, a new index is generated. The token's count is updated as well.
- get_index(token): Return the index of the token.
- get_token(index): Return the token corresponding to the index.
- get_topk_subset(k): Create a new vocabulary with the top k most frequent tokens.
- shuffle(): Randomly shuffle all the tokens so that the mapping between tokens and indices is randomized. The reason why this method is needed will be revealed later, when we actually count cooccurring pairs.
With this understanding in mind, we can now look at the code:
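The original listing lives in the repository; the following is a hedged reconstruction of what the Vocabulary class described above could look like, with the method bodies being my own sketch guided by the field and method descriptions:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Vocabulary:
    token2index: dict = field(default_factory=dict)
    index2token: dict = field(default_factory=dict)
    token_counts: list = field(default_factory=list)
    _unk_token: int = -1

    def add(self, token):
        # Assign the next free index to a previously unseen token, then bump its count.
        if token not in self.token2index:
            index = len(self.token2index)
            self.token2index[token] = index
            self.index2token[index] = token
            self.token_counts.append(0)
        self.token_counts[self.token2index[token]] += 1

    def get_index(self, token):
        # Unknown tokens map to the _unk_token index (-1 by default).
        return self.token2index.get(token, self._unk_token)

    def get_token(self, index):
        return self.index2token[index]

    def get_topk_subset(self, k):
        # Build a new vocabulary containing only the k most frequent tokens.
        top_tokens = sorted(
            self.token2index,
            key=lambda t: self.token_counts[self.token2index[t]],
            reverse=True,
        )[:k]
        subset = Vocabulary()
        for token in top_tokens:
            index = len(subset.token2index)
            subset.token2index[token] = index
            subset.index2token[index] = token
            subset.token_counts.append(self.token_counts[self.token2index[token]])
        return subset

    def shuffle(self):
        # Randomize the token-to-index mapping; why this matters is explained
        # later, when we count cooccurring pairs in partitions.
        tokens = list(self.token2index)
        random.shuffle(tokens)
        counts = [self.token_counts[self.token2index[t]] for t in tokens]
        self.token2index = {t: i for i, t in enumerate(tokens)}
        self.index2token = {i: t for i, t in enumerate(tokens)}
        self.token_counts = counts
```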
For the class implementation, I make use of Python's dataclass feature. With it, I only need to define the fields with type annotations, and the __init__() method is generated automatically. I can also set default values for fields when defining them. For example, token2index defaults to an empty dict by setting default_factory=dict. For more details about dataclasses, please refer to the official documentation.
Now that we have the Vocabulary class, the remaining question is: how do we use it? There are basically two use cases:
- Create a vocabulary from the corpus, which consists of the top k most frequent tokens.
- When counting cooccurring pairs, use the created vocabulary to convert the corpus, which is a list of tokens, into integer indices.
I create another class, Vectorizer, to coordinate these two use cases. It only has one field, vocab, which refers to the vocabulary created from the corpus. It has two methods:
- from_corpus(corpus, vocab_size): This is a class method. First, a vocabulary is created by adding all the tokens in the corpus. Then the top vocab_size most frequent tokens are selected to create a new vocabulary. This vocabulary is shuffled and used to instantiate the Vectorizer instance. The reason for shuffling will be explained later.
- vectorize(corpus): Convert the given corpus, which is a list of tokens, into a list of indices.
The full code is as below:
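Again, the repository holds the full listing; here is a sketch of a Vectorizer matching the description above, with a condensed Vocabulary repeated so the snippet runs standalone (the method bodies are my own reconstruction):

```python
import random
from dataclasses import dataclass, field

# Condensed Vocabulary (as sketched earlier), repeated for a self-contained example.
@dataclass
class Vocabulary:
    token2index: dict = field(default_factory=dict)
    index2token: dict = field(default_factory=dict)
    token_counts: list = field(default_factory=list)
    _unk_token: int = -1

    def add(self, token):
        if token not in self.token2index:
            self.index2token[len(self.token2index)] = token
            self.token2index[token] = len(self.token2index)
            self.token_counts.append(0)
        self.token_counts[self.token2index[token]] += 1

    def get_index(self, token):
        return self.token2index.get(token, self._unk_token)

    def get_topk_subset(self, k):
        top = sorted(self.token2index,
                     key=lambda t: self.token_counts[self.token2index[t]],
                     reverse=True)[:k]
        subset = Vocabulary()
        for token in top:
            subset.index2token[len(subset.token2index)] = token
            subset.token2index[token] = len(subset.token2index)
            subset.token_counts.append(self.token_counts[self.token2index[token]])
        return subset

    def shuffle(self):
        tokens = list(self.token2index)
        random.shuffle(tokens)
        counts = [self.token_counts[self.token2index[t]] for t in tokens]
        self.token2index = {t: i for i, t in enumerate(tokens)}
        self.index2token = {i: t for i, t in enumerate(tokens)}
        self.token_counts = counts

@dataclass
class Vectorizer:
    vocab: Vocabulary

    @classmethod
    def from_corpus(cls, corpus, vocab_size):
        # Count every token, keep the top vocab_size, then shuffle the mapping.
        vocab = Vocabulary()
        for token in corpus:
            vocab.add(token)
        top_vocab = vocab.get_topk_subset(vocab_size)
        top_vocab.shuffle()  # balances memory usage across scans (explained later)
        return cls(vocab=top_vocab)

    def vectorize(self, corpus):
        # Out-of-vocabulary tokens become the unk index.
        return [self.vocab.get_index(token) for token in corpus]
```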
Scanning Context Windows
Now that we have the vectorizer to convert all words into indices, the remaining task is to scan all the context windows and count all possible cooccurring pairs. Since the cooccurrence matrix is sparse, it is reasonable to use a Counter to count the pairs. The key is (word i's index, word j's index), where word j appears in the context of word i. The value is a floating-point number representing the count. However, two issues may arise with this approach.
Issue 1: If we count all cooccurring pairs in a single scan, we will likely run out of memory, because the number of distinct (word i's index, word j's index) pairs can be huge.
Solution: Instead, we can count the cooccurring pairs in multiple scans. In each scan, we restrict word i's index to a small range, so that the number of distinct pairs is greatly reduced. Let's say the vocabulary has 100,000 distinct tokens. If we count all pairs in a single scan, the number of distinct pairs can be as large as 10¹⁰. Instead, we can count all pairs in 10 scans. In the first scan, we restrict word i's index to be between 0 and 9999; in the second scan, between 10000 and 19999; in the third scan, between 20000 and 29999, and so on. After each scan finishes, we save the counts to disk. Now within each scan, the number of distinct pairs is at most 10⁹, one tenth of the original number.
The idea behind this approach is that instead of calculating the whole cooccurrence matrix in one scan, we divide the matrix into 10 smaller rectangles and calculate them sequentially. The picture below visualizes the idea.
This approach is scalable in the sense that as the vocabulary size increases, we can always increase the number of scans to reduce the memory usage. The main drawback is that the runtime also increases on a single machine. However, since there is no dependency between scans, they can easily be parallelized with Spark. But that is out of our scope.
Also, at this point, the reason for shuffling the vocabulary can be revealed. When we create the vocabulary from the most frequent tokens, the indices of these tokens are ordered: index 0 corresponds to the most frequent token, index 1 to the second most frequent token, and so on. Continuing with the example of 100,000 tokens, in the first scan we would count pairs of the 10,000 most frequent tokens, and the number of distinct pairs would be huge, while in the remaining scans the number of distinct pairs would be much smaller. This leads to a memory usage imbalance between scans. By shuffling the vocabulary, the distinct pairs are distributed evenly across scans and the memory usage is balanced.
Issue 2: Continuing from the solution to Issue 1, how do we save the counts of each scan to disk? The most obvious way is to append the (word i's index, word j's index, count) triplets to a text file shared between scans. But using this file later for training involves too much overhead.
Solution: There is a Python library, h5py, that provides a Pythonic interface to the HDF5 binary format. It lets you store huge amounts of numerical data and easily manipulate them as if they were real NumPy arrays. For more details about the library, check out its documentation.
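To illustrate the idea, a scan's triplets can be written as chunked HDF5 datasets and sliced back like NumPy arrays. The dataset names, chunk sizes, and file layout below are assumptions for the sketch, not the repository's actual format:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout: one dataset of (word i, word j) index pairs and one of
# their (floating-point) counts, both stored in chunks.
path = os.path.join(tempfile.gettempdir(), "cooccurrence_demo.hdf5")

with h5py.File(path, "w") as f:
    pairs = np.array([[0, 1], [0, 2], [1, 2]], dtype=np.int64)
    counts = np.array([2.0, 1.0, 0.5])
    f.create_dataset("pairs", data=pairs, chunks=(2, 2))
    f.create_dataset("counts", data=counts, chunks=(2,))

# Reading back a slice behaves like slicing a NumPy array.
with h5py.File(path, "r") as f:
    first_counts = f["counts"][:2]
```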
As before, I create a CooccurrenceEntries class that does the counting and saves the result to disk using the proposed solutions. The class has two fields:
- vectorizer: The Vectorizer instance created from the corpus.
- vectorized_corpus: A list of word indices. This is the result of vectorizing the original corpus, which is a list of words, with the vectorizer.
It has two main methods:
- setup(corpus, vectorizer): This is a class method used to create the CooccurrenceEntries instance. The vectorized_corpus is generated by calling the vectorizer's vectorize method on the corpus.
- build(window_size, num_partitions, chunk_size, output_directory="."): This method counts the cooccurring pairs in num_partitions scans and writes the result to the output directory. The chunk_size argument is used to save the data in chunks in the HDF5 format. The reason for saving in chunks will be discussed in the model training section. In short, it allows faster generation of training batches.
The implementation is as below:
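The full class is in the repository; here is a condensed, hedged sketch of just the counting step of build(). It counts pairs partition by partition with a Counter, restricting word i's index to one slice of the vocabulary per scan; a real implementation would then write each partition's counter to an HDF5 file rather than collect them in a list. The function name and signature are my own:

```python
from collections import Counter

def count_cooccurrences(vectorized_corpus, vocab_size, window_size, num_partitions):
    """Count cooccurring (word i, word j) pairs in num_partitions scans.

    Each scan only keeps pairs whose word i index falls in one contiguous
    slice of the vocabulary, so only a fraction of the distinct pairs is
    held in memory at once.
    """
    partition_size = -(-vocab_size // num_partitions)  # ceiling division
    partitions = []
    for p in range(num_partitions):
        low, high = p * partition_size, (p + 1) * partition_size
        counts = Counter()
        for center, word_i in enumerate(vectorized_corpus):
            if word_i < low or word_i >= high:
                continue  # word i is outside this scan's index range
            start = max(0, center - window_size)
            end = min(len(vectorized_corpus), center + window_size + 1)
            for context in range(start, end):
                if context == center:
                    continue
                # GloVe weights a cooccurrence by the inverse of the distance
                # between the two words; a real implementation would also
                # filter out unk indices here.
                weight = 1.0 / abs(context - center)
                counts[(word_i, vectorized_corpus[context])] += weight
        partitions.append(counts)  # in practice: flush this scan to HDF5
    return partitions
```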