GloVe: Global Vectors for Word Representation
Hey guys, welcome to the second chapter of the series. I hope you enjoyed my previous article on Sequence to Sequence Learning with Neural Networks. If you haven’t read it, please refer to this link. Happy reading :)!!!
In this article, we will go through an important research paper in the field of Natural Language Processing titled “GloVe: Global Vectors for Word Representation”, authored by Jeffrey Pennington, Richard Socher and Christopher D. Manning from Stanford University. This paper was published in 2014 and has helped the NLP community build intricate models that capture the nuances of the English language.
The authors developed a method to learn vector space representations of words by capturing semantic and syntactic regularities. The result is a new global log-bilinear regression model that combines the advantages of two major model families: global matrix factorization and local context window methods. The model is trained only on the non-zero elements of a word-word co-occurrence matrix rather than on the entire sparse matrix. It produced state-of-the-art results with an accuracy of 75 percent on word analogy tasks.
In my previous article, under the training details, we observed that the authors used pre-trained word embeddings with 160,000 English words. The vector representations mentioned above are also known as word embeddings. These word embeddings have helped language models achieve better accuracy, as the model understands the semantic relationships between words.
In this article, we will be going through:
1. Word embeddings — meaning and reasons for usage in NLP
2. Global Matrix Factorization — Math and drawbacks
3. Local Context Window — Math and drawbacks
4. Current model: GloVe
5. Training details
Let’s get started!
1. Word embeddings — meaning and reasons for usage in NLP
Let’s consider a sentence as below:
“The lady went to buy groceries from the store.”
Now if we have to feed a representation of this sentence to a computer, how will you do it?
A very simple approach called ‘one-hot encoding’ was followed in the earlier stages of Natural Language Processing. Let’s assume that we have a 10,000-word corpus. We assign an index to every word in the corpus. Let Oi be a column vector with 0’s at all positions and a 1 at index i. Suppose the word ‘the’ is the 9281st word of our corpus. It can be represented by O9281, a 10,000-dimensional column vector with a 1 at the 9281st row. Similarly, the word ‘groceries’ can be represented by O1246, with a 1 at the 1246th row. The whole sentence can be represented by a matrix of 9 such column vectors with 10,000 dimensions each (‘9’ refers to the number of words in the sentence above).
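The encoding above can be sketched in a few lines of NumPy. The indices 9281 and 1246 come from the example in the text; the indices chosen for the remaining words are hypothetical placeholders:

```python
import numpy as np

def one_hot(index, vocab_size=10_000):
    """Column vector with 0's everywhere and a 1 at the given index."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec

o_the = one_hot(9281)        # 'the' as a 10,000-dimensional vector
o_groceries = one_hot(1246)  # 'groceries'

# The 9-word sentence becomes a 10,000 x 9 matrix of one-hot columns.
# All indices except 9281 and 1246 are made up for illustration.
indices = [9281, 100, 200, 300, 400, 1246, 500, 9281, 600]
sentence = np.stack([one_hot(i) for i in indices], axis=1)
print(sentence.shape)  # (10000, 9)
```

Notice that each vector has exactly one non-zero entry, which is the sparsity problem discussed next.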
Imagine the sparsity of this matrix. It is bulky, it takes a lot of time to compute with, and it doesn’t help the machine understand the relationship between words (the machine can’t tell that apples, oranges and bananas are fruits from these sparse vectors).
So how do you think companies like Google and IBM have been able to develop conversational AIs that can generate answers to the questions posed to them? How can machines understand the nuances of English words and correlate words together?
It is because of word embeddings. To give a basic overview of this concept, a word embedding is a spatial representation of words in an n-dimensional space. Let’s look at the picture given below.
This is a very simple representation of words in a 2-dimensional space, where Japan and Tokyo are adjacent to each other, which implies that these 2 words are related. If you extend these co-ordinates into an n-dimensional space, it becomes a word embedding where every word has a co-ordinate associated with it, and words that are similar to each other, like synonyms, are adjacent to each other.
These word embeddings capture the semantic and syntactic nuances of words which help the machine to understand that apples and oranges are similar to each other and belong to the same family. These embeddings have laid a strong base for various applications of Natural Language Processing.
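To make “similar words are adjacent” concrete, a common measure of closeness between embedding vectors is cosine similarity. The 2-D vectors below are invented for illustration, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 2-D coordinates, mimicking the picture above
japan = np.array([0.90, 0.80])
tokyo = np.array([0.85, 0.82])
banana = np.array([-0.70, 0.30])

print(cosine_similarity(japan, tokyo))   # close to 1: related words
print(cosine_similarity(japan, banana))  # much lower: unrelated words
```

Real embeddings behave the same way, just in hundreds of dimensions instead of two.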
2. Global Matrix Factorization — Math and drawbacks
Latent Semantic Analysis (LSA) was one of the predominant methods for learning word vectors and representations. Assuming we have a large corpus of documents, we take the term frequency in every document (a count for each word) and convert that into a term-document matrix. Here, each column represents a document and each row represents the distribution of a particular word across the documents. Let’s understand the concept with an example:
Let’s assume we have 6 documents with us and build a term-document matrix.
The picture above shows a term document matrix where d1,d2,d3,d4,d5,d6 are documents. The value in the matrix denotes the frequency of word i in document j.
Let’s apply Singular Value Decomposition to this matrix. Given a matrix M of dimensions m x n, it can be split into a product of 3 matrices:

M = U D Vᵀ

where, with r = min(m, n),
U is a matrix with orthonormal columns of dimensions [m, r]
D is a diagonal matrix of dimensions [r, r]
V is a matrix with orthonormal columns of dimensions [n, r]
If we apply SVD on our matrix M, we get these three factors.
The 3 matrices have their own significance. Matrix U is a spatial representation of the 5 words in a 5-dimensional space. Matrix D is a 5 x 5 diagonal matrix whose diagonal entries highlight the significance of the 5 dimensions. Matrix Vᵀ is a spatial representation of the 6 documents in a 5-dimensional space. With these, we can form clusters and understand word representations.
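Here is a minimal sketch of this LSA step using NumPy’s SVD; the 5 x 6 term-document counts are invented for illustration:

```python
import numpy as np

# rows = 5 words, columns = 6 documents (d1..d6); toy counts
M = np.array([
    [2, 0, 1, 0, 0, 3],
    [0, 1, 0, 2, 1, 0],
    [1, 0, 0, 1, 0, 2],
    [0, 3, 1, 0, 2, 0],
    [1, 1, 0, 0, 1, 1],
], dtype=float)

# Economy-size SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

print(U.shape)   # (5, 5): each row places a word in 5-dimensional space
print(s)         # singular values, ranking the importance of each dimension
print(Vt.shape)  # (5, 6): each column places a document in that same space

# Sanity check: the factors reconstruct the original matrix
assert np.allclose(U @ np.diag(s) @ Vt, M)
```

In practice, LSA keeps only the top few singular values, giving a low-dimensional representation of words and documents.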
The main drawback with such an approach is that the resulting word vectors perform poorly on word analogy tasks, indicating a sub-optimal vector space structure.
3. Local Context Window — Math and drawbacks
One of the most commonly used methods in the local context window family is the skip-gram method.
Let’s take an example to understand this concept
“The quick brown fox jumps over the lazy dog”
Assume that the local context window size is 2. Let’s take the first word “the”; what are the two words nearest to it? In this case, we get pairs like (the, quick) and (the, brown). For the word “fox”, we get the pairs (fox, quick), (fox, brown), (fox, jumps) and (fox, over). All these word pairs are fed as training samples to a neural network. But how do we represent these words as numbers? We can’t feed a raw word to a neural network and expect it to learn similar words. Instead, we feed one-hot encodings (as described earlier) to the neural network and learn the word distribution.
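The pair-generation step above can be sketched as follows; the window size of 2 matches the example:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens)

print(pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
print([p for p in pairs if p[0] == 'fox'])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

Each pair becomes one training sample: the center word is the input and the context word is the prediction target.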
The neural network used to train a skip-gram model has a hidden layer of 300 linear neurons with no activation function, followed by a softmax output layer that gives a distribution over words for a given input word. With a large training corpus, the model learns word analogies and gives us a sense of word representations in an N-dimensional space.
The drawback associated with these models is that they poorly utilize the statistics of the corpus, since they train on separate local context windows instead of on global co-occurrence counts.
4. Current model: GloVe
The main aim of unsupervised learning is to extract meaningful interpretations from data purely based on statistics. In the case of word embeddings, we need to understand how to extract the best information from a word-word co-occurrence matrix. The authors have proposed a novel approach towards learning word embeddings that takes the best characteristics from the methods mentioned above. Let’s assume we have a word-word co-occurrence matrix X, where the entry Xij denotes the number of times word j occurs in the context of word i.
Let Xi = Σk Xik denote the number of times any word occurs in the context of word i.
Let Pij = P(j|i) = Xij / Xi denote the probability that word j appears in the context of word i.
Let’s analyze a word corpus to construct a word embedding.
This example will help you understand how certain aspects of meaning can be extracted directly from a word-word co-occurrence matrix. Let’s take i = ice and j = steam, and probe words k = solid, gas, water and fashion. For words k related to ice but not to steam, the ratio P(k|ice)/P(k|steam) will be large, and vice versa. If k is related to both ice and steam, or to neither, the ratio will be close to 1. With these ratios, we can separate relevant words like solid and gas from irrelevant words like water and fashion.
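The ratio test can be tried on a toy co-occurrence matrix. The counts below are invented to mimic the behaviour described above; the paper uses real corpus statistics:

```python
import numpy as np

words = ["ice", "steam", "solid", "gas", "water", "fashion"]
idx = {w: i for i, w in enumerate(words)}

# X[i][j] = times word j occurs in the context of word i (toy values)
X = np.array([
    [0, 2, 30, 1, 15, 1],   # ice
    [2, 0, 1, 25, 14, 1],   # steam
    [30, 1, 0, 0, 2, 0],    # solid
    [1, 25, 0, 0, 2, 0],    # gas
    [15, 14, 2, 2, 0, 1],   # water
    [1, 1, 0, 0, 1, 0],     # fashion
])

def P(k, i):
    """P(k|i) = X_ik / X_i: probability that word k appears near word i."""
    return X[idx[i], idx[k]] / X[idx[i]].sum()

for k in ["solid", "gas", "water", "fashion"]:
    print(f"P({k}|ice)/P({k}|steam) = {P(k, 'ice') / P(k, 'steam'):.2f}")
```

With these toy counts, ‘solid’ gives a large ratio, ‘gas’ a small one, and ‘water’ and ‘fashion’ give ratios near 1, as the discussion above predicts.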
The ratio seems to be a good starting point for word vector learning. The probability ratio Pik / Pjk depends on 3 words i, j and k, so the most general model can be represented mathematically as a function F of three word vectors:

F(wi, wj, w̃k) = Pik / Pjk

where wi and wj are word vectors and w̃k is a separate context word vector.
The right-hand side can be derived from the word corpus. The function F can have numerous possibilities, but by enforcing some desired characteristics we can get a unique choice. With the help of word k, we are able to draw a relation between words i & j. Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences, which in a way capture the distance between the two words in the space:

F(wi − wj, w̃k) = Pik / Pjk
Since the value on the right-hand side is a scalar, we need to make sure that the function F returns a scalar. We can take the dot product of the 2 vectors on the left to get a scalar argument:

F((wi − wj) · w̃k) = Pik / Pjk
We can write the equation above in another form. If we require F to turn the difference on the left into a ratio on the right (in other words, F maps sums to products), then

F((wi − wj) · w̃k) = F(wi · w̃k) / F(wj · w̃k)

where the numerator F(wi · w̃k) equals Pik, which depends on words i & k through the vectors for those words. Taking F to be the exponential function satisfies this, giving

wi · w̃k = log(Pik) = log(Xik) − log(Xi)
Since the term log(Xi) is independent of the word k, it can be absorbed into a bias term bi for the word wi. We also need to add a bias term b̃k for w̃k to restore symmetry in the equation. Therefore, the equation becomes:

wi · w̃k + bi + b̃k = log(Xik)
The resulting equation is similar to a simple linear model, but it has a main drawback: it weighs all co-occurrences equally. Converting this into a weighted least squares problem, the authors generated the following cost function to learn the parameters:

J = Σ(i,j=1..V) f(Xij) (wi · w̃j + bi + b̃j − log Xij)²
where V refers to the vocabulary size and f is a weighting function. The model is trained by minimizing this cost function with respect to the word vectors and biases. The characteristics of f should be as follows:
a) It should be continuous
b) It should be non-decreasing so that rare co-occurrences are not overweighted.
c) It should be relatively small for large values of x so that frequent co-occurrences are not overweighted.
Based on experimental results, the authors chose the weighting function

f(x) = (x / xmax)^α if x < xmax, and 1 otherwise.

An α value of 3/4 gave a modest improvement when compared to α = 1, and the authors set xmax to 100.
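The weighting function and the cost function above can be sketched in NumPy. The vocabulary size, random vectors and co-occurrence counts below are placeholder values for illustration:

```python
import numpy as np

def f(x, x_max=100, alpha=0.75):
    """GloVe weighting: (x/x_max)^alpha below the cutoff, 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(0)
V, dim = 5, 10                                   # toy vocabulary and embedding sizes
W = rng.normal(scale=0.1, size=(V, dim))         # word vectors w_i
W_ctx = rng.normal(scale=0.1, size=(V, dim))     # context vectors w~_j
b = np.zeros(V)                                  # biases b_i
b_ctx = np.zeros(V)                              # biases b~_j
X = rng.integers(1, 200, size=(V, V)).astype(float)  # toy co-occurrence counts

# J = sum over i,j of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
J = sum(
    f(X[i, j]) * (W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])) ** 2
    for i in range(V) for j in range(V)
)
print(J)
```

Training then amounts to decreasing J by adjusting W, W_ctx and the biases (the authors use AdaGrad); non-zero counts suffice because f(0) = 0.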
5. Training details
The model was trained on 5 corpora of various sizes:
a) 2010 Wiki Dump with 1 billion tokens
b) 2014 Wiki dump with 1.6 billion tokens
c) Gigaword 5 with 4.3 billion tokens
d) Gigaword 5 + Wikipedia 2014 with 6 billion tokens
e) 42 Billion tokens of web data from Common Crawl.
The Stanford tokenizer was used to build a vocabulary of the 400,000 most frequent words, and the matrix X was constructed using these words. A decreasing weighting function was used within the context window, so that word pairs that are d words apart contribute 1/d to the total count. An initial learning rate of 0.05 was chosen. The model produced word embeddings of 100 and 300 dimensions.
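The distance-based weighting of the co-occurrence counts can be sketched as follows; the tiny corpus and window size are illustrative choices, not the paper’s settings:

```python
from collections import defaultdict

def cooccurrence(tokens, window=10):
    """Co-occurrence counts where a pair d words apart contributes 1/d."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # add 1/d for the pair in both directions (symmetric matrix)
                counts[(center, tokens[i + d])] += 1.0 / d
                counts[(tokens[i + d], center)] += 1.0 / d
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = cooccurrence(tokens, window=3)

print(counts[("quick", "brown")])  # adjacent pair: weight 1.0
print(counts[("quick", "fox")])    # two words apart: weight 0.5
```

This captures the intuition that words appearing closer together are stronger evidence of relatedness than distant pairs.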
GloVe vectors produced state-of-the-art results with an accuracy of 75 percent on word analogy tasks. A detailed explanation of the results can be found here.
I hope you guys enjoyed my second article on word embeddings. Stay tuned for many more interesting articles in the future.
- Jeffrey Pennington, Richard Socher and Christopher D. Manning. GloVe: Global Vectors for Word Representation. [link]