Use Gensim to train a word2vec model
Use Gensim to train a word embedding model on the text8
dataset. The training process is based on the skip-gram
architecture.
Reference:
[1] https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/
# import modules and set up logging
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
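# load the unzipped corpus from http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('data/text8')
# train the model; default window=5
# note: Word2Vec defaults to CBOW (sg=0); pass sg=1 to train the skip-gram architecture
# note: `size` is the gensim 3.x name; gensim 4 renamed it to `vector_size`
model = word2vec.Word2Vec(sentences, size=200)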
Task 1 Analogy Prediction
- Input: a pair of words <a,b> and a third word c
- Output: the fourth word d which holds "a is to b as c is to d"
For example
- "man -> woman" => "king -> queen"
- "Japan -> Japanese" => "Australia -> Australian"
- "Paris -> France" => "Beijing -> China"
The task can be written in mathematical form:
Given word vectors $w_a,w_b,w_c$, find a word vector $w_d$ satisfying:
$$ w_d-w_c\approx w_b-w_a\Leftrightarrow w_d\approx w_b-w_a+w_c $$
We then solve
$$ d=\mathop{\arg\max}_{d'\in V}\cos \left( w_b-w_a+w_c,w_{d'} \right) =\mathop{\arg\max}_{d'\in V}\frac{\left( w_b-w_a+w_c \right) \cdot w_{d'}}{\|w_b-w_a+w_c\|\cdot \|w_{d'}\|} $$
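To make the arithmetic concrete, here is a minimal sketch in plain Python. The 3-D vectors in `toy_vectors` and the helper names `cosine` and `solve_analogy` are made up for illustration; they are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(w_b - w_a + w_c, w_d), excluding a, b, c."""
    target = [wb - wa + wc
              for wa, wb, wc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical toy embeddings: axis 1 ~ "female", axis 2 ~ "royal"
toy_vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}

print(solve_analogy(toy_vectors, "man", "woman", "king"))  # prints: queen
```

Here $w_b-w_a+w_c = (1,1,1)$, which coincides with the toy vector for "queen", so its cosine is 1 and it wins.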
# Use gensim api
test_groups = [
('man', 'woman', 'king'),
('small', 'smaller', 'big'),
('italy', 'italian', 'china'),
('japan', 'tokyo', 'china'),
('cool', 'coolest', 'cold'),
('dark', 'darkest', 'easy'),
('listening', 'listened', 'moving'),
('looking', 'looked', 'swimming'),
('playing', 'played', 'taking'),
('increase', 'increases', 'decrease'),
('predict', 'predicts', 'shuffle'),
('provide', 'provides', 'search'),
('say', 'says', 'speak'),
('Austria', 'Austrian', 'Sweden'),
('Cambodia', 'Cambodian', 'Australia'),
('paying', 'paid', 'striking'),
('running', 'ran', 'taking'),
('selling', 'sold', 'thinking'),
('shrinking', 'shrank', 'jumping')
]
for w in test_groups:
    print("%s -> %s like %s -> %s" % (w[0], w[1], w[2], model.wv.most_similar(positive=[w[1].lower(), w[2].lower()], negative=[w[0].lower()], topn=1)[0][0]))
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> shanghai
cool -> coolest like cold -> kargil
dark -> darkest like easy -> customers
listening -> listened like moving -> penetrated
looking -> looked like swimming -> rides
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> gutter
provide -> provides like search -> tutorial
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> noticeable
running -> ran like taking -> took
selling -> sold like thinking -> thought
shrinking -> shrank like jumping -> kicking
words = model.wv.vocab.keys() # all words in the vocabulary
word2vec = model.wv # look up a word vector with word2vec[word]; note this name shadows the imported word2vec module
# Define a naive most_similar() function
import numpy as np
def cal_cosine(vec_a, vec_b):
    """
    Compute the cosine similarity between vec_a and vec_b
    Input: two word vectors vec_a and vec_b
    Output: the cosine similarity of the two vectors
    """
    numerator = np.dot(vec_a, vec_b)
    denominator = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return numerator / denominator
def find_analogy(a, b, c):
    """
    Find the analogy word
    Input: a pair of words <a,b> and a third word c
    Output: the fourth word d which holds "a is to b as c is to d"
    """
    a, b, c = a.lower(), b.lower(), c.lower() # lowercase all the words
    target = word2vec[b] - word2vec[a] + word2vec[c] # computed once, outside the loop
    max_cosine = -1 # initialize max_cosine to -1, the smallest possible cosine
    d = None # stays None if no candidate word is found
    for word in words:
        if word in [a, b, c]:
            continue
        cosine = cal_cosine(target, word2vec[word])
        # if we find a larger cosine value, save it together with its word
        if cosine > max_cosine:
            max_cosine = cosine
            d = word
    return d
for w in test_groups:
    print("%s -> %s like %s -> %s" % (w[0], w[1], w[2], find_analogy(w[0], w[1], w[2])))
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> beijing
cool -> coolest like cold -> falklands
dark -> darkest like easy -> difficult
listening -> listened like moving -> moves
looking -> looked like swimming -> perished
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> lodewijk
provide -> provides like search -> summarizes
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> caught
running -> ran like taking -> took
selling -> sold like thinking -> realized
shrinking -> shrank like jumping -> kicking
# The test set is downloaded from https://code.google.com/archive/p/word2vec/source/default/source
# load the test set
with open("data/questions-words.txt", "r") as f:
    total_line = 0   # questions for which a prediction was made
    count = 0        # correct predictions
    no_key_num = 0   # questions skipped because a word is out of vocabulary
    for line in f:
        v = line.strip().split(" ")
        if v[0] == ':':
            # lines beginning with ":" are section headers; skip them
            continue
        try:
            pred_word = model.wv.most_similar(positive=[v[1].lower(), v[2].lower()], negative=[v[0].lower()], topn=1)[0][0]
        except KeyError:
            # a query word does not appear in the model vocabulary
            no_key_num += 1
        else:
            # v[3] is the expected answer; it is compared as written, so an answer
            # containing capital letters never matches the lowercase text8 vocabulary
            if pred_word == v[3]:
                count += 1
            total_line += 1
acc = count / total_line
print("The accuracy on the test data = %s" % acc)
print("%s pairs of data were not used because of no matched word in the model" % no_key_num)
The accuracy on the test data = 0.160793194874061
1440 pairs of data were not used because of no matched word in the model
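The loop above compares the prediction with the expected answer exactly as it appears in the file, while text8 is all lowercase, so capitalized answers such as country names cannot match. This may account for part of the gap to the Gensim figure reported below. A minimal sketch of a case-insensitive version follows; `analogy_accuracy`, `toy_predict`, and the sample lines are hypothetical names and data used only for illustration.

```python
def analogy_accuracy(lines, predict):
    """Score analogy questions case-insensitively.

    `predict(a, b, c)` must return the predicted word, or raise KeyError
    when one of the words is out of vocabulary.
    Returns (accuracy over answered questions, number of skipped questions).
    """
    correct = total = skipped = 0
    for line in lines:
        parts = line.lower().split()
        if len(parts) != 4 or parts[0] == ":":
            continue  # section header, blank line, or malformed line
        a, b, c, expected = parts
        try:
            predicted = predict(a, b, c)
        except KeyError:
            skipped += 1
            continue
        total += 1
        correct += predicted == expected
    accuracy = correct / total if total else 0.0
    return accuracy, skipped

# Stand-in for the trained model: a lookup table that raises KeyError
# for unknown questions, mimicking an out-of-vocabulary word.
toy_answers = {
    ("athens", "greece", "paris"): "france",
    ("man", "woman", "king"): "queen",
}

def toy_predict(a, b, c):
    return toy_answers[(a, b, c)]

sample = [
    ": capital-common-countries",
    "Athens Greece Paris France",
    ": family",
    "man woman king queen",
    "man woman uncle aunt",
]
print(analogy_accuracy(sample, toy_predict))  # prints: (1.0, 1)
```

With the real model, `predict` could wrap the `model.wv.most_similar(...)` call used in the loop above.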
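# Gensim API to evaluate the analogy prediction task
# (the file is passed positionally: evaluate_word_analogies() names this argument `analogies`, not `questions`;
#  the log below comes from the older, deprecated accuracy() call, as its DeprecationWarning shows)
model.wv.evaluate_word_analogies("data/questions-words.txt")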
C:\Users\yelbee\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `accuracy` (Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead).
"""Entry point for launching an IPython kernel.
2020-04-05 17:06:55,747 : INFO : capital-common-countries: 34.2% (173/506)
2020-04-05 17:07:01,033 : INFO : capital-world: 18.7% (271/1452)
2020-04-05 17:07:02,009 : INFO : currency: 12.7% (34/268)
2020-04-05 17:07:07,788 : INFO : city-in-state: 11.0% (173/1571)
2020-04-05 17:07:08,987 : INFO : family: 81.4% (249/306)
2020-04-05 17:07:11,881 : INFO : gram1-adjective-to-adverb: 12.3% (93/756)
2020-04-05 17:07:12,983 : INFO : gram2-opposite: 17.3% (53/306)
2020-04-05 17:07:18,473 : INFO : gram3-comparative: 63.9% (805/1260)
2020-04-05 17:07:20,422 : INFO : gram4-superlative: 34.4% (174/506)
2020-04-05 17:07:24,023 : INFO : gram5-present-participle: 30.6% (304/992)
2020-04-05 17:07:30,591 : INFO : gram6-nationality-adjective: 56.1% (769/1371)
2020-04-05 17:07:36,784 : INFO : gram7-past-tense: 27.0% (359/1332)
2020-04-05 17:07:40,974 : INFO : gram8-plural: 43.6% (433/992)
2020-04-05 17:07:43,601 : INFO : gram9-plural-verbs: 35.1% (228/650)
2020-04-05 17:07:43,602 : INFO : total: 33.6% (4118/12268)
Above, we implemented our own naive method that uses cosine similarity for the analogy prediction task; it runs slower than the Gensim API. Most predictions from the two methods agree, for example:
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
However, some predictions differ:
Naive Method: japan -> tokyo like china -> (beijing)
Gensim API Method: japan -> tokyo like china -> (shanghai)
Naive Method: selling -> sold like thinking -> (realized)
Gensim API Method: selling -> sold like thinking -> (thought)
In the first case, the Naive Method is right, because Beijing is the capital of China just as Tokyo is the capital of Japan. The second pair asks for the past tense of a word, so the Gensim API Method (thought) gives the better answer.
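One likely reason for these differences: as far as we know, `most_similar()` in Gensim 3.x scales each input vector to unit length before combining them, whereas `find_analogy()` adds the raw vectors. The sketch below uses made-up 2-D vectors (the words `a`, `b`, `c`, `x`, `y` and all helper names are hypothetical) to show that this preprocessing alone can change the winning word.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(u, v):
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))

def analogy(vectors, a, b, c, normalize):
    """Solve a:b :: c:? with or without unit-normalizing the three inputs."""
    prep = unit if normalize else (lambda v: v)
    va, vb, vc = (prep(vectors[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Toy vectors: "c" has a much larger norm than "a" and "b",
# so it dominates the raw sum but not the normalized one.
toy = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [10.0, 0.0],
    "x": [1.0, 0.1],
    "y": [0.1, 1.0],
}

print(analogy(toy, "a", "b", "c", normalize=False))  # prints: x
print(analogy(toy, "a", "b", "c", normalize=True))   # prints: y
```

With raw vectors the target is (9, 1), which points almost along `x`; after normalization it is (0, 1), which points almost along `y`.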
Finally, we tested the model on the Google analogy test set (the questions-words.txt evaluation above).
Task 2 Clustering Task
In this task, we use the K-Means algorithm to cluster the words into 100 groups.
Input:
- $N$ vectors $x_1, x_2, \cdots, x_N \in \mathbb{R}^n$
- $k$: the number of clusters we want
Output:
- $c_i (i=1,2,\cdots,N)$: the cluster that $x_i$ belongs to
- $z_j (j=1,2, \cdots, k)$: the representative vector of each cluster
Initialization: Initialize $z_1,z_2, \cdots, z_k$ by choosing $k$ vectors from $x_1, x_2, \cdots, x_N$ randomly
Step 1: Given $z_1,z_2, \cdots, z_k$, compute
$$ c_i = \mathop{\arg \min}_{j \in \{1,2,\cdots,k\}} \|x_i-z_j\|_2^2, i =1,2,\cdots,N $$
and define
$$ G_j = \{i | c_i=j\},j=1,2,\cdots,k $$
Step 2: Given $G_1,G_2,\cdots,G_k$, compute
$$ z_j = \frac{1}{|G_j|}\left(\sum_{i\in G_j} x_i \right) $$
Go back to Step 1 and repeat until convergence
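The two steps above can be sketched in plain Python as follows. This is only an illustration on made-up 2-D points, not the scikit-learn implementation used below; the function names, the `max_iter` and `seed` parameters, and the choice to keep the old center when a cluster becomes empty are our own assumptions.

```python
import random

def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(points, k, max_iter=100, seed=0):
    """Naive K-Means: returns (assignments c_i, centers z_j)."""
    rng = random.Random(seed)
    # Initialization: pick k of the input vectors at random as z_1..z_k
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign each x_i to its nearest center
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Step 2: move each center z_j to the mean of its group G_j
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # assumption: keep the old center if G_j is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

# Two well-separated toy blobs
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.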
# load the vectors for every word in the model
word2vec_vectors = model.wv.vectors
# Build a dict that maps each vector's row index to its word
ind_to_word = {model.wv.vocab[word].index : word for word in model.wv.vocab}
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=100, random_state=0).fit(word2vec_vectors)
kmeans.labels_
array([92, 4, 4, ..., 6, 6, 6])
# Build the cluster structure
# Dict{cluster_num : [word list]}
clusters = {}
for index, n in enumerate(kmeans.labels_):
    clusters.setdefault(n, []).append(ind_to_word[index])
for n, cluster in clusters.items():
    print("%s -> %s" % (n, cluster[:5]))
92 -> ['the', 'in', 'for', 'on', 'first']
4 -> ['of', 'and', 'with', 'making', 'permanent']
73 -> ['one', 'zero', 'nine', 'two', 'eight']
10 -> ['a', 'as', 'or', 'an', 'being']
38 -> ['to', 'can', 'may', 'would', 'will']
21 -> ['is', 'means', 'uses', 'includes', 'remains']
78 -> ['s', 'mark', 'don', 'brown', 'ray']
42 -> ['was', 'became', 'left', 'began', 'took']
77 -> ['by', 'under', 'initially', 'subsequently', 'jordan']
1 -> ['that', 'it', 'this', 'which', 'also']
20 -> ['are', 'use', 'include', 'related', 'believe']
84 -> ['from', 'according', 'addition', 'giving', 'dedicated']
97 -> ['his', 'he', 'her', 'him', 'she']
52 -> ['be', 'make', 'become', 'take', 'run']
79 -> ['at', 'city', 'home', 'near', 'center']
61 -> ['have', 'has', 'had', 'having']
93 -> ['were', 'war', 'against', 'military', 'force']
37 -> ['other', 'their', 'some', 'all', 'such']
96 -> ['its', 'power', 'due', 'control', 'development']
81 -> ['more', 'most', 'very', 'less', 'particularly']
51 -> ['been', 'made', 'led', 'developed', 'included']
99 -> ['used', 'known', 'called', 'found', 'considered']
72 -> ['there', 'held', 'created', 'produced', 'established']
68 -> ['american', 'john', 'james', 'william', 'david']
48 -> ['time', 'years', 'year', 'day', 'times']
32 -> ['see', 'links', 'external', 'list', 'information']
74 -> ['than', 'about', 'over', 'around', 'every']
26 -> ['world', 'states', 'u', 'national', 'group']
86 -> ['b', 'd', 'born', 'actor', 'author']
89 -> ['people', 'population', 'groups', 'living', 'species']
33 -> ['united', 'british', 'canada', 'spanish', 'australia']
56 -> ['system', 'computer', 'systems', 'data', 'standard']
64 -> ['state', 'law', 'court', 'rights', 'act']
94 -> ['history', 'century', 'modern', 'old', 'greek']
45 -> ['up', 'out', 'right', 'line', 'back']
90 -> ['english', 'name', 'language', 'term', 'word']
54 -> ['well', 'much', 'common', 'popular', 'important']
19 -> ['e', 'c', 'x', 'g', 't']
71 -> ['government', 'party', 'members', 'parliament', 'elected']
95 -> ['m', 'km', 'square', 'miles', 'feet']
66 -> ['university', 'school', 'college', 'education', 'medical']
67 -> ['life', 'work', 'way', 'view', 'nature']
34 -> ['like', 'black', 'white', 'red', 'blue']
30 -> ['including', 'best', 'famous', 'art', 'writers']
76 -> ['example', 'form', 'set', 'numbers', 'function']
14 -> ['french', 'german', 'italian', 'russian', 'prize']
49 -> ['general', 'president', 'former', 'leader', 'chief']
85 -> ['high', 'level', 'low', 'rate', 'higher']
87 -> ['based', 'originally', 'test', 'via', 'multi']
0 -> ['now', 'principal', 'cook', 'resident', 'venice']
47 -> ['de', 'l', 'al', 'la', 'san']
41 -> ['music', 'style', 'band', 'album', 'rock']
11 -> ['great', 'possibly', 'lives', 'bad', 'apparently']
27 -> ['south', 'north', 'area', 'west', 'east']
91 -> ['series', 'film', 'character', 'story', 'films']
31 -> ['game', 'player', 'games', 'team', 'play']
57 -> ['country', 'europe', 'france', 'england', 'germany']
80 -> ['king', 'ii', 'roman', 'empire', 'emperor']
9 -> ['book', 'works', 'published', 'books', 'text']
22 -> ['political', 'others', 'social', 'movement', 'anti']
25 -> ['church', 'god', 'christian', 'jewish', 'religious']
40 -> ['theory', 'science', 'natural', 'research', 'study']
55 -> ['using', 'single', 'type', 'lines', 'color']
70 -> ['human', 'cause', 'effects', 'health', 'blood']
5 -> ['point', 'field', 'above', 'position', 'range']
50 -> ['public', 'us', 'company', 'economic', 'production']
98 -> ['man', 'men', 'children', 'person', 'women']
69 -> ['york', 'london', 'california', 'county', 'founded']
62 -> ['house', 'official', 'member', 'minister', 'council']
24 -> ['water', 'energy', 'material', 'cell', 'chemical']
46 -> ['original', 'version', 'released', 'video', 'release']
53 -> ['air', 'service', 'fire', 'aircraft', 'nuclear']
65 -> ['space', 'earth', 'light', 'image', 'star']
88 -> ['said', 'claim', 'claims', 'stated', 'says']
18 -> ['along', 'across', 'ice', 'cold', 'upper']
58 -> ['show', 'television', 'uk', 'radio', 'live']
63 -> ['january', 'march', 'december', 'july', 'june']
59 -> ['areas', 'parts', 'cities', 'outside', 'currently']
15 -> ['terms', 'forms', 'cases', 'elements', 'events']
13 -> ['japanese', 'etc', 'except', 'unlike', 'respectively']
3 -> ['class', 'bond', 'composition', 'partial', 'representing']
7 -> ['whose', 'raised', 'executed', 'serving', 'attacked']
83 -> ['my', 'big', 'dead', 'cover', 'dark']
16 -> ['irish', 'becoming', 'historically', 'amongst', 'elsewhere']
29 -> ['action', 'defense', 'police', 'acts', 'intelligence']
17 -> ['food', 'animals', 'gold', 'animal', 'iron']
28 -> ['co', 'am', 'na', 'die', 'ch']
8 -> ['award', 'grand', 'race', 'fame', 'super']
6 -> ['previously', 'fourteen', 'handful', 'polls', 'fortunes']
75 -> ['cycle', 'formation', 'secondary', 'producing', 'naturally']
60 -> ['regarding', 'concerning', 'describing', 'aristotle', 'oral']
23 -> ['semi', 'showing', 'apart', 'gates', 'whilst']
35 -> ['honor', 'adam', 'abraham', 'muhammad', 'passage']
82 -> ['partially', 'locally', 'carefully', 'lacking', 'attraction']
12 -> ['moreover', 'consequently', 'besides', 'repeatedly', 'likewise']
36 -> ['independently', 'comprises', 'marking', 'excluding', 'astronomers']
39 -> ['associate', 'publishers', 'eds', 'chair', 'wesley']
44 -> ['thirteen', 'eighteen', 'culminating', 'seventy', 'narrowly']
2 -> ['ironically', 'nice', 'sixteen', 'stamp', 'nights']
43 -> ['onwards', 'popularly', 'atta', 'eighty', 'variously']
We also need a way to pick the best clusters, i.e. to judge whether a cluster is good or bad. We use the squared L2 distance between each vector in a cluster and the cluster's center vector.
Define the score
$$ score=\sum_{i\in Cluster}\|w_i-w_{center}\|_{2}^{2} $$
Hence, the lower the score, the better the cluster.
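As a small worked example of this score, the sketch below computes it for two made-up clusters. The `average` option is our own hypothetical addition: because the score is a sum, it tends to favor clusters with few members, and dividing by the cluster size is one possible way to compare clusters of different sizes.

```python
def centroid(members):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(members) for column in zip(*members)]

def cluster_score(members, center, average=False):
    """Sum (or mean, if average=True) of squared L2 distances to the center."""
    total = sum(
        sum((x - c) ** 2 for x, c in zip(vector, center))
        for vector in members
    )
    return total / len(members) if average else total

# Toy data: a small, loose cluster and a larger, tighter one
small_loose = [[0.0, 0.0], [2.0, 0.0]]
large_tight = [[10.0, 0.0]] * 6 + [[11.0, 0.0]] * 6

print(cluster_score(small_loose, centroid(small_loose)))                # prints: 2.0
print(cluster_score(large_tight, centroid(large_tight)))                # prints: 3.0
print(cluster_score(large_tight, centroid(large_tight), average=True))  # prints: 0.25
```

By the summed score the small cluster ranks better (2.0 < 3.0), although its members lie farther from their center on average (1.0 versus 0.25).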
for n, cluster in clusters.items():
    scores = 0
    for word in cluster:
        # the smaller the score, the better the cluster
        scores += sum(np.square(word2vec[word] - kmeans.cluster_centers_[n]))
    # Original clusters => Dict{cluster_num : [word list]}
    # Now it becomes Dict{cluster_num : ([word list], score)}
    clusters[n] = (clusters[n], scores)
sorted(clusters.items(), key = lambda x: x[1][1])[:5] # sorted by score, ascending
[(63,
(['january',
'march',
'december',
'july',
'june',
'november',
'april',
'september',
'august',
'october',
'february'],
165.74816360414897)),
(61, (['have', 'has', 'had', 'having'], 586.166584790995)),
(81,
(['more',
'most',
'very',
'less',
'particularly',
'too',
'highly',
'enough',
'relatively',
'quite',
'extremely',
'somewhat',
'increasingly',
'fairly'],
1517.4123485378213)),
(86,
(['b',
'd',
'born',
'actor',
'author',
'writer',
'singer',
'actress',
'composer',
'poet',
'musician',
'artist',
'politician',
'philosopher',
'mathematician',
'painter',
'journalist',
'footballer',
'novelist'],
2115.24780004326)),
(47,
(['de',
'l',
'al',
'la',
'san',
'paris',
'et',
'le',
'el',
'der',
'des',
'bwv',
'del',
'da',
'di',
'il',
'du',
'en',
'ma',
'santa',
'te',
'les',
'fran',
'ne',
'juan',
'und',
'sur'],
2692.603341114255))]
Based on the results above, we pick the four best clusters:
Cluster#63
=>{'january','march','december','july','june','november','april','september','august','october','february'}
- score = 165.75
- This cluster contains months
Cluster#61
=>{'have', 'has', 'had', 'having'}
- score = 586.17
- This cluster contains different forms of the verb "have"
Cluster#81
=>{'more','most','very','less','particularly','too','highly','enough','relatively','quite','extremely','somewhat','increasingly','fairly'}
- score = 1517.41
- These are adverbs of degree
Cluster#86
=>{'b','d','born','actor','author','writer','singer','actress','composer','poet','musician','artist','politician','philosopher','mathematician','painter','journalist','footballer','novelist'}
- score = 2115.25
- This cluster contains many occupations
Unless otherwise stated, articles on this blog may be freely reposted and quoted; when reposting, please cite the original source: http://www.yelbee.top/index.php/archives/186/