.
Import the required libraries:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
.
Download the pretrained Word2Vec GoogleNews 300 vectors using the Gensim downloader:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
print(vec_king.shape)
>>>>(300,)
.
Limit the vocabulary size to 50,000 words:
EMBEDDING_FILE = '/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'
word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True, limit=50000)
.
Compute the cosine similarity between two word vectors:
v_apple = word_vectors["apple"]
v_mango = word_vectors["mango"]
print(v_apple.shape)
print(v_mango.shape)
cosine_similarity([v_mango], [v_apple])
>>>>(300,)
(300,)
array([[0.57518554]], dtype=float32)
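To make the number above less opaque, here is a minimal sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The helper `cosine` and the toy 3-dimensional vectors are made up for illustration; they are not real embeddings and this is not how scikit-learn implements `cosine_similarity` internally.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, chosen only to show the extremes).
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # points in the same direction as u
w = [-3.0, 0.0, 1.0]   # orthogonal to u: the dot product is 0

print(cosine(u, v))    # close to 1: same direction
print(cosine(u, w))    # 0: unrelated directions
```

A value near 1 means the two vectors point in nearly the same direction, and a value near 0 means they are close to orthogonal, so the 0.575 for "apple" and "mango" indicates a moderate similarity.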
.
Unfortunately, the model cannot infer vectors for words that are not in its vocabulary. This is a known limitation of Word2Vec; if it matters for your use case, consider the FastText model, which builds word vectors from character n-grams.
.
try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")
>>>The word 'cameroon' does not appear in this model
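When processing a whole list of words, the same `KeyError` check can be wrapped in a small helper so that unknown words are skipped instead of stopping the script. The sketch below is one possible way to do this; `get_vector` is a hypothetical helper, and a plain dictionary with made-up values stands in for the loaded model so the example runs without downloading anything. It relies only on a missing key raising `KeyError`, as the lookup above does.

```python
def get_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is unknown."""
    try:
        return vectors[word]
    except KeyError:
        return default

# Stand-in for the loaded model (hypothetical 2-dimensional vectors).
toy_vectors = {"king": [0.1, 0.2], "queen": [0.2, 0.1]}

words = ["king", "cameroon", "queen"]
known = [w for w in words if get_vector(toy_vectors, w) is not None]
missing = [w for w in words if get_vector(toy_vectors, w) is None]

print("known:", known)
print("missing:", missing)
```

With the real model, `wv` would be passed in place of `toy_vectors`.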
.
Reference:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html