Vectorizing with Gensim & Visualizing with t-SNE

Training the Word2Vec Model

After preprocessing, we now have a list of parsed sentences and are ready to train the model.
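Gensim's Word2Vec expects sentences to be an iterable of tokenized sentences, i.e. lists of word strings. A hypothetical toy example of that shape (the real list comes from the preprocessing step, with lowercased, stemmed tokens):

# Hypothetical illustration of the expected input shape -- not the actual data
sentences_example = [
    ['with', 'all', 'this', 'stuff', 'go', 'down', 'at', 'the', 'moment'],
    ['the', 'movi', 'is', 'great'],
]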

Gensim

Word2Vec Model Parameters

  • Architecture: the options are skip-gram or CBOW. Skip-gram is slower but tends to produce better results. Note that Gensim's Word2Vec defaults to CBOW (sg=0), which is what the training log below shows; see the sketch right after this list for selecting the architecture explicitly.

  • Training algorithm: hierarchical softmax or negative sampling. Gensim's default is negative sampling (hs=0, negative=5), and the default works well here.

  • Downsampling of frequent words: the Google documentation recommends values between .00001 and .001. Here, values closer to 0.001 appeared to improve the accuracy of the final model.

  • Word vector dimensionality: more features are not always better, but they generally give a somewhat better model. Reasonable values run from the tens into the hundreds; we use 300 here.

  • Context / window size: how many words of context should the training algorithm take into account? Somewhat larger windows help with hierarchical softmax, but around 10 works well.

  • Worker threads: the number of parallel processes to run. This is machine-dependent, but 4 to 6 works on most systems.

  • Minimum word count: this helps limit the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. Reasonable values are between 10 and 100. In this competition each movie has 30 reviews, so we set the minimum word count to 40 to avoid attaching too much importance to individual movie titles. This leaves an overall vocabulary of roughly 12,000 words (11,986 in the log below). Higher values also help keep the run time down.
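The training log further below shows sg=0 hs=0 negative=5, i.e. CBOW with negative sampling, which are Gensim's defaults. If you wanted to train a skip-gram model explicitly instead (the architecture note above says it tends to give better results), a minimal sketch using Gensim 3.x parameter names and the same sentences list would be:

from gensim.models import word2vec

# Sketch only: explicit skip-gram (sg=1) with negative sampling (hs=0, negative=5).
# The run below keeps Gensim's defaults (CBOW with negative sampling).
sg_model = word2vec.Word2Vec(sentences,
                             sg=1, hs=0, negative=5,
                             workers=4, size=300, min_count=40,
                             window=10, sample=1e-3)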

import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s', 
    level=logging.INFO)
# Set the parameter values
num_features = 300    # word vector dimensionality
min_word_count = 40   # minimum word count
num_workers = 4       # number of worker threads to run in parallel
context = 10          # context window size
downsampling = 1e-3   # downsampling setting for frequent words

# Initialize and train the model
from gensim.models import word2vec

# Train the model (parameter names as in gensim 3.x)
model = word2vec.Word2Vec(sentences, 
                          workers=num_workers, 
                          size=num_features, 
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
model

2018-01-16 15:46:48,195 : INFO : 'pattern' package not found; tag filters are not available for English
2018-01-16 15:46:48,202 : INFO : collecting all words and their counts
2018-01-16 15:46:48,203 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-01-16 15:46:48,267 : INFO : PROGRESS: at sentence #10000, processed 225803 words, keeping 12465 word types
2018-01-16 15:46:48,338 : INFO : PROGRESS: at sentence #20000, processed 451892 words, keeping 17070 word types
2018-01-16 15:46:48,431 : INFO : PROGRESS: at sentence #30000, processed 671314 words, keeping 20370 word types
2018-01-16 15:46:48,523 : INFO : PROGRESS: at sentence #40000, processed 897814 words, keeping 23125 word types
2018-01-16 15:46:48,579 : INFO : PROGRESS: at sentence #50000, processed 1116962 words, keeping 25365 word types
2018-01-16 15:46:48,633 : INFO : PROGRESS: at sentence #60000, processed 1338403 words, keeping 27283 word types
2018-01-16 15:46:48,690 : INFO : PROGRESS: at sentence #70000, processed 1561579 words, keeping 29024 word types
2018-01-16 15:46:48,746 : INFO : PROGRESS: at sentence #80000, processed 1780886 words, keeping 30603 word types
2018-01-16 15:46:48,812 : INFO : PROGRESS: at sentence #90000, processed 2004995 words, keeping 32223 word types
2018-01-16 15:46:48,871 : INFO : PROGRESS: at sentence #100000, processed 2226966 words, keeping 33579 word types
2018-01-16 15:46:48,926 : INFO : PROGRESS: at sentence #110000, processed 2446580 words, keeping 34827 word types
2018-01-16 15:46:48,982 : INFO : PROGRESS: at sentence #120000, processed 2668775 words, keeping 36183 word types
2018-01-16 15:46:49,034 : INFO : PROGRESS: at sentence #130000, processed 2894303 words, keeping 37353 word types
2018-01-16 15:46:49,095 : INFO : PROGRESS: at sentence #140000, processed 3107005 words, keeping 38376 word types
2018-01-16 15:46:49,155 : INFO : PROGRESS: at sentence #150000, processed 3332627 words, keeping 39556 word types
2018-01-16 15:46:49,218 : INFO : PROGRESS: at sentence #160000, processed 3555315 words, keeping 40629 word types
2018-01-16 15:46:49,277 : INFO : PROGRESS: at sentence #170000, processed 3778655 words, keeping 41628 word types
2018-01-16 15:46:49,333 : INFO : PROGRESS: at sentence #180000, processed 3999236 words, keeping 42599 word types
2018-01-16 15:46:49,396 : INFO : PROGRESS: at sentence #190000, processed 4224449 words, keeping 43461 word types
2018-01-16 15:46:49,454 : INFO : PROGRESS: at sentence #200000, processed 4448603 words, keeping 44301 word types
2018-01-16 15:46:49,517 : INFO : PROGRESS: at sentence #210000, processed 4669967 words, keeping 45212 word types
2018-01-16 15:46:49,597 : INFO : PROGRESS: at sentence #220000, processed 4894968 words, keeping 46134 word types
2018-01-16 15:46:49,647 : INFO : PROGRESS: at sentence #230000, processed 5117545 words, keeping 46986 word types
2018-01-16 15:46:49,717 : INFO : PROGRESS: at sentence #240000, processed 5345050 words, keeping 47854 word types
2018-01-16 15:46:49,769 : INFO : PROGRESS: at sentence #250000, processed 5559165 words, keeping 48699 word types
2018-01-16 15:46:49,828 : INFO : PROGRESS: at sentence #260000, processed 5779146 words, keeping 49469 word types
2018-01-16 15:46:49,896 : INFO : PROGRESS: at sentence #270000, processed 6000435 words, keeping 50416 word types
2018-01-16 15:46:49,956 : INFO : PROGRESS: at sentence #280000, processed 6226314 words, keeping 51640 word types
2018-01-16 15:46:50,017 : INFO : PROGRESS: at sentence #290000, processed 6449474 words, keeping 52754 word types
2018-01-16 15:46:50,094 : INFO : PROGRESS: at sentence #300000, processed 6674077 words, keeping 53755 word types
2018-01-16 15:46:50,162 : INFO : PROGRESS: at sentence #310000, processed 6899391 words, keeping 54734 word types
2018-01-16 15:46:50,222 : INFO : PROGRESS: at sentence #320000, processed 7124278 words, keeping 55770 word types
2018-01-16 15:46:50,294 : INFO : PROGRESS: at sentence #330000, processed 7346021 words, keeping 56687 word types
2018-01-16 15:46:50,355 : INFO : PROGRESS: at sentence #340000, processed 7575533 words, keeping 57629 word types
2018-01-16 15:46:50,413 : INFO : PROGRESS: at sentence #350000, processed 7798803 words, keeping 58485 word types
2018-01-16 15:46:50,475 : INFO : PROGRESS: at sentence #360000, processed 8019466 words, keeping 59345 word types
2018-01-16 15:46:50,535 : INFO : PROGRESS: at sentence #370000, processed 8246654 words, keeping 60161 word types
2018-01-16 15:46:50,609 : INFO : PROGRESS: at sentence #380000, processed 8471801 words, keeping 61069 word types
2018-01-16 15:46:50,667 : INFO : PROGRESS: at sentence #390000, processed 8701551 words, keeping 61810 word types
2018-01-16 15:46:50,730 : INFO : PROGRESS: at sentence #400000, processed 8924500 words, keeping 62546 word types
2018-01-16 15:46:50,794 : INFO : PROGRESS: at sentence #410000, processed 9145850 words, keeping 63263 word types
2018-01-16 15:46:50,862 : INFO : PROGRESS: at sentence #420000, processed 9366930 words, keeping 64024 word types
2018-01-16 15:46:50,923 : INFO : PROGRESS: at sentence #430000, processed 9594467 words, keeping 64795 word types
2018-01-16 15:46:50,980 : INFO : PROGRESS: at sentence #440000, processed 9821218 words, keeping 65539 word types
2018-01-16 15:46:51,043 : INFO : PROGRESS: at sentence #450000, processed 10044980 words, keeping 66378 word types
2018-01-16 15:46:51,100 : INFO : PROGRESS: at sentence #460000, processed 10277740 words, keeping 67158 word types
2018-01-16 15:46:51,169 : INFO : PROGRESS: at sentence #470000, processed 10505665 words, keeping 67775 word types
2018-01-16 15:46:51,227 : INFO : PROGRESS: at sentence #480000, processed 10726049 words, keeping 68500 word types
2018-01-16 15:46:51,282 : INFO : PROGRESS: at sentence #490000, processed 10952793 words, keeping 69256 word types
2018-01-16 15:46:51,348 : INFO : PROGRESS: at sentence #500000, processed 11174449 words, keeping 69892 word types
2018-01-16 15:46:51,409 : INFO : PROGRESS: at sentence #510000, processed 11399724 words, keeping 70593 word types
2018-01-16 15:46:51,469 : INFO : PROGRESS: at sentence #520000, processed 11623075 words, keeping 71267 word types
2018-01-16 15:46:51,522 : INFO : PROGRESS: at sentence #530000, processed 11847473 words, keeping 71877 word types
2018-01-16 15:46:51,604 : INFO : PROGRESS: at sentence #540000, processed 12072088 words, keeping 72537 word types
2018-01-16 15:46:51,687 : INFO : PROGRESS: at sentence #550000, processed 12297639 words, keeping 73212 word types
2018-01-16 15:46:51,739 : INFO : PROGRESS: at sentence #560000, processed 12518929 words, keeping 73861 word types
2018-01-16 15:46:51,793 : INFO : PROGRESS: at sentence #570000, processed 12748076 words, keeping 74431 word types
2018-01-16 15:46:51,845 : INFO : PROGRESS: at sentence #580000, processed 12969572 words, keeping 75087 word types
2018-01-16 15:46:51,898 : INFO : PROGRESS: at sentence #590000, processed 13195097 words, keeping 75733 word types
2018-01-16 15:46:51,951 : INFO : PROGRESS: at sentence #600000, processed 13417295 words, keeping 76294 word types
2018-01-16 15:46:52,005 : INFO : PROGRESS: at sentence #610000, processed 13638318 words, keeping 76952 word types
2018-01-16 15:46:52,057 : INFO : PROGRESS: at sentence #620000, processed 13864643 words, keeping 77503 word types
2018-01-16 15:46:52,112 : INFO : PROGRESS: at sentence #630000, processed 14088929 words, keeping 78066 word types
2018-01-16 15:46:52,168 : INFO : PROGRESS: at sentence #640000, processed 14309712 words, keeping 78692 word types
2018-01-16 15:46:52,221 : INFO : PROGRESS: at sentence #650000, processed 14535468 words, keeping 79295 word types
2018-01-16 15:46:52,275 : INFO : PROGRESS: at sentence #660000, processed 14758258 words, keeping 79864 word types
2018-01-16 15:46:52,324 : INFO : PROGRESS: at sentence #670000, processed 14981651 words, keeping 80381 word types
2018-01-16 15:46:52,381 : INFO : PROGRESS: at sentence #680000, processed 15206483 words, keeping 80912 word types
2018-01-16 15:46:52,433 : INFO : PROGRESS: at sentence #690000, processed 15428676 words, keeping 81482 word types
2018-01-16 15:46:52,486 : INFO : PROGRESS: at sentence #700000, processed 15657382 words, keeping 82074 word types
2018-01-16 15:46:52,539 : INFO : PROGRESS: at sentence #710000, processed 15880371 words, keeping 82560 word types
2018-01-16 15:46:52,600 : INFO : PROGRESS: at sentence #720000, processed 16105658 words, keeping 83036 word types
2018-01-16 15:46:52,665 : INFO : PROGRESS: at sentence #730000, processed 16332039 words, keeping 83571 word types
2018-01-16 15:46:52,717 : INFO : PROGRESS: at sentence #740000, processed 16553072 words, keeping 84127 word types
2018-01-16 15:46:52,778 : INFO : PROGRESS: at sentence #750000, processed 16771399 words, keeping 84599 word types
2018-01-16 15:46:52,827 : INFO : PROGRESS: at sentence #760000, processed 16990803 words, keeping 85068 word types
2018-01-16 15:46:52,890 : INFO : PROGRESS: at sentence #770000, processed 17217940 words, keeping 85644 word types
2018-01-16 15:46:52,943 : INFO : PROGRESS: at sentence #780000, processed 17448086 words, keeping 86160 word types
2018-01-16 15:46:52,996 : INFO : PROGRESS: at sentence #790000, processed 17675162 words, keeping 86665 word types
2018-01-16 15:46:53,029 : INFO : collected 86996 word types from a corpus of 17798263 raw words and 795538 sentences
2018-01-16 15:46:53,030 : INFO : Loading a fresh vocabulary
2018-01-16 15:46:53,098 : INFO : min_count=40 retains 11986 unique words (13% of original 86996, drops 75010)
2018-01-16 15:46:53,099 : INFO : min_count=40 leaves 17434026 word corpus (97% of original 17798263, drops 364237)
2018-01-16 15:46:53,134 : INFO : deleting the raw counts dictionary of 86996 items
2018-01-16 15:46:53,137 : INFO : sample=0.001 downsamples 50 most-common words
2018-01-16 15:46:53,138 : INFO : downsampling leaves estimated 12872359 word corpus (73.8% of prior 17434026)
2018-01-16 15:46:53,139 : INFO : estimated required memory for 11986 words and 300 dimensions: 34759400 bytes
2018-01-16 15:46:53,190 : INFO : resetting layer weights
2018-01-16 15:46:53,382 : INFO : training model with 4 workers on 11986 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2018-01-16 15:46:54,392 : INFO : PROGRESS: at 1.11% examples, 714022 words/s, inqsize 8, outqsize 0
2018-01-16 15:46:55,393 : INFO : PROGRESS: at 2.18% examples, 699184 words/s, inqsize 7, outqsize 0
2018-01-16 15:46:56,394 : INFO : PROGRESS: at 3.29% examples, 699844 words/s, inqsize 8, outqsize 1
2018-01-16 15:46:57,396 : INFO : PROGRESS: at 4.40% examples, 701291 words/s, inqsize 8, outqsize 0
2018-01-16 15:46:58,413 : INFO : PROGRESS: at 5.36% examples, 683004 words/s, inqsize 6, outqsize 1
2018-01-16 15:46:59,429 : INFO : PROGRESS: at 6.20% examples, 656443 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:00,467 : INFO : PROGRESS: at 6.92% examples, 625459 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:01,478 : INFO : PROGRESS: at 7.49% examples, 592833 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:02,488 : INFO : PROGRESS: at 8.16% examples, 574609 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:03,515 : INFO : PROGRESS: at 8.77% examples, 555564 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:04,525 : INFO : PROGRESS: at 9.34% examples, 538246 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:05,538 : INFO : PROGRESS: at 9.98% examples, 527211 words/s, inqsize 8, outqsize 2
2018-01-16 15:47:06,543 : INFO : PROGRESS: at 10.72% examples, 523158 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:07,544 : INFO : PROGRESS: at 11.36% examples, 515741 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:08,555 : INFO : PROGRESS: at 12.01% examples, 509411 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:09,559 : INFO : PROGRESS: at 12.77% examples, 507715 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:10,567 : INFO : PROGRESS: at 13.38% examples, 501048 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:11,568 : INFO : PROGRESS: at 14.01% examples, 495678 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:12,574 : INFO : PROGRESS: at 14.81% examples, 496837 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:13,581 : INFO : PROGRESS: at 15.54% examples, 495309 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:14,583 : INFO : PROGRESS: at 16.22% examples, 492305 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:15,592 : INFO : PROGRESS: at 17.05% examples, 494007 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:16,624 : INFO : PROGRESS: at 17.69% examples, 490080 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:17,629 : INFO : PROGRESS: at 18.55% examples, 492393 words/s, inqsize 8, outqsize 1
2018-01-16 15:47:18,633 : INFO : PROGRESS: at 19.47% examples, 496255 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:19,638 : INFO : PROGRESS: at 20.57% examples, 504508 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:20,644 : INFO : PROGRESS: at 21.69% examples, 512113 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:21,652 : INFO : PROGRESS: at 22.81% examples, 519198 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:22,652 : INFO : PROGRESS: at 23.90% examples, 525130 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:23,666 : INFO : PROGRESS: at 25.00% examples, 530932 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:24,678 : INFO : PROGRESS: at 26.02% examples, 534740 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:25,689 : INFO : PROGRESS: at 27.00% examples, 537199 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:26,697 : INFO : PROGRESS: at 28.09% examples, 542190 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:27,719 : INFO : PROGRESS: at 29.20% examples, 546852 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:28,727 : INFO : PROGRESS: at 30.29% examples, 551282 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:29,727 : INFO : PROGRESS: at 31.39% examples, 555560 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:30,737 : INFO : PROGRESS: at 32.48% examples, 559490 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:31,740 : INFO : PROGRESS: at 33.39% examples, 560292 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:32,744 : INFO : PROGRESS: at 34.15% examples, 558488 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:33,753 : INFO : PROGRESS: at 34.72% examples, 553659 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:34,765 : INFO : PROGRESS: at 35.66% examples, 554765 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:35,770 : INFO : PROGRESS: at 36.54% examples, 554888 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:36,770 : INFO : PROGRESS: at 37.28% examples, 553069 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:37,772 : INFO : PROGRESS: at 38.27% examples, 555055 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:38,789 : INFO : PROGRESS: at 39.00% examples, 552807 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:39,794 : INFO : PROGRESS: at 39.42% examples, 546744 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:40,800 : INFO : PROGRESS: at 40.13% examples, 544896 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:41,811 : INFO : PROGRESS: at 40.79% examples, 542164 words/s, inqsize 7, outqsize 1
2018-01-16 15:47:42,815 : INFO : PROGRESS: at 41.80% examples, 544288 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:43,845 : INFO : PROGRESS: at 42.58% examples, 543053 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:44,855 : INFO : PROGRESS: at 43.47% examples, 543344 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:45,865 : INFO : PROGRESS: at 44.29% examples, 542931 words/s, inqsize 7, outqsize 1
2018-01-16 15:47:46,887 : INFO : PROGRESS: at 45.13% examples, 542536 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:47,902 : INFO : PROGRESS: at 45.88% examples, 541443 words/s, inqsize 8, outqsize 1
2018-01-16 15:47:48,907 : INFO : PROGRESS: at 46.84% examples, 542551 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:49,909 : INFO : PROGRESS: at 47.82% examples, 544043 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:50,913 : INFO : PROGRESS: at 48.64% examples, 543833 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:51,916 : INFO : PROGRESS: at 49.30% examples, 541785 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:52,917 : INFO : PROGRESS: at 50.25% examples, 542987 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:53,931 : INFO : PROGRESS: at 51.28% examples, 544859 words/s, inqsize 6, outqsize 1
2018-01-16 15:47:54,941 : INFO : PROGRESS: at 52.36% examples, 547415 words/s, inqsize 8, outqsize 1
2018-01-16 15:47:55,946 : INFO : PROGRESS: at 53.26% examples, 547859 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:56,951 : INFO : PROGRESS: at 53.92% examples, 545889 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:57,953 : INFO : PROGRESS: at 54.66% examples, 544813 words/s, inqsize 8, outqsize 0
2018-01-16 15:47:58,970 : INFO : PROGRESS: at 55.34% examples, 543078 words/s, inqsize 7, outqsize 0
2018-01-16 15:47:59,994 : INFO : PROGRESS: at 56.06% examples, 541667 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:01,081 : INFO : PROGRESS: at 56.81% examples, 540118 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:02,081 : INFO : PROGRESS: at 57.52% examples, 538876 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:03,133 : INFO : PROGRESS: at 58.04% examples, 535610 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:04,150 : INFO : PROGRESS: at 58.70% examples, 533930 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:05,159 : INFO : PROGRESS: at 59.38% examples, 532464 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:06,167 : INFO : PROGRESS: at 60.09% examples, 531451 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:07,169 : INFO : PROGRESS: at 60.66% examples, 529223 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:08,170 : INFO : PROGRESS: at 61.30% examples, 527637 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:09,190 : INFO : PROGRESS: at 62.16% examples, 527773 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:10,193 : INFO : PROGRESS: at 63.01% examples, 527940 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:11,194 : INFO : PROGRESS: at 63.82% examples, 527735 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:12,196 : INFO : PROGRESS: at 64.73% examples, 528454 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:13,205 : INFO : PROGRESS: at 65.71% examples, 529637 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:14,214 : INFO : PROGRESS: at 66.76% examples, 531234 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:15,218 : INFO : PROGRESS: at 67.66% examples, 531862 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:16,236 : INFO : PROGRESS: at 68.65% examples, 533082 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:17,243 : INFO : PROGRESS: at 69.61% examples, 534078 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:18,253 : INFO : PROGRESS: at 70.60% examples, 535208 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:19,259 : INFO : PROGRESS: at 71.52% examples, 535908 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:20,264 : INFO : PROGRESS: at 72.48% examples, 536858 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:21,276 : INFO : PROGRESS: at 73.49% examples, 538157 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:22,281 : INFO : PROGRESS: at 74.45% examples, 539053 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:23,291 : INFO : PROGRESS: at 75.40% examples, 539748 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:24,307 : INFO : PROGRESS: at 76.41% examples, 540864 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:25,312 : INFO : PROGRESS: at 77.07% examples, 539580 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:26,327 : INFO : PROGRESS: at 77.34% examples, 535557 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:27,334 : INFO : PROGRESS: at 77.75% examples, 532658 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:28,360 : INFO : PROGRESS: at 78.22% examples, 530101 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:29,367 : INFO : PROGRESS: at 78.79% examples, 528373 words/s, inqsize 7, outqsize 1
2018-01-16 15:48:30,379 : INFO : PROGRESS: at 79.37% examples, 526655 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:31,388 : INFO : PROGRESS: at 79.94% examples, 525064 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:32,423 : INFO : PROGRESS: at 80.43% examples, 522714 words/s, inqsize 7, outqsize 1
2018-01-16 15:48:33,470 : INFO : PROGRESS: at 81.03% examples, 521139 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:34,589 : INFO : PROGRESS: at 81.46% examples, 518088 words/s, inqsize 8, outqsize 3
2018-01-16 15:48:35,615 : INFO : PROGRESS: at 82.00% examples, 516208 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:36,651 : INFO : PROGRESS: at 82.48% examples, 514041 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:37,656 : INFO : PROGRESS: at 82.74% examples, 510681 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:38,674 : INFO : PROGRESS: at 83.25% examples, 508837 words/s, inqsize 7, outqsize 1
2018-01-16 15:48:39,712 : INFO : PROGRESS: at 83.68% examples, 506446 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:40,742 : INFO : PROGRESS: at 84.21% examples, 504744 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:41,782 : INFO : PROGRESS: at 84.79% examples, 503306 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:42,817 : INFO : PROGRESS: at 85.23% examples, 501112 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:43,820 : INFO : PROGRESS: at 85.70% examples, 499309 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:44,821 : INFO : PROGRESS: at 86.11% examples, 497219 words/s, inqsize 5, outqsize 2
2018-01-16 15:48:45,864 : INFO : PROGRESS: at 86.69% examples, 495810 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:46,892 : INFO : PROGRESS: at 87.00% examples, 493097 words/s, inqsize 5, outqsize 2
2018-01-16 15:48:47,905 : INFO : PROGRESS: at 87.40% examples, 491006 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:48,922 : INFO : PROGRESS: at 87.91% examples, 489499 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:49,937 : INFO : PROGRESS: at 88.35% examples, 487708 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:50,953 : INFO : PROGRESS: at 88.84% examples, 486196 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:51,976 : INFO : PROGRESS: at 89.28% examples, 484379 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:52,996 : INFO : PROGRESS: at 89.74% examples, 482782 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:53,997 : INFO : PROGRESS: at 90.46% examples, 482607 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:55,006 : INFO : PROGRESS: at 91.37% examples, 483410 words/s, inqsize 7, outqsize 0
2018-01-16 15:48:56,006 : INFO : PROGRESS: at 92.24% examples, 484117 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:57,034 : INFO : PROGRESS: at 92.97% examples, 483891 words/s, inqsize 6, outqsize 1
2018-01-16 15:48:58,041 : INFO : PROGRESS: at 93.95% examples, 485074 words/s, inqsize 8, outqsize 0
2018-01-16 15:48:59,043 : INFO : PROGRESS: at 94.86% examples, 485869 words/s, inqsize 7, outqsize 0
2018-01-16 15:49:00,048 : INFO : PROGRESS: at 95.78% examples, 486686 words/s, inqsize 6, outqsize 1
2018-01-16 15:49:01,052 : INFO : PROGRESS: at 96.60% examples, 486987 words/s, inqsize 8, outqsize 0
2018-01-16 15:49:02,232 : INFO : PROGRESS: at 97.03% examples, 484713 words/s, inqsize 7, outqsize 0
2018-01-16 15:49:03,236 : INFO : PROGRESS: at 97.56% examples, 483577 words/s, inqsize 7, outqsize 0
2018-01-16 15:49:04,257 : INFO : PROGRESS: at 97.99% examples, 481952 words/s, inqsize 7, outqsize 0
2018-01-16 15:49:05,266 : INFO : PROGRESS: at 98.59% examples, 481167 words/s, inqsize 8, outqsize 0
2018-01-16 15:49:06,281 : INFO : PROGRESS: at 99.00% examples, 479447 words/s, inqsize 7, outqsize 0
2018-01-16 15:49:07,319 : INFO : PROGRESS: at 99.49% examples, 478107 words/s, inqsize 6, outqsize 1
2018-01-16 15:49:08,337 : INFO : PROGRESS: at 99.97% examples, 476801 words/s, inqsize 3, outqsize 1
2018-01-16 15:49:08,340 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-01-16 15:49:08,366 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-01-16 15:49:08,371 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-01-16 15:49:08,383 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-01-16 15:49:08,384 : INFO : training on 88991315 raw words (64362493 effective words) took 135.0s, 476791 effective words/s

# Once training is complete, precompute the L2-normalized vectors and free the memory that is no longer needed
model.init_sims(replace=True)

model_name = '300features_40minwords_10text'
# model_name = '300features_50minwords_20text'
model.save(model_name)

2018-01-16 15:49:08,408 : INFO : precomputing L2-norms of word weight vectors
2018-01-16 15:49:08,588 : INFO : saving Word2Vec object under 300features_40minwords_10text, separately None
2018-01-16 15:49:08,589 : INFO : not storing attribute syn0norm
2018-01-16 15:49:08,592 : INFO : not storing attribute cum_table
2018-01-16 15:49:09,237 : INFO : saved 300features_40minwords_10text
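The saved file can be reloaded in a later session; a minimal sketch, assuming the same file name passed to model.save() above:

from gensim.models import word2vec

# Reload the trained model from disk
model = word2vec.Word2Vec.load('300features_40minwords_10text')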

Exploring the Model Results

# Find the word that does not belong with the others
model.wv.doesnt_match('man woman child kitchen'.split())

'kitchen'

model.wv.doesnt_match("france england germany berlin".split())

2018-01-16 15:49:09,270 : WARNING : vectors for words {'france', 'germany'} are not present in the model, ignoring these words

'berlin'
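doesnt_match returns the word whose vector is farthest from the mean of the group. To inspect pairwise similarities directly, the cosine similarity between two vocabulary words can be queried as well; a short sketch:

# Cosine similarity between two words in the vocabulary (sketch)
print(model.wv.similarity('man', 'woman'))    # expected to be relatively high
print(model.wv.similarity('man', 'kitchen'))  # expected to be lower, matching the doesnt_match result above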

# Find the most similar words
model.wv.most_similar("man")

[('woman', 0.6346021890640259),
('businessman', 0.5241227746009827),
('ladi', 0.5211358070373535),
('lad', 0.5165952444076538),
('millionair', 0.49945348501205444),
('farmer', 0.4661444127559662),
('men', 0.46612322330474854),
('boxer', 0.4650130271911621),
('widow', 0.4628654718399048),
('policeman', 0.4608388841152191)]

model.wv.most_similar("queen")

[('princess', 0.5980467200279236),
('goddess', 0.5464529991149902),
('victoria', 0.5335069894790649),
('seductress', 0.5329927802085876),
('latifah', 0.5239185690879822),
('stepmoth', 0.5189395546913147),
('bride', 0.5187833905220032),
('nun', 0.5187197327613831),
('maria', 0.5130434632301331),
('maid', 0.5088884234428406)]
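A classic sanity check for Word2Vec is the analogy king - man + woman ≈ queen. A sketch of that query with Gensim's API; since this corpus is stemmed and the vocabulary holds only about 12,000 words, the top result is not guaranteed to be 'queen':

# Analogy query: vector('king') - vector('man') + vector('woman'),
# assuming 'king' survived the min_count cutoff. Results on this small,
# stemmed corpus may differ from the textbook example.
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)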

# model.wv.most_similar("awful")
model.wv.most_similar("film")

[('movi', 0.8506779074668884),
('flick', 0.6001852750778198),
('documentari', 0.5914444327354431),
('pictur', 0.5582413673400879),
('cinema', 0.5267403721809387),
('sequel', 0.4961458146572113),
('masterpiec', 0.4919975996017456),
('it', 0.483654648065567),
('genr', 0.4749864637851715),
('effort', 0.4723205268383026)]

# model.wv.most_similar("happy")
model.wv.most_similar("happi") # stemming 처리 시 

[('unhappi', 0.42357802391052246),
('satisfi', 0.4157138466835022),
('sad', 0.407173216342926),
('glad', 0.3871932029724121),
('lucki', 0.3831247389316559),
('afraid', 0.37879762053489685),
('anxious', 0.36728110909461975),
('bitter', 0.3594135642051697),
('upset', 0.3593282699584961),
('joy', 0.3521401286125183)]
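Each vocabulary word maps to a single 300-dimensional vector that can also be read out directly; a short sketch (because init_sims(replace=True) was called above, the stored vectors are L2-normalized):

# Look up the raw 300-dimensional vector for a word (sketch)
vec = model.wv['happi']
print(vec.shape)  # (300,)
print(vec[:5])    # first few components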

Visualizing the Word2Vec Vectors with t-SNE

# Reference: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import gensim
import gensim.models as g

# Keep minus signs from rendering as broken glyphs in the plot
mpl.rcParams['axes.unicode_minus'] = False

model_name = '300features_40minwords_10text'
# Load the model saved above (Doc2Vec.load returns the stored Word2Vec object here)
model = g.Doc2Vec.load(model_name)

vocab = list(model.wv.vocab)
X = model[vocab]

print(len(X))
print(X[0][:10])
tsne = TSNE(n_components=2)

# Visualize only the first 100 words
X_tsne = tsne.fit_transform(X[:100,:])
# X_tsne = tsne.fit_transform(X)

2018-01-16 15:49:09,712 : INFO : loading Doc2Vec object from 300features_40minwords_10text
2018-01-16 15:49:09,977 : INFO : loading wv recursively from 300features_40minwords_10text.wv.* with mmap=None
2018-01-16 15:49:09,978 : INFO : setting ignored attribute syn0norm to None
2018-01-16 15:49:09,980 : INFO : setting ignored attribute cum_table to None
2018-01-16 15:49:09,981 : INFO : loaded 300features_40minwords_10text

11986
[ 0.08096986 -0.05534305 0.16620259 0.04356802 -0.00589749 -0.02075242
0.02598226 0.07634024 0.00546919 0.08736426]
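Note that t-SNE is stochastic, so the 2-D layout changes from run to run. If a reproducible embedding is wanted, the random seed can be fixed; a sketch (the perplexity value is illustrative, not tuned):

from sklearn.manifold import TSNE

# Fix the seed so the 2-D layout is reproducible across runs
tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
X_tsne = tsne.fit_transform(X[:100, :])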

df = pd.DataFrame(X_tsne, index=vocab[:100], columns=['x', 'y'])
df.shape

(100, 2)

df.head(10)
               x         y
with    3.399101 -2.212670
all    -4.684950  1.340348
this   -3.472565 -0.517563
stuff  -5.893897  0.499611
go     -4.214260 -6.399875
down    4.853176 -5.858385
at      2.976999 -5.427274
the    -4.934722  2.948591
moment -3.556630  4.883556
mj      5.769177  4.133462

fig = plt.figure()
fig.set_size_inches(40, 20)
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos, fontsize=30)
plt.show()

(Figure: t-SNE scatter plot of the first 100 word vectors, each point labeled with its word)
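To keep the figure around, it can also be written to disk; a sketch (the file name is illustrative, and in a script you would typically call this before plt.show()):

# Save the scatter plot to an image file in addition to displaying it
fig.savefig('tsne_word2vec_100words.png', dpi=100, bbox_inches='tight')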
