

Python Machine Learning: NLP with Word2vec for Movie Review Modeling


Overview

Starting today, we set off on a journey into natural language processing (NLP). NLP enables machines to process, understand, and use human language, building a bridge between machine language and human language.

Word Vectors

Let us first look at what a word vector actually is. When we hand text to an algorithm, the computer cannot understand the raw text we feed it; word vectors exist to solve exactly this problem. Simply put, a word vector represents a word as a vector of numbers.

When we describe a person, we use indicators such as height and weight, and these indicators can be treated as a vector. Once we have vectors, we can compute similarity between them in various ways, as the sketch below shows.
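
A minimal sketch of this idea, using cosine similarity on made-up height/weight vectors (the people and numbers are invented for illustration):

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Describe each person as a [height_cm, weight_kg] vector.
alice = np.array([170.0, 60.0])
bob = np.array([172.0, 62.0])
carol = np.array([150.0, 95.0])

print(cosine_similarity(alice, bob))    # close to 1: similar builds
print(cosine_similarity(alice, carol))  # noticeably smaller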

So how do we describe the features of language? We split the text into individual words and then build features at the word level; a tiny bag-of-words illustration follows.
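
A bag-of-words model simply counts how often each word occurs in each document, which is exactly what the preprocessing code later in this article does with CountVectorizer. Here is a tiny, self-contained illustration on two invented sentences:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was bad"]
vec = CountVectorizer()
features = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())  # ['bad' 'great' 'movie' 'the' 'was']
print(features)                     # [[0 1 1 1 1]
                                    #  [1 0 1 1 1]]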

Word Vector Dimensionality

The higher the dimensionality of a word vector, the more information it can carry and the more reliable the resulting computations tend to be. The trade-off is extra training and storage cost, and on a small corpus very high dimensions can hurt rather than help.
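
In gensim, the dimensionality is controlled by Word2Vec's vector_size parameter. A quick sketch on a made-up toy corpus (the sentences are invented for illustration):

from gensim.models.word2vec import Word2Vec

toy_corpus = [["the", "movie", "was", "great"],
              ["the", "film", "was", "terrible"]]

model_50 = Word2Vec(sentences=toy_corpus, vector_size=50, min_count=1)
model_300 = Word2Vec(sentences=toy_corpus, vector_size=300, min_count=1)

print(model_50.wv["movie"].shape)   # (50,)
print(model_300.wv["movie"].shape)  # (300,)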

A 50-dimensional word vector for a single word (the original figure is omitted here):

The same vectors rendered as a heat map (figure likewise omitted):

As the heat map shows, similar words have similar feature patterns, which suggests that the learned word features are meaningful.
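
A minimal sketch of how such a heat map can be drawn with matplotlib. The vectors below are random stand-ins; with a trained model you would pass model.wv[word] instead:

import numpy as np
from matplotlib import pyplot as plt

words = ["king", "queen", "man", "woman"]
vectors = np.random.rand(len(words), 50)  # stand-in for 50-d word vectors

plt.imshow(vectors, aspect="auto", cmap="coolwarm")
plt.yticks(range(len(words)), words)
plt.xlabel("dimension")
plt.colorbar()
plt.title("Word vectors as a heat map")
plt.show()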

Code Implementation

Preprocessing

import numpy as np
import pandas as pd
import itertools
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import nltk
# Stop words, one per line (header=None avoids the deprecated sep="\n" idiom)
stop_words = pd.read_csv("data/stopwords.txt", header=None, quoting=3, names=["stop_words"])
stop_words = [word.strip() for word in stop_words["stop_words"].values]
def load_train_data():
    """Load the labeled training corpus."""
    data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
    print(data[:5])
    print("Number of training reviews:", len(data))  # 25,000
    return data

def load_test_data():
    """Load the unlabeled corpus."""
    data = pd.read_csv("data/unlabeledTrainData.tsv", sep="\t", escapechar="\\")
    print("Number of test reviews:", len(data))  # 50,000
    return data
def pre_process(text):
    """Clean one review: strip HTML, keep letters only, lowercase, drop stop words."""
    # Strip HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Remove punctuation and digits, keeping letters only
    text = re.sub("[^a-zA-Z]", " ", text)
    # Tokenize
    words = text.lower().split()
    # Remove stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)
def split_train_data():
    # Load the cleaned training data
    data = pd.read_csv("data/train.csv")
    print(data.head())
    # Extract bag-of-words features
    vec = CountVectorizer(max_features=5000)
    # Fit the vocabulary
    vec.fit(data["review"])
    # Transform the reviews into count vectors
    train_data_features = vec.transform(data["review"]).toarray()
    print(train_data_features.shape)
    # The bag-of-words vocabulary
    print(vec.get_feature_names_out())
    # Split into training and validation sets
    X_train, X_test, y_train, y_test = train_test_split(
        train_data_features, data["sentiment"], test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test
def test():
    # Load the cleaned test data
    data = pd.read_csv("data/test.csv")
    print(data.head())
    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

    # Split one review into sentences
    def split_sentences(review):
        raw_sentences = tokenizer.tokenize(review.strip())
        return raw_sentences

    sentences = sum(data["review"][:10].apply(split_sentences), [])
    return sentences
def visualize(cm, classes, title="Confusion matrix", cmap=plt.cm.Blues):
    """Plot a confusion matrix."""
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Use half the maximum count as the threshold for readable text colors
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()
if __name__ == '__main__':
    # # Clean the training data
    # train_data = load_train_data()
    # train_data["review"] = train_data["review"].apply(pre_process)
    # print(train_data.head())
    #
    # # Save
    # train_data.to_csv("data/train.csv")

    # # Clean the test data
    # test_data = load_test_data()
    # test_data["review"] = test_data["review"].apply(pre_process)
    # print(test_data.head())
    #
    # # Save
    # test_data.to_csv("data/test.csv")
    split_train_data()
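
The split features can be fed to any scikit-learn classifier. The article does not show that step, so the following is a minimal sketch under that assumption, using logistic regression together with the visualize() helper defined above:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = split_train_data()

# Fit a simple baseline classifier on the bag-of-words features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Plot the confusion matrix with the helper defined above.
cm = confusion_matrix(y_test, clf.predict(X_test))
visualize(cm, classes=[0, 1])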

Main Program

import pandas as pd
import nltk
from gensim.models.word2vec import Word2Vec
def pre_process():
    """Tokenize the cleaned reviews."""
    # Load the cleaned data
    data = pd.read_csv("data/test.csv")
    print(data.head())
    # Collect the tokenized reviews
    result = []
    for line in data["review"]:
        result.append(nltk.word_tokenize(line))
    return result
def main():
    # Get the tokenized corpus
    word_list = pre_process()
    # Word2vec training parameters
    num_features = 300   # Word vector dimensionality
    min_word_count = 40  # Minimum word count
    num_workers = 4      # Number of threads to run in parallel
    context = 10         # Context window size
    model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
    # Train the Word2vec model
    model = Word2Vec(sentences=word_list, workers=num_workers,
                     vector_size=num_features, min_count=min_word_count,
                     window=context)
    # Save the model
    model.save(model_name)
def test():
    # Load the trained model
    model = Word2Vec.load("300features_40minwords_10context.model")
    # Find the word that does not belong with the others
    match = model.wv.doesnt_match(['man', 'woman', 'child', 'kitchen'])
    print(match)
    # Most similar words
    print(model.wv.most_similar("boy"))
    print(model.wv.most_similar("bad"))
if __name__ == '__main__':
    test()

Output:

2021-09-16 20:36:40.791181: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
   Unnamed: 0      id  sentiment                                             review
0           0  5814_8          1  stuff moment mj ve started listening music wat...
1           1  2381_9          1  classic war worlds timothy hines entertaining ...
2           2  7759_3          0  film starts manager nicholas bell investors ro...
3           3  3630_4          0  assumed praised film filmed opera didn read do...
4           4  9495_8          1  superbly trashy wondrously unpretentious explo...
73423
[token-id sequences for the first five reviews, zero-padded to a fixed length; the exact integer values are garbled in the original and are omitted here]
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]]
2021-09-16 20:36:46.488438: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-16 20:36:46.489070: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu
2021-09-16 20:36:46.489097: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-16 20:36:46.489128: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (313c6f2d15e2): /proc/driver/nvidia/version does not exist
2021-09-16 20:36:46.489488: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-16 20:36:46.493241: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 200)         14684800
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800
_________________________________________________________________
dropout (Dropout)            (None, 200)               0
_________________________________________________________________
dense (Dense)                (None, 64)                12864
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130
=================================================================
Total params: 15,018,594
Trainable params: 15,018,594
Non-trainable params: 0
_________________________________________________________________
None
2021-09-16 20:36:46.792534: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-16 20:36:46.830442: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/2
313/313 [==============================] - 101s 315ms/step - loss: 0.5581 - accuracy: 0.7229 - val_loss: 0.3703 - val_accuracy: 0.8486
Epoch 2/2
313/313 [==============================] - 98s 312ms/step - loss: 0.2174 - accuracy: 0.9195 - val_loss: 0.3016 - val_accuracy: 0.8822
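
Note that the log above shows a Keras LSTM classifier being trained on the tokenized reviews, a step whose code is not included in this article. Below is a minimal sketch consistent with the printed model summary; the layer sizes are read off the summary, while the dropout rate, activations, optimizer, and loss are assumptions:

from tensorflow.keras import layers, models

vocab_size = 73424   # 73,423 tokens plus one padding id, matching the printed count
embedding_dim = 200  # read off the Embedding layer's output shape

model = models.Sequential([
    layers.Input(shape=(None,)),                  # variable-length token-id sequences
    layers.Embedding(vocab_size, embedding_dim),  # 73424 * 200 = 14,684,800 params
    layers.LSTM(200),                             # 320,800 params
    layers.Dropout(0.5),                          # rate is an assumption
    layers.Dense(64, activation="relu"),          # 12,864 params
    layers.Dense(2, activation="softmax"),        # 130 params, for one-hot labels as above
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
print(model.summary())  # summary() prints the table and returns None, matching the log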

This concludes the detailed walkthrough of NLP movie review modeling with Word2vec in Python. For more material on natural language processing, see the other related articles on this site!
