Uncovering Insights: NLP Analysis of Amazon Reviews

A simple project to determine the sentiment of Amazon reviews.

In today's digital era, online shopping has become an integral part of our lives, and Amazon is one of the largest e-commerce platforms in the world. With millions of products and reviews from all kinds of customers, Amazon offers a treasure trove of valuable information. But have you ever wondered how companies make sense of all this customer feedback? How can they understand customer sentiment and preferences in order to improve their products and services?

Sentiment analysis, a subfield of NLP, lets us automatically determine the sentiment expressed in a piece of text: positive, negative or neutral. Using machine learning algorithms and linguistic techniques, sentiment analysis surfaces the feelings hidden in customer reviews, helping companies make data-driven decisions and improve customer service.

Importing the relevant libraries

#importing relevant libraries
import numpy as np
import pandas as pd 
from bs4 import BeautifulSoup
import re
import nltk
from nltk.util import ngrams
nltk.download('all')

#library for visualisation
import plotly.express as px
pd.options.plotting.backend = "plotly"

#mounting files
from google.colab import drive
drive.mount('/content/drive')

Understanding the data

Total number of reviews
% of reviews per star rating + visualisation
Distribution of character lengths + visualisation
Additional:

# quick information of dataset
pd.set_option('display.max_columns', None)
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/NLP_assignment/product_reviews.csv")
df.info()
df

Finding the total number of reviews

num_review = len(df.index)
print("The total number of reviews is: ",num_review)

Understanding the percentage of each star rating and displaying the rating distribution

#percentage of star reviews
star_percent = df.groupby('stars').size()/df['stars'].count()*100
print("The percentage of each star rating is:")
print(star_percent)
print()

#Displaying the rating distribution
df['stars'].plot(
    kind='hist',
    title='Review Rating Distribution')


# We can observe that the rating of stars in descending order is:
# 5 stars > 1 star > 4 stars > 2 stars > 3 stars
# 5 stars has the highest rating count (>300 reviews), which is 3 times higher than 1 star ratings

Understanding the distribution of review character lengths

#displaying distribution of character lengths for reviews
df['review_len'] = df['reviews'].astype(str).apply(len)
df['review_len'].plot(
    kind='hist',
    title='Distribution of review character lengths')


## From the histogram, we can understand that most customers (>100) leave reviews with character lengths
## within 80 - 99 characters

Text preprocessing

removing <br> tags
converting all letters to lowercase
removing all numbers
removing punctuation
removing whitespace
removing stop words
lemmatization
tokenization: unigrams, bigrams, trigrams

Increasing the column width to view the data

pd.set_option('display.max_colwidth', None)
df

Checking the dataset for missing and null values

#check for missing values:
for col in df.columns:
 print(col, df[col].isnull().sum())
#there are no missing values
#check for null values:
df.isnull().sum()

#there are no null values

Preprocessing the data for cleaning

For text preprocessing, I wrote a single function that takes the text from the reviews column as its argument. This simplifies the process and keeps the code efficient, since every cleaning step runs inside one function call per review instead of several separate for loops.

Steps in the text preprocessing:

Remove URLs; the dataset contains Amazon links.
Remove HTML tags; tags such as "<br />" are present.
Convert all letters to lowercase
Remove all numbers
Remove all punctuation
Remove stop words
Remove surplus whitespace in the sentences

#Pre-processing reviews into 'cleaned' column

from nltk.corpus import stopwords

# Define a preprocessing function that takes a string as input and returns the preprocessed string
def preprocess_text(text):
    # Remove URL
    text = re.sub(r"https?://\S+", " ", text)

    # Remove HTML tags
    text = re.sub(r"<[^>]*>", " ", text)

    #changing all letters to lowercase
    text = text.lower()

    #removing all numbers
    text = re.sub(r'\d+', ' ', text)

    #removing punctuation
    text = re.sub(r'[^\w\s]+', ' ', text)

    #removing stop words
    ## defining stop words (a set gives fast membership tests)
    stop_words = set(stopwords.words('english'))
    ### Split the text into words
    words = text.split()
    ### Filter the list of words to remove stop words
    filtered_words = [word for word in words if word not in stop_words]
    ### Join the filtered words back into a single string
    text = " ".join(filtered_words)
  
    #removing whitespace
    text = text.strip()
    
    return text

# Use the `apply` method to apply the preprocessing function to the "reviews" column and store the result in a new column named "cleaned"
df["cleaned"] = df["reviews"].apply(preprocess_text)

df.head()
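
To make the effect of each cleaning step easier to see, here is a minimal, self-contained sketch that runs the same sequence of regular expressions on one made-up review. The sample sentence, the function name `preprocess_demo` and the tiny `STOP_WORDS` set are assumptions for illustration only; the article itself uses NLTK's full English stop-word list.

```python
import re

# Tiny stand-in stop-word set (an assumption for this sketch;
# the article uses NLTK's English stop-word list instead).
STOP_WORDS = {"the", "is", "at", "it", "and", "a", "this", "i"}

def preprocess_demo(text):
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]*>", " ", text)        # remove HTML tags
    text = text.lower()                         # lowercase
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = re.sub(r"[^\w\s]+", " ", text)       # remove punctuation
    # split() also collapses the extra spaces left by the substitutions
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

# Made-up review, not taken from the dataset
sample = "I LOVE this coffee!<br />Bought 2 boxes at https://www.amazon.com/xyz and it is great."
print(preprocess_demo(sample))  # love coffee bought boxes great
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together, for example around the `<br />` tag.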

Lemmatization and tokenization

The preprocessed text is lemmatized and, at the same time, tokenized with split() into a new column named review_words.

This column holds the lemmatized, tokenized unigrams. Further columns are then built by joining these words into bigrams and trigrams for a better understanding of the data.

#lemmatization - normalising the text by converting to base form. Example, plays and play.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

# Use the `apply` method with a lambda function to lemmatize the words in each row of the "cleaned" column
df["review_words"] = df["cleaned"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x.split()])

df

Tokenizing the cleaned text into unigrams, bigrams and trigrams

Creating bigrams, to later find out which pair of words occurs most often across all reviews.

#tokenizing reviews into bigrams
df['review_bigram'] = df['review_words'].apply(lambda row: list(nltk.ngrams(row, 2)))
df

Creating trigrams, to later find out which three-word sequences occur most often across all reviews.

#tokenizing reviews into trigrams
df['review_trigram'] = df['review_words'].apply(lambda row: list(nltk.ngrams(row, 3)))
df
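
As a rough illustration of what the n-gram step produces, the sketch below builds bigrams and trigrams with a plain sliding window. The helper `make_ngrams` and the token list are hypothetical; the article itself relies on `nltk.ngrams`, which yields the same kind of tuples.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (sliding window)."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up token list standing in for one row of the review_words column
tokens = ["french", "vanilla", "cappuccino", "k", "cup"]

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
# Joining with "_" turns each tuple into a single word-cloud-friendly token
print(["_".join(t) for t in trigrams])
```

A review with fewer than n tokens simply yields an empty list, so very short reviews contribute no bigrams or trigrams.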

Understanding the data

Top 20 unigrams among the reviews
Top 20 bigrams among the reviews
Top 20 trigrams among the reviews

Top unigrams found in the reviews:

First, I want to see which terms occur most often in the reviews, using a word cloud.

Then I want to dig deeper and look at the frequencies of the common terms shown in the word cloud.

#most common words (unigram)

#wordcloud
from wordcloud import WordCloud
import collections
import matplotlib.pyplot as plt
from collections import Counter

# Use the `Counter` class to count the frequency of each word in the "review_words" column
counter = collections.Counter([word for row in df["review_words"] for word in row])

# Create a WordCloud object with the word frequencies
wordcloud = WordCloud(max_words=50).generate_from_frequencies(counter)

# Plot the word cloud
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()




#further break down
print("\n\n\n\n\n Let's break down the top 20 words and understand how many times these words occur \n\n\n\n\n")

value = []
for sentence in df["review_words"]:
  for word in sentence:
    value.append(word)

print(len(value))  # total number of tokens across all reviews
# learning more about the words in the word cloud such as how many words of it are found within the cleaned text
# Use the `Counter` class to count the number of occurrences of each word in the list
counter = Counter(value)
# Use the `most_common` method on the `Counter` object to get the top `n` words from the list
top_n_words = counter.most_common(20)
words, frequencies = zip(*top_n_words)
plt.bar(words, frequencies)

# Add labels and title
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Most Frequent Words')
plt.xticks(rotation=90)

Assessment: among the top 20 unigrams, the 3 most frequent are "coffee", "cup" and "cappuccino", which occur more than 400, 350 and 300 times respectively.

This suggests that most people are commenting on coffee, cappuccino and cups, or possibly K-Cups, which are one of the products.

Let's dig further by looking at bigrams to see better word associations.

Top bigrams in the reviews:

Second, I want to see which bigrams occur most often among the reviews, using a word cloud.

To achieve this, I used "_" as a separator so that the words of each bigram are joined into a single token. I then apply the same technique as for the unigram word cloud to show the most common bigrams.

Then I want to dig deeper and look at the frequencies of the common bigrams shown in the word cloud.

df["sep_review_bigram"] = df["review_bigram"].apply(lambda x: ["_".join(a) for a in x])
df


#most common words (bigram)

#wordcloud
from wordcloud import WordCloud
import collections
import matplotlib.pyplot as plt

# Use the `Counter` class to count the frequency of each bigram in the "sep_review_bigram" column
counter = collections.Counter([word for row in df["sep_review_bigram"] for word in row])

# Create a WordCloud object with the word frequencies
wordcloud = WordCloud(max_words=50).generate_from_frequencies(counter)

# Plot the word cloud
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

print("\n\n\n\n\n Let's break down the top 20 bigrams and understand how many times these words occur \n\n\n\n\n")

value2 = []
for sentence in df["sep_review_bigram"]:
  for bigram in sentence:
    value2.append(bigram)


# learning more about the words in the word cloud such as how many words of it are found within the cleaned text
# Use the `Counter` class to count the number of occurrences of each word in the list
counter = Counter(value2)
# Use the `most_common` method on the `Counter` object to get the top `n` words from the list
top_bi_words = counter.most_common(20)
bi_words, bi_frequencies = zip(*top_bi_words)
plt.bar(bi_words, bi_frequencies)

# Add labels and title
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Most Frequent Bigrams')
plt.xticks(rotation=90)

Assessment: among the top 20 bigrams, the 3 most frequent are "k_cup", "gas_station" and "french_vanilla", with roughly 170, 80 and 70 occurrences.

This means that people mention the K-Cups product considerably more often than any other bigram. Next comes gas_station, which may refer to an outlet selling the product. Finally there is french_vanilla, which may refer to a flavour of the product.

Top trigrams

Third, I want to see which trigrams occur most often among the reviews, using a word cloud.

To achieve this, I used "_" as a separator so that the words of each trigram are joined into a single token. I then apply the same technique as for the unigram word cloud to show the most common trigrams.

Then I want to dig deeper and look at the frequencies of the common trigrams shown in the word cloud.

This gives me more information than the top unigrams and bigrams alone, because seeing more words linked together helps me understand which terms are most popular among customers.

#top trigram gram words (visualisation)
df["sep_review_trigram"] = df["review_trigram"].apply(lambda x: ["_".join(a) for a in x])
df


#most common words (trigram)

#wordcloud
from wordcloud import WordCloud
import collections
import matplotlib.pyplot as plt

# Use the `Counter` class to count the frequency of each word in the "tokens" column
counter = collections.Counter([word for row in df["sep_review_trigram"] for word in row])

# Create a WordCloud object with the word frequencies
wordcloud = WordCloud(max_words=50).generate_from_frequencies(counter)

# Plot the word cloud
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
print("\n\n\n\n\n Let's break down the top 20 trigrams and understand how many times these words occur \n\n\n\n\n")

value3 = []
for sentence in df["sep_review_trigram"]:
  for trigram in sentence:
    value3.append(trigram)


# learning more about the words in the word cloud such as how many words of it are found within the cleaned text
# Use the `Counter` class to count the number of occurrences of each word in the list
counter = Counter(value3)
# Use the `most_common` method on the `Counter` object to get the top `n` words from the list
top_tri_words = counter.most_common(20)
tri_words, tri_frequencies = zip(*top_tri_words)
plt.bar(tri_words, tri_frequencies)

# Add labels and title
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Most Frequent Trigrams')
plt.xticks(rotation=90)

Assessment:

I can see that the more words are linked together (trigrams), the less frequent each combination becomes.

Among the top 20 trigrams, the 3 most frequent are "french_vanilla_cappuccino", "gas_station_cappuccino" and "grove_square_cappuccino", with roughly 25, 21 and 12 occurrences.

This suggests that the most popular flavour is French vanilla, followed by the gas station and Grove Square cappuccinos.

Saving the cleaned dataset

Dropping all columns except user_id, stars and review_words, since review_words is already preprocessed, lemmatized and tokenized.

#dropping columns that are not required for text classification
df = df.drop(columns=["reviews", "review_len", "cleaned", "review_bigram",
                      "review_trigram", "sep_review_bigram", "sep_review_trigram"])
df
#exporting the cleansed dataset
df.to_csv("/content/drive/MyDrive/Colab Notebooks/NLP_assignment/cleaned_product_review.csv")

Removing 3-star reviews

#removing reviews which stars are 3
df = df[df["stars"] != 3].copy()  # copy so new columns are added to an independent DataFrame, not a view
df

Creating a new sentiment column
Labelling 1- and 2-star reviews as negative and 4- and 5-star reviews as positive

#creating sentiment column 
df["sentiment"] = ''

#declaring if review is positive or negative sentiment
df['sentiment'] = np.where(df['stars'] >= 4, "positive", df['sentiment'])
df['sentiment'] = np.where(df['stars'] <= 2, "negative", df['sentiment'])

df

Dropping the stars column

#dropping "stars" column
df = df.drop("stars", axis=1)
df
#finding the total number of each sentiment, both positive and negative for all reviews
print('Total number of reviews: ', len(df.index))
print('Total number of positive reviews: ',df.loc[df.sentiment == 'positive', 'sentiment'].count())
print('Total number of negative reviews: ', df.loc[df.sentiment == 'negative', 'sentiment'].count())


#we can see that there are more positive reviews than negative reviews

Splitting the data into training and test sets

Splitting the dataset into training and test sets for the model

#Split the data into training and test sets
x = df['review_words'].apply(" ".join)  # join tokens back into strings; CountVectorizer expects raw text, not lists
y = df['sentiment']


from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

print("The size of original dataset: ", x.shape)
print("The size of training dataset: ", X_train.shape)
print("The size of test dataset: ", X_test.shape)

Feature engineering

Using a count vectorizer for feature extraction

# Feature Engineering - Conducting Feature Extraction using count vector
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = "english")
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

print("The dimension of the training set: ", X_train_cv.toarray().shape)
print("The dimension of the test set: ", X_test_cv.toarray().shape)
print("The features : \n", cv.get_feature_names_out())
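
To show what a count vectorizer does conceptually, here is a minimal bag-of-words sketch in plain Python. The function names `fit_vocabulary` and `transform` and the toy documents are assumptions for illustration. scikit-learn's `CountVectorizer` additionally lowercases the text, applies its own token pattern and, as configured above, removes English stop words, so this is a simplification rather than a re-implementation.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the training documents to a column index (alphabetical)."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Turn each document into a row of token counts, one column per vocabulary entry."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Tokens that are not in the training vocabulary are silently ignored
        rows.append([counts[tok] for tok in vocab])
    return rows

# Toy documents, not taken from the dataset
train_docs = ["love coffee love", "bad coffee"]
test_docs = ["love tea"]

vocab = fit_vocabulary(train_docs)
print(vocab)                        # {'bad': 0, 'coffee': 1, 'love': 2}
print(transform(train_docs, vocab)) # [[0, 1, 2], [1, 1, 0]]
print(transform(test_docs, vocab))  # [[0, 0, 1]]
```

The vocabulary is learned from the training documents only, which is why the article calls `fit_transform` on the training set but just `transform` on the test set.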

Text classification system A: logistic regression

Building a logistic regression model for text classification

#Training the Logistic Regression Model - fit logistic regression model on training data and apply the model to test data.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = "lbfgs")
lr.fit(X_train_cv, Y_train)
y_pred_cv = lr.predict(X_test_cv)
y_pred_cv

Comparing the test labels with the predicted labels

print(list(Y_test[:10]))
print(list(y_pred_cv[:10]))

Displaying the confusion matrix

from sklearn.metrics import confusion_matrix
lr_cm = confusion_matrix(Y_test, y_pred_cv)
lr_cm

Displaying the classification report to understand:

precision
recall
F1-score

from sklearn.metrics import classification_report
lr_report = classification_report(Y_test, y_pred_cv)
print(lr_report)
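
For readers who want to see how the numbers in a classification report relate to the confusion matrix, the following sketch computes precision, recall, F1-score and accuracy for one class by hand. The helper `binary_metrics` and the toy labels are hypothetical and are not results from the article's models.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute precision, recall, F1 and accuracy, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels for illustration only
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because the dataset has more positive than negative reviews, it is worth reading the per-class precision and recall in the report and not only the overall accuracy.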

Saving the LR model and the count vectorizer

import pickle
from datetime import datetime
import os


lr_file = '/content/drive/MyDrive/Colab Notebooks/NLP_assignment/Models/lr-2022-12-10.pkl'
cv_file = '/content/drive/MyDrive/Colab Notebooks/NLP_assignment/Models/cv-2022-12-10.pkl'

with open(lr_file, 'wb') as f1:
  pickle.dump(lr,f1)

with open(cv_file, 'wb') as f2:
  pickle.dump(cv,f2)

Using the logistic regression model

Using the logistic regression model to predict sentiment

#Loading the saved LR model
import os
import pickle

#open the trained logistic regression model
filename = ['lr-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']
path1 = os.sep.join(model_path + filename)
with open(path1, 'rb') as f:
  model = pickle.load(f)

#open the vectorizer used to encode the training set
filename = ['cv-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']


#loading the count vectorizer
path2 = os.sep.join(model_path + filename)
with open(path2, 'rb') as f:
  trained_lr_cv = pickle.load(f)
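
The save-and-load pattern above can be checked in isolation with a small round trip. This sketch uses an in-memory buffer and a plain dictionary as a stand-in for the model and vectorizer; both are assumptions made so the example runs without Google Drive or scikit-learn.

```python
import io
import pickle

# Stand-in object; in the article this would be the fitted model or vectorizer
model_like = {"vocabulary": {"coffee": 0, "love": 1}, "weights": [0.4, 1.2]}

buffer = io.BytesIO()            # in-memory replacement for the .pkl file
pickle.dump(model_like, buffer)  # serialize
buffer.seek(0)                   # rewind before reading
restored = pickle.load(buffer)   # deserialize

print(restored == model_like)    # True
```

Two practical points follow from this: the vectorizer loaded at prediction time must be the same fitted object that was used for training, and pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.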

Text preprocessing function

#pre-process new text
# remove punctuation, remove tokens containing digits, change text to lower case

import re
import string
from nltk.corpus import stopwords

def preprocess(text):
  pattern_alphanumeric = r"\w*\d\w*"  # any token that contains a digit
  pattern_punctuation = "[" + re.escape(string.punctuation) + "]"


  text = re.sub(pattern_alphanumeric, '', text)
  text = re.sub(pattern_punctuation, '', text).lower()
  
  #removing stop words
  ## defining stop words
  stop_words = stopwords.words('english')
  ### Split the text into words
  words = text.split()
  ### Filter the list of words to remove stop words
  filtered_words = [word for word in words if word not in stop_words]
  ### Join the filtered words back into a single string
  text = " ".join(filtered_words)


  return text

Testing the model on unseen data

# insert the new text to the preprocess function

# fake reviews for testing are in sequence of: negative, positive, positive, positive negative
LR_texts = ["I strongly dislike the aftertaste of k-cup. It has a strong taste of bitterness. Please add more sugar!", "The cappucinno works considerably well. It mildly improves my alertness by a lot",
            "This product is amazing!! It helps me feel much better every morning!", "BEST COFFEE EVER! i love this coffee and I highly recommend this to my coffee lovers out there! No regrets.",
            'I hate k-cup, it tastes horrible! This product is really bad']

new_text_processed = []
for i in LR_texts:
  process_text = preprocess(i)
  new_text_processed.append(process_text)

new_text_processed

Applying the count vectorizer to the unseen data

def encode_text_to_vector(cv, text):
  text_vector = cv.transform([text])
  return text_vector

new_text_vector= []
for text in new_text_processed:
  new_text = encode_text_to_vector(trained_lr_cv , text)
  new_text_vector.append(new_text)

Using the logistic regression model to classify each review as positive or negative:

Displaying the text, the preprocessed text and the model's prediction

LR_predicted_label = [] 
for text_vector in new_text_vector:
  LR_predicted = (model.predict(text_vector))[0]
  LR_predicted_label.append(LR_predicted)

log_reg = {"text": LR_texts,
           "processed" : new_text_processed,
           "prediction" : LR_predicted_label}


for idx, i in enumerate(log_reg["text"]):
  print("Text: {} \nAfter processing: {} \nPrediction: {} \n".format(log_reg["text"][idx], log_reg["processed"][idx], log_reg["prediction"][idx] ))

Text classification system B: naive Bayes

Building a naive Bayes model for text classification

# Create a text classification system B using naive bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb.fit(X_train_cv, Y_train)
y_pred_cv = nb.predict(X_test_cv)
y_pred_cv

Comparing the test labels with the predicted labels

#comparing the results of the test data with the predicted data
print(list(Y_test[:20]))
print(list(y_pred_cv[:20]))

Displaying the confusion matrix for the model

from sklearn.metrics import confusion_matrix
nb_cm = confusion_matrix(Y_test, y_pred_cv)
nb_cm

Displaying the classification report for the model

from sklearn.metrics import classification_report
nb_report = classification_report(Y_test, y_pred_cv)
print(nb_report)

Saving the trained naive Bayes model

#saving the model

# the count vectorizer was already saved above (cv-2022-12-10.pkl); it retains the vocab list
# and other settings needed to get the features.
# new text will have to be transformed through the count vectorizer

# here we only need to save the NB model. This will be used for prediction


import pickle
from datetime import datetime
import os

nb_file = '/content/drive/MyDrive/Colab Notebooks/NLP_assignment/Models/nb-2022-12-10.pkl'

with open(nb_file, 'wb') as f1:
  pickle.dump(nb,f1)

Loading the trained naive Bayes model

#Loading the saved nb model
import os
import pickle


filename = ['nb-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']
path1 = os.sep.join(model_path + filename)
with open(path1, 'rb') as f:
  model = pickle.load(f)

#load the vectorizer used to encode the training set
filename = ['cv-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']


path2 = os.sep.join(model_path + filename)
with open(path2, 'rb') as f:
  trained_nb_cv = pickle.load(f)

Creating a text preprocessing function for the naive Bayes model

def nb_preprocess(text):
  pattern_alphanumeric = r"\w*\d\w*"  # any token that contains a digit
  pattern_punctuation = "[" + re.escape(string.punctuation) + "]"


  text = re.sub(pattern_alphanumeric, '', text)
  text = re.sub(pattern_punctuation, '', text).lower()



  #removing stop words
  ## defining stop words
  stop_words = stopwords.words('english')
  ### Split the text into words
  words = text.split()
  ### Filter the list of words to remove stop words
  filtered_words = [word for word in words if word not in stop_words]
  ### Join the filtered words back into a single string
  text = " ".join(filtered_words)


  return text

# insert the new text to the preprocess function

# fake reviews for testing are in sequence of: negative, positive, positive, positive negative
NB_texts = ["I strongly dislike the aftertaste of k-cup. It has a strong taste of bitterness. Please add more sugar!", "The cappucinno works considerably well. It mildly improves my alertness by a lot",
            "This product is amazing!! It helps me feel much better every morning!", "BEST COFFEE EVER! i love this coffee and I highly recommend this to my coffee lovers out there! No regrets.",
            'I hate k-cup, it tastes horrible! This product is really bad']
nb_text_processed = []

for text in NB_texts:
  nb_process_text = nb_preprocess(text)
  nb_text_processed.append(nb_process_text)
nb_text_processed

Applying the count vectorizer to the new data

#function to apply count vector
def encode_text_to_vector(cv, text):
  text_vector = cv.transform([text])
  return text_vector

# count vectorizing the preprocessed words and appending them to a new list
nb_text_vector= []
for text in nb_text_processed:
  new_text = encode_text_to_vector(trained_nb_cv , text)
  nb_text_vector.append(new_text)

Using the naive Bayes model to classify each review as positive or negative:

Displaying the text, the preprocessed text and the model's prediction

NB_predicted_label = [] 
for text_vector in nb_text_vector:
  NB_predicted = (model.predict(text_vector))[0]
  NB_predicted_label.append(NB_predicted)

naive_bayes  = {"text": NB_texts,
           "processed" : nb_text_processed,
           "prediction" : NB_predicted_label}


for idx, i in enumerate(naive_bayes["text"]):
  print("Text: {} \nAfter processing: {} \nPrediction: {} \n".format(naive_bayes["text"][idx], naive_bayes["processed"][idx], naive_bayes["prediction"][idx] ))

Text classification system C: random forest

Building the random forest model

# Create a text classification system C using random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_cv, Y_train)
y_pred_cv = rf.predict(X_test_cv)
y_pred_cv

Comparing the test labels with the predicted labels

#comparing the results of the test data with the predicted data
print(list(Y_test[:10]))
print(list(y_pred_cv[:10]))

Displaying the confusion matrix for the random forest model

from sklearn.metrics import confusion_matrix
rf_cm = confusion_matrix(Y_test, y_pred_cv)
rf_cm

Displaying the classification report for the random forest model

from sklearn.metrics import classification_report
rf_report = classification_report(Y_test, y_pred_cv)
print(rf_report)

Saving the trained random forest model

#saving the model

# the count vectorizer was already saved above (cv-2022-12-10.pkl); it retains the vocab list
# and other settings needed to get the features.
# new text will have to be transformed through the count vectorizer

# here we only need to save the RF model. This will be used for prediction


import pickle
from datetime import datetime
import os

rf_file = '/content/drive/MyDrive/Colab Notebooks/NLP_assignment/Models/rf-2022-12-10.pkl'

with open(rf_file, 'wb') as f1:
  pickle.dump(rf,f1)

Loading the saved random forest model

#Loading the saved rf model
filename = ['rf-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']
path1 = os.sep.join(model_path + filename)
with open(path1, 'rb') as f:
  model = pickle.load(f)

#load the vectorizer used to encode the training set
filename = ['cv-2022-12-10.pkl']
model_path = ['drive', 'MyDrive', 'Colab Notebooks', 'NLP_assignment', 'Models']


path2 = os.sep.join(model_path + filename)
with open(path2, 'rb') as f:
  trained_rf_cv = pickle.load(f)

Creating a text preprocessing function for the random forest model

def rf_preprocess(text):
  pattern_alphanumeric = r"\w*\d\w*"  # any token that contains a digit
  pattern_punctuation = "[" + re.escape(string.punctuation) + "]"


  text = re.sub(pattern_alphanumeric, '', text)
  text = re.sub(pattern_punctuation, '', text).lower()



  #removing stop words
  ## defining stop words
  stop_words = stopwords.words('english')
  ### Split the text into words
  words = text.split()
  ### Filter the list of words to remove stop words
  filtered_words = [word for word in words if word not in stop_words]
  ### Join the filtered words back into a single string
  text = " ".join(filtered_words)


  return text

Preparing a list of unseen data to test the model

# insert the new text to the preprocess function
# fake reviews for testing are in sequence of: negative, positive, positive, positive negative
RF_texts = ["I strongly dislike the aftertaste of k-cup. It has a strong taste of bitterness. Please add more sugar!", "The cappucinno works considerably well. It mildly improves my alertness by a lot",
            "This product is amazing!! It helps me feel much better every morning!", "BEST COFFEE EVER! i love this coffee and I highly recommend this to my coffee lovers out there! No regrets.",
            'I hate k-cup, it tastes horrible! This product is really bad']

rf_text_processed = []

for text in RF_texts:
  rf_process_text = rf_preprocess(text)
  rf_text_processed.append(rf_process_text)

Creating a function to apply count vectorization to the data

def encode_text_to_vector(cv, text):
  text_vector = cv.transform([text])
  return text_vector

rf_text_vector= []
for text in rf_text_processed:
  new_text = encode_text_to_vector(trained_rf_cv , text)
  rf_text_vector.append(new_text)

Using the random forest model to classify each review as positive or negative:

Displaying the text, the preprocessed text and the model's prediction

RF_predicted_label = [] 
for text_vector in rf_text_vector:
  RF_predicted = (model.predict(text_vector))[0]
  RF_predicted_label.append(RF_predicted)

random_forest = {"text": RF_texts,
           "processed" : rf_text_processed,
           "prediction" : RF_predicted_label}


for idx, i in enumerate(random_forest["text"]):
  print("Text: {} \nAfter processing: {} \nPrediction: {} \n".format(random_forest["text"][idx], random_forest["processed"][idx], random_forest["prediction"][idx] ))

Overall evaluation of the text classification models' performance

#Logistic Regression Evaluation
print("Model A: Logistic Regression\n\n=======================\n Confusion matrix : \n\n{} \n\n\n=======================\n Classification Report:\n\n{}".format(lr_cm, lr_report))
print("Sequence of randomly generated reviews are classified as:\nnegative\npositive\npositive\npositive\nnegative\n ")
print("The model managed to accurately predict: ")
for i in log_reg["prediction"]:
  print(i)

#Naive Bayes Evaluation
print("Model B: Naive Bayes\n\n=======================\n Confusion matrix : \n\n{} \n\n\n=======================\n Classification Report:\n\n{}".format(nb_cm, nb_report))
print("Sequence of randomly generated reviews are classified as:\nnegative\npositive\npositive\npositive\nnegative\n ")
print("The model managed to accurately predict: ")
for i in naive_bayes["prediction"]:
  print(i)

#Random Forest Evaluation
print("Model C: Random Forest\n\n=======================\n Confusion matrix : \n\n{} \n\n\n=======================\n Classification Report:\n\n{}".format(rf_cm, rf_report))
print("Sequence of randomly generated reviews are classified as:\nnegative\npositive\npositive\npositive\nnegative\n ")
print("The model managed to accurately predict: ")
for i in random_forest["prediction"]:
  print(i)

Final evaluation:

Comparing the three models, A, B and C.

Evaluation on unseen data: model B correctly predicted the sentiment of only 4 of the texts in the list of randomly generated reviews, as did model A, which also got only 4 right, whereas model C failed to make a single accurate prediction (as shown above).

Evaluation based on the confusion matrix and classification report:
The best model is in fact B, naive Bayes, although models A and B are almost identical. The reason is that B has slightly higher accuracy, which means it can make more accurate decisions on unseen data than A (as shown above).
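
The comparison on the five unseen reviews is done by eye above; a small helper can make the count explicit. In this sketch the `expected` labels follow the order stated in the article's code comments, while `score_predictions` and the `example_predictions` list are hypothetical and are not the actual outputs of models A, B or C.

```python
# Expected labels, in the order given in the article's comments
expected = ["negative", "positive", "positive", "positive", "negative"]

def score_predictions(expected, predicted):
    """Return (number of correct predictions, fraction correct)."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits, hits / len(expected)

# Hypothetical predictions for illustration only, not results from the article
example_predictions = ["negative", "positive", "positive", "positive", "positive"]

hits, accuracy = score_predictions(expected, example_predictions)
print(f"{hits}/{len(expected)} correct ({accuracy:.0%})")  # 4/5 correct (80%)
```

In the notebook, the same helper could be called with `log_reg["prediction"]`, `naive_bayes["prediction"]` and `random_forest["prediction"]`. With only five texts the result is a sanity check, so the confusion matrix and classification report remain the more reliable basis for choosing a model.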
