Sentiment Analytics on Amazon Reviews

Text analytics is the process of deriving high-quality information from text in order to observe patterns and trends. In the present age of social media, where a huge amount of data is generated every minute, it is important to extract useful information from that data.


There are 3 levels of text analytics:

Text

Text-based applications exploit the visceral components of text, i.e., words, phrases, document titles, etc.

The main role of analytics is to convert text into information.

This is done by classifying text, or summarizing it so as to reduce it to its main elements.

Analytics may even be used to discard irrelevant text, thereby condensing it into information with higher signal content.


Content

Content expands the domain of text to images, time, form of text (email, blog, page), format (HTML, XML, etc.), source, etc.

Text becomes enriched with content and acquires quality and veracity that may be exploited in analytics.

For example, financial information has more value when streamed from Dow Jones than from a financial blog.

Inherent meanings like sentiment can also be derived.


Context

Context refers to relationships between information items.

Closely monitoring related entities and classifying information accordingly improves the relevance of the information gathered.


Complexity vs. Quantity



The complexity of the algorithm increases as we try to derive inherent meanings like sentiment and relationships between information items.

Sources of text:

There are several sources from which to collect data, such as:

Real-time data collected from social media using data aggregators like Datasift and GNIP

Data from the open web collected through HTML parsing, using tools like Python or R

Data collected from proprietary databases of companies


Text Analytics using Python

Data can be collected from websites through HTML parsing. For example, reviews of a product on amazon.com can be collected and used to analyse customers' opinions of that product. The entire code is available here to follow along with the examples below.




Consider Interstellar on amazon.com to analyse the customer reviews posted on the website. Here we pull the reviews and then perform the analysis using Python.

Web-Scraping:

The URL of a review page on Amazon contains the term “pageNumber=”, which is helpful for iterating over multiple review pages with a for loop. Go to the 2nd review page and copy the URL, as the 1st page's URL does not include the term “pageNumber=”.

url = "https://www.amazon.com/Interstellar-Matthew-McConaughey/product-reviews/B00TU9UO1W/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2"

Create a function to download reviews from Amazon, say amazon().
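The original listing is not reproduced here, so what follows is a minimal sketch assembled from the numbered description below. It assumes the requests library for fetching pages (any HTTP client would do), and note that Amazon's HTML class names change over time, so the class attribute may need updating.

import re
import requests
from bs4 import BeautifulSoup

def amazon(url, n):
    all_reviews = []   # reviews collected across all pages
    for page in range(1, n + 1):
        # substitute the current page number into the URL's "pageNumber=" term
        page_url = re.sub(r"pageNumber=\d+", "pageNumber=" + str(page), url)
        html = requests.get(page_url).text
        soup = BeautifulSoup(html, "html5lib")
        # the review text sits inside elements carrying this class attribute
        for tag in soup.find_all(class_="a-size-base review-text review-text-content"):
            all_reviews.append(tag.get_text())
    return all_reviews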


  1. The above function takes the URL and n, the number of pages to pull reviews from, as its arguments

  2. Initially, an empty array all_reviews is created, and reviews are appended to it on every iteration of the for loop

  3. Amazon's HTML contains the class “a-size-base review-text review-text-content”, in which the review text is placed. The reviews can be pulled using this class attribute

  4. BeautifulSoup is the library used to extract the reviews from the HTML

  5. Finally, the all_reviews array is returned by the function

Install html5lib using pip (pip install html5lib) before using the above function.


reviews_list = amazon(url,10)
len(reviews_list)

The above call extracts reviews from 10 pages. Each page has 10 reviews, so a total of 100 reviews are extracted.


Text Analytics in Python:

Load the required libraries and convert the reviews to a dataframe.

import nltk
nltk.download()
import pandas as pd
import re                          # regular expressions
import matplotlib.pyplot as plt    # plots
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer  # text as features (document-term matrix)
from sklearn.feature_extraction.text import TfidfVectorizer  # text as features (TF-IDF document-term matrix)
from senticnet.senticnet import SenticNet

raw_reviews = pd.DataFrame({'reviews': reviews_list})
raw_reviews.shape      # examine dimensions/shape of dataframe
raw_reviews.head(10)

Output:

>>> raw_reviews.shape   # examine dimensions/shape of dataframe
(100, 1)
>>> raw_reviews.head(10)   # examine first n (i.e. 10 in this case) rows of dataframe
1    Before I get started let me add this disclaime...
2    This is an amazing movie, and visually one of ...
3    This review is focused on the 4K Ultra-HD Blu-...
4    Interstellar was, as most know was a film by C...
5    The Paramount Pictures movie entitled “Interst...
6    4K is so worth it. You don’t need another movi...
7    What makes Interstellar a great movie is the m...
8    How do you make a sci-fi movie, but also keep ...
9    I was glued to the movie and recommend it. As ...
10   Ugh. I ordered this by accident but the effort...

Text Cleaning:

The reviews pulled from Amazon may contain many characters or words that we don't need in the analysis, so the content should be cleaned of them; a quick trace of these rules on a sample string follows the checklist below.

❌ Remove HTML tags

❌ Remove punctuation marks

❌ Remove numbers

❌ Remove white space

✔️ Keep only alpha-numeric

✔️ Keep only ASCII characters
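As an illustration, here is how these rules play out on a made-up review string (the string itself is hypothetical):

import re

s = 'Great <br/>movie!!! 10/10, a café classic.'
s = re.sub('<.*?>', '', s)              # removes HTML tags
s = re.sub(r'[^\x00-\x7F]+', ' ', s)    # keeps only ASCII (the é becomes a space)
s = re.sub(' +', ' ', s).lower()        # collapses whitespace, lower-cases
s = re.sub('[^0-9a-zA-Z ]+', '', s)     # keeps only alphanumerics and spaces
print(s)                                # 'great movie 1010 a caf classic'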

def text_clean_one():
    for i in range(0, len(raw_reviews.reviews), 1):
        raw_reviews['reviews'].iloc[i] = re.sub("RT @[\w_]+: ", "", raw_reviews['reviews'].iloc[i])     # removes RT @<username>:
        raw_reviews['reviews'].iloc[i] = re.sub("<.*?>", "", raw_reviews['reviews'].iloc[i])            # removes HTML tags
        raw_reviews['reviews'].iloc[i] = re.sub(r'[^\x00-\x7F]+', ' ', raw_reviews['reviews'].iloc[i])  # keeps only ASCII
        raw_reviews['reviews'].iloc[i] = re.sub(' +', ' ', raw_reviews['reviews'].iloc[i])              # collapses runs of spaces
        raw_reviews['reviews'].iloc[i] = raw_reviews['reviews'].iloc[i].lower()                         # converts to lower case
        raw_reviews['reviews'].iloc[i] = re.sub("[^\w\s]", "", raw_reviews['reviews'].iloc[i])          # removes punctuation
        raw_reviews['reviews'].iloc[i] = re.sub('[^0-9a-zA-Z ]+', "", raw_reviews['reviews'].iloc[i])   # keeps only alphanumeric
    return raw_reviews
################# end of function ##################################

raw_reviews.head(10)               # before cleaning the data
clean_reviews = text_clean_one()   # cleaning function
clean_reviews.head(10)             # examine data after cleaning
len(clean_reviews)

Output:

>>> clean_reviews.head(10)   # examine data after cleaning
1    before i get started let me add this disclaime...
2    this is an amazing movie and visually one of t...
3    this review is focused on the 4k ultrahd blura...
4    interstellar was as most know was a film by ch...
5    the paramount pictures movie entitled interste...
6    4k is so worth it you dont need another movie ...
7    what makes interstellar a great movie is the m...
8    how do you make a scifi movie but also keep it...
9    i was glued to the movie and recommend it as a...
10   ugh i ordered this by accident but the effort ...
>>> len(clean_reviews)
100

Stopwords:

Stopwords are commonly used words such as “the”, “and”, and “an”, which are ignored in text analysis. In the case of the movie Interstellar, the reviews contain repeated words like “movie” and “film” which are not useful. So, all the words to be omitted are manually added to a text file (say stopwords.txt) and excluded along with some predefined stopwords.




stopwords_user_file = open("C:\\DSA\\stopwords.txt")
stopwords_user = set(stopwords_user_file.read().split())    # reading words from the file
from nltk.corpus import stopwords
stopwords_english = set(stopwords.words('english'))         # inbuilt English stop words
stopwords = list(stopwords_user.union(stopwords_english))   # unique words from both lists
len(stopwords)

Output:

>>> len(stopwords)
640

Removing stopwords from clean_reviews

clean_reviews['without_stopwords'] = clean_reviews['reviews'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
clean_reviews_final = pd.DataFrame(clean_reviews.without_stopwords)   # dataframe with cleaned reviews and stopwords removed
clean_reviews_final.head(5)
len(clean_reviews_final)

Output:

>>> clean_reviews_final.head(5)
                                   without_stopwords
1    started add disclaimer watch entertained dont ...
2    amazing visually 4k effect postproduction 4k 4...
3    review focused 4k ultrahd disc viewed 65 lg ol...
4    batman franchise plot based premise point dist...
5    paramount pictures entitled epic science ficti...
>>> len(clean_reviews_final)
100

Removal of empty reviews (documents):

for j in range(1, len(clean_reviews_final), 1):
    if len(word_tokenize(str(clean_reviews_final.without_stopwords[j]))) < 1:
        clean_reviews_final = clean_reviews_final.drop([j])
len(clean_reviews_final)
Output:
>>> len(clean_reviews_final)
100

Document-term matrix:

A document-term (or term-document) matrix records the frequency of each term across a collection of documents. In a document-term matrix, rows represent the documents in the collection and columns represent terms; the term-document matrix is its transpose.
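As a small illustration (the two documents here are made up), CountVectorizer from scikit-learn builds such a matrix directly:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["great space movie", "great story"]   # two toy documents
vec = CountVectorizer()
dtm = pd.DataFrame(vec.fit_transform(docs).toarray(),
                   columns=vec.get_feature_names())   # get_feature_names_out() in newer scikit-learn
print(dtm)
#    great  movie  space  story
# 0      1      1      1      0
# 1      1      0      0      1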



The DTM (Document-Term Matrix) can be created using the CountVectorizer() of the scikit-learn library:

clean_reviews_series = clean_reviews_final.without_stopwords   # vectorizer needs a Series object
vectorizer = CountVectorizer()   # initiating CountVectorizer (with default parameters)
document_term_matrix = vectorizer.fit_transform(clean_reviews_series)   # document-term matrix
document_term_matrix = pd.DataFrame(document_term_matrix.toarray(),
                                    columns=vectorizer.get_feature_names())   # DTM to dataframe
document_term_matrix.shape
document_term_matrix.head(10)
Output:
>>> document_term_matrix.shape
(100, 1958)
>>> document_term_matrix.head(10)
   100  1080p  11  12  15  20  ...  youll  young  youre  youve  yr  zimmer
0    1      0   0   0   0   0  ...      0      0      2      0   0       0
1    1      0   0   0   1   0  ...      0      1      0      0   0       0
2    0      1   0   0   0   0  ...      0      0      0      0   0       0
3    0      0   0   0   0   0  ...      0      0      0      0   0       0
4    0      0   0   0   0   0  ...      0      0      0      0   0       0
5    0      0   0   0   0   0  ...      0      0      0      0   0       0
6    0      0   0   0   0   0  ...      0      0      0      0   0       0
7    0      0   0   0   0   0  ...      0      0      0      0   0       0
8    0      0   0   0   0   0  ...      0      0      0      0   0       0
9    0      0   0   0   0   0  ...      0      0      0      0   0       0
[10 rows x 1958 columns]

Word Cloud:

A word cloud (tag cloud, or weighted list in visual design) is a visual representation of text data, typically used to visualize free-form text. Tags are usually single words, and the importance of each tag is shown by its font size or color. This format is useful for quickly perceiving the most prominent terms.



Word cloud using frequencies of words:

words = dict(document_term_matrix.apply(sum, axis=0))   # fit_words() needs a dictionary object
wordcloud = WordCloud(max_font_size=40, max_words=50,
                      background_color="white").fit_words(words)   # plot word cloud from a dictionary
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Output:


From the above word cloud, we can observe the most frequently used words in the reviews of Interstellar, such as “story”, “space”, “time”, and “science”.

N-Gram Analysis:

In text analysis, a single word sometimes does not make sense on its own, while that word together with the next one may explain something; in such cases it is preferable to use the two words together. An n-gram is a contiguous sequence of n words in a given text or document.

An n-gram of size 1 (1 word) is called a unigram, one of size 2 (2 words) a bigram, and one of size 3 a trigram.

Consider the text “Interstellar is a 2014 film”

Unigrams: “Interstellar” “is” “a” “2014” “film”

Bigrams: “Interstellar is” “is a” “a 2014” “2014 film”

Trigrams: “Interstellar is a” “is a 2014” “a 2014 film”
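These n-grams can also be generated programmatically; a quick sketch using NLTK's ngrams helper (NLTK is already imported above):

from nltk.util import ngrams

tokens = "Interstellar is a 2014 film".split()
list(ngrams(tokens, 2))   # bigrams
# [('Interstellar', 'is'), ('is', 'a'), ('a', '2014'), ('2014', 'film')]
list(ngrams(tokens, 3))   # trigrams
# [('Interstellar', 'is', 'a'), ('is', 'a', '2014'), ('a', '2014', 'film')]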


Word cloud using frequencies of unigram, bigram words:

vectorizerng = CountVectorizer(ngram_range=(1,2), min_df=0.01)   # unigrams and bigrams; min_df=0.01 drops n-grams in fewer than 1% of documents
document_term_matrix_ng = vectorizerng.fit_transform(clean_reviews_series)   # document-term matrix
document_term_matrix_ng = pd.DataFrame(document_term_matrix_ng.toarray(),
                                       columns=vectorizerng.get_feature_names())   # DTM to dataframe

# word cloud from word frequencies
words = dict(document_term_matrix_ng.apply(sum, axis=0))
wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").fit_words(words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()



In the above word cloud, several unigrams can be observed along with a few bigrams such as “black hole”, “science fiction”, and “special effects”.


TF-IDF (Term Frequency - Inverse Document Frequency)

Simply counting term frequencies, as in the document-term matrix, suffers from a critical problem: all terms are considered equally important when assessing relevance to a query.

For example, a collection of documents on the auto industry is likely to have the term auto in almost every document. To address this, a mechanism can be introduced for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination. This is done by scaling down the weights of terms with high collection frequency.

Term Frequency (TF) is the ratio of the number of times a word occurs in a document to the total number of words in the document.

Inverse Document Frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing the word.
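In symbols (the textbook form; note that scikit-learn's TfidfVectorizer applies a smoothed variant of the idf term by default):

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$$

where $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$.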




The product of TF and IDF gives the TF-IDF. In other words, we assign to term ‘t’ a weight in the document d that is:

Highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents)

Lower when the term occurs in many documents (thus offering a less pronounced relevance signal)

Lowest when the term occurs in virtually all documents
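For instance (hypothetical numbers, natural logarithm): with N = 100 reviews, a term that makes up 5% of one review and appears in only 2 reviews receives the weight 0.05 × ln(100/2) ≈ 0.20, while a term that appears in all 100 reviews has idf = ln(100/100) = 0, so its weight is 0 no matter how often it occurs.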


Word cloud using TF-IDF of words:

# Create DTM with TF-IDF
vectorizeridf = TfidfVectorizer()   # initiating TfidfVectorizer (with default parameters)
document_term_matrix_idf = vectorizeridf.fit_transform(clean_reviews_series)   # document-term matrix
document_term_matrix_idf = pd.DataFrame(document_term_matrix_idf.toarray(),
                                        columns=vectorizeridf.get_feature_names())   # DTM to dataframe
document_term_matrix_idf.shape
document_term_matrix_idf.head(10)

# word cloud using TF-IDF of words
words = dict(document_term_matrix_idf.apply(sum, axis=0))   # fit_words() needs a dictionary object
wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").fit_words(words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Sentiment Mining:

Sentiment is a thought, view, or attitude, especially one based mainly on emotion rather than reason.

Sentiment Analysis or opinion mining is the use of natural language processing (NLP) and computational techniques to automate the extraction or classification of sentiment from typically unstructured text.

Words (unigrams), phrases/n-grams, and sentences can be used as features for sentiment analysis. The features can be interpreted for sentiment detection using:

Bag of words: a model representation used in NLP and Information Retrieval (IR) in which text is represented as a bag of its words, disregarding word order

Annotated lexicons from resources like WordNet, SentiWordNet

Syntactic patterns

Challenges of sentiment mining:

Difficulty in recognising subtle sentiment expressions like irony or sarcasm

Words/phrases can mean different things in different contexts and domains

The effect of syntax on semantics


Word clouds with positive and negative words:

sn = SenticNet()
positive_words = []
negative_words = []
for word in vectorizer.get_feature_names():
    if word in sn.data:
        if sn.polarity_value(word) == 'positive':
            positive_words.append(word)
        if sn.polarity_value(word) == 'negative':
            negative_words.append(word)
len(positive_words)
len(negative_words)
positive_words = dict(document_term_matrix[positive_words].apply(sum, axis=0))
negative_words = dict(document_term_matrix[negative_words].apply(sum, axis=0))
>>> len(positive_words)
832
>>> len(negative_words)
315
# positive-words word cloud using word frequencies
wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").fit_words(positive_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


People who liked the movie used words like enjoy, great, loved, and amazing in their reviews.


# negative-words word cloud using word frequencies
wordcloud = WordCloud(max_font_size=40, max_words=50,
                      background_color="white").fit_words(negative_words)   # fit_words() needs a dictionary object
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Terms like bad, difficult, and slow can be observed, which indicate negative reviews.



Business Lens:

Consider a scenario where a company sells a television on Amazon and wants to analyze customer reviews to understand customers' opinions, both positive and negative.


The following television is chosen for this analysis: amazon.com_television

First, let’s collect the reviews and convert them to dataframe.

url = "https://www.amazon.com/Sceptre-E246BD-SMQK-24-0-Combination-Black/product-reviews/B01JY3ND8Y/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2" reviews_list = amazon.com(url,10) raw_reviews = pd.DataFrame({'reviews': reviews_list}) raw_reviews.shape # examine dimensions/shape of dataframe. raw_reviews.head(10)

Output:
>>> raw_reviews.shape   # examine dimensions/shape of dataframe
(100, 1)
>>> raw_reviews.head(10)   # examine first n (i.e. 10 in this case) rows of dataframe
                                               reviews
1    I bought this tv with the built in dvd player ...
2    A lot of people have said that the DVD player ...
3    If you are one of those people who likes high-...
4    Bought this Sceptre 720p DVD combo TV for $119...
5    After reading all the reviews, it seems I’m th...
6    Bought this for father who is in a nursing hom...
7    Like it very much. Picture is very clear and s...
8    I purchased this tv for the sole purpose of us...
9    am very happy with this tv. i am not much of a...
10   Very convenient and compact. Has a host of opt...

Clean the reviews (remove HTML tags, punctuation, extra spaces, etc.):

def text_clean_one():
    for i in range(0, len(raw_reviews.reviews), 1):
        raw_reviews['reviews'].iloc[i] = re.sub("<.*?>", "", raw_reviews['reviews'].iloc[i])            # removes HTML tags
        raw_reviews['reviews'].iloc[i] = re.sub(r'[^\x00-\x7F]+', ' ', raw_reviews['reviews'].iloc[i])  # keeps only ASCII
        raw_reviews['reviews'].iloc[i] = re.sub(' +', ' ', raw_reviews['reviews'].iloc[i])              # collapses runs of spaces
        raw_reviews['reviews'].iloc[i] = raw_reviews['reviews'].iloc[i].lower()                         # converts to lower case
        raw_reviews['reviews'].iloc[i] = re.sub("[^\w\s]", "", raw_reviews['reviews'].iloc[i])          # removes punctuation
        raw_reviews['reviews'].iloc[i] = re.sub('[^0-9a-zA-Z ]+', "", raw_reviews['reviews'].iloc[i])   # keeps only alphanumeric
    return raw_reviews

clean_reviews = text_clean_one()   # cleaning function
clean_reviews.head(10)             # examine data after cleaning
len(clean_reviews)

Output:
>>> clean_reviews.head(10)   # examine data after cleaning
                                               reviews
1    i bought this tv with the built in dvd player ...
2    a lot of people have said that the dvd player ...
3    if you are one of those people who likes highq...
4    bought this sceptre 720p dvd combo tv for 1199...
5    after reading all the reviews it seems i m the...
6    bought this for father who is in a nursing hom...
7    like it very much picture is very clear and sh...
8    i purchased this tv for the sole purpose of us...
9    am very happy with this tv i am not much of a ...
10   very convenient and compact has a host of opti...
>>> len(clean_reviews)
100

Remove stopwords using the inbuilt English stopwords, and manually create a stopwords.txt file if necessary. Building the stopword list is an iterative process: as we plot word clouds, new words that seem unnecessary can be added to the stopwords file.

# stopwords list preparation
# locate user-defined stopwords list file
stopwords_user_file = open("C:\\stopwords_tv.txt")
stopwords_user = set(stopwords_user_file.read().split())    # reading words from the file
stopwords_english = set(stopwords.words('english'))         # inbuilt English stop words
stopwords = list(stopwords_user.union(stopwords_english))   # unique words from both lists
len(stopwords)

# removing stopwords from clean_reviews
clean_reviews['without_stopwords'] = clean_reviews['reviews'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
clean_reviews_final = pd.DataFrame(clean_reviews.without_stopwords)   # dataframe with cleaned reviews and stopwords removed
clean_reviews_final.head(5)
len(clean_reviews_final)

Output:
>>> len(stopwords)
191
>>> clean_reviews_final.head(5)
                                   without_stopwords
1    bought tv built dvd player rv knew went campin...
2    lot people said dvd player doesnt work well th...
3    one people likes highquality sound digital res...
4    bought sceptre 720p dvd combo tv 11999 wallmou...
5    reading reviews seems one encounter problem or...
>>> len(clean_reviews_final)
100

Remove empty reviews:

# removal of empty reviews (documents)
for j in range(1, len(clean_reviews_final), 1):
    if len(word_tokenize(str(clean_reviews_final.without_stopwords[j]))) < 1:
        clean_reviews_final = clean_reviews_final.drop([j])
len(clean_reviews_final)
>>> len(clean_reviews_final)
100

Create Document Term Matrix (DTM) with unigrams and bigrams:

clean_reviews_series = clean_reviews_final.without_stopwords   # vectorizer needs a Series object
vectorizerng = CountVectorizer(ngram_range=(1,2), min_df=0.01)   # unigrams and bigrams
document_term_matrix_ng = vectorizerng.fit_transform(clean_reviews_series)   # document-term matrix
document_term_matrix_ng = pd.DataFrame(document_term_matrix_ng.toarray(),
                                       columns=vectorizerng.get_feature_names())   # DTM to dataframe
document_term_matrix_ng.shape
document_term_matrix_ng.head(10)

Output:
>>> document_term_matrix_ng.shape
(100, 3814)
>>> document_term_matrix_ng.head(10)
   10  10 stars  100  100 bad  ...  youre  youre buying  zero  zero last
0   0         0    0        0  ...      0             0     0          0
1   0         0    0        0  ...      0             0     0          0
2   0         0    0        0  ...      0             0     0          0
3   0         0    0        0  ...      0             0     0          0
4   0         0    0        0  ...      0             0     0          0
5   0         0    0        0  ...      0             0     0          0
6   0         0    0        0  ...      0             0     0          0
7   0         0    0        0  ...      0             0     0          0
8   0         0    0        0  ...      0             0     0          0
9   1         1    0        0  ...      0             0     0          0
[10 rows x 3814 columns]

Create word cloud with above DTM:

# word cloud from word frequencies
words = dict(document_term_matrix_ng.apply(sum, axis=0))   # fit_words() needs a dictionary object
wordcloud = WordCloud(max_font_size=40, max_words=50,
                      background_color="white").fit_words(words)   # plot word cloud from a dictionary
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


Words like tv, amazon, im, and use are not useful in this context, so they can be added to the stopwords list and a new word cloud created (a sketch follows). This process is repeated until all unnecessary words are eliminated.
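One hypothetical way to do this in-session, instead of editing the stopwords file, is to extend the list and rebuild the column (variable names as above):

stopwords.extend(['tv', 'amazon', 'im', 'use'])   # domain-specific additions
clean_reviews['without_stopwords'] = clean_reviews['reviews'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stopwords))
# ...then rebuild the DTM and re-plot the word cloud as above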

Word cloud after including a few more stopwords:



Many customers mentioned “dvd” in their reviews. Checking the product description and a few reviews shows that this television has an inbuilt DVD player, and many customers complain that it does not function properly.


DTM with TF-IDF:

vectorizeridf = TfidfVectorizer()   # initiating TfidfVectorizer (with default parameters)
document_term_matrix_idf = vectorizeridf.fit_transform(clean_reviews_series)   # document-term matrix
document_term_matrix_idf = pd.DataFrame(document_term_matrix_idf.toarray(),
                                        columns=vectorizeridf.get_feature_names())   # DTM to dataframe
document_term_matrix_idf.shape
document_term_matrix_idf.head(10)

Output

>>> document_term_matrix_idf.shape
(100, 1074)
>>> document_term_matrix_idf.head(10)
        10  100  10000  1080p   11  ...  year  yet   yo  youre  zero
0  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
1  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
2  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
3  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
4  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
5  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
6  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
7  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
8  0.00000  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
9  0.15347  0.0    0.0    0.0  0.0  ...   0.0  0.0  0.0    0.0   0.0
[10 rows x 1074 columns]

Word cloud using TF-IDF:

# word cloud using TF-IDF of words
words = dict(document_term_matrix_idf.apply(sum, axis=0))   # fit_words() needs a dictionary object
wordcloud = WordCloud(max_font_size=40, max_words=50,
                      background_color="white").fit_words(words)   # plot word cloud from a dictionary
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


Again, unnecessary words can be omitted using the stopwords list, and the process repeated to create more useful word clouds.


The word “freezes” appears in the word cloud. Checking for “freezes” in the sheet of collected reviews shows that many customers had problems with the inbuilt DVD player freezing frequently; see the lookup sketch below. A few customers also reported problems with sound and picture quality, but they still liked the product because of its low price.
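Instead of scanning the sheet by hand, the matching reviews can also be pulled directly from the dataframe (a hypothetical helper; column names as above):

freeze_hits = clean_reviews[clean_reviews['reviews'].str.contains('freez')]
len(freeze_hits)          # number of reviews mentioning freezing
freeze_hits['reviews']    # read them in full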

Sentiment Analysis:

# Note: this reuses vectorizer and document_term_matrix, so both are assumed
# to have been rebuilt from the TV reviews with CountVectorizer(), as in the
# movie example above.
sn = SenticNet()
positive_words = []
negative_words = []
for word in vectorizer.get_feature_names():
    if word in sn.data:
        if sn.polarity_value(word) == 'positive':
            positive_words.append(word)
        if sn.polarity_value(word) == 'negative':
            negative_words.append(word)
len(positive_words)
len(negative_words)
positive_words = dict(document_term_matrix[positive_words].apply(sum, axis=0))
negative_words = dict(document_term_matrix[negative_words].apply(sum, axis=0))

Output

>>> len(positive_words)
467
>>> len(negative_words)
185

Positive word cloud:

Word cloud of positive words using frequency of words

wordcloud = WordCloud(max_font_size=40, max_words=50, background_color="white").fit_words(positive_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


The words good, nice, excellent, and great indicate positive customer opinion, but this word cloud is not a reliable indicator of positive sentiment: words like dvd, remote, sound, and picture appear in it even though customers use them to complain about quality.


Negative word cloud:

wordcloud = WordCloud(max_font_size=40, max_words=50,
                      background_color="white").fit_words(negative_words)   # fit_words() needs a dictionary object
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


Customers use negative words like defective, freeze, horrible, difficult, and poor. Checking these terms in the reviews sheet shows that most of these customers faced issues with the inbuilt DVD player, so it can be improved in future products to reduce customer backlash. A caveat: sentiment analysis struggles with sarcastic reviews, so it should be used with care.



