Categorize Yelp Reviews

4 minute read

Objective: Given the Yelp reviews dataset, classify reviews into 1-star or 5-star categories based on the text content of the reviews.

Source: Udemy | Python for Data Science and Machine Learning Bootcamp
Data used in the analysis below: Yelp Review Data Set from Kaggle.

#importing libraries to be used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#view plots in jupyter notebook
%matplotlib inline
sns.set_style('whitegrid') #setting style for plots, optional
yelp = pd.read_csv('yelp.csv')
#viewing the dataset
yelp.head(3)
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
yelp.describe()
stars cool useful funny
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 3.777500 0.876800 1.409300 0.701300
std 1.214636 2.067861 2.336647 1.907942
min 1.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 0.000000 0.000000
50% 4.000000 0.000000 1.000000 0.000000
75% 5.000000 1.000000 2.000000 1.000000
max 5.000000 77.000000 76.000000 57.000000
yelp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   business_id  10000 non-null  object
 1   date         10000 non-null  object
 2   review_id    10000 non-null  object
 3   stars        10000 non-null  int64
 4   text         10000 non-null  object
 5   type         10000 non-null  object
 6   user_id      10000 non-null  object
 7   cool         10000 non-null  int64
 8   useful       10000 non-null  int64
 9   funny        10000 non-null  int64
dtypes: int64(4), object(6)
memory usage: 781.4+ KB
yelp['text length'] = yelp['text'].apply(len) #getting the length of the reviews
yelp.head(3)
business_id date review_id stars text type user_id cool useful funny text length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0 76
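As an aside (a style note, not part of the original notebook), pandas' vectorized string accessor computes the same feature without an explicit apply:

yelp['text length'] = yelp['text'].str.len() #equivalent to apply(len)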

Some EDA (Exploratory Data Analysis)

g = sns.FacetGrid(data=yelp,col='stars')
g.map(plt.hist,'text length')

[Plot: histograms of review text length for each star rating]

plt.figure(figsize=(12,5))
sns.boxplot(x='stars',y='text length',data=yelp)

[Plot: box plot of review text length by star rating]

plt.figure(figsize=(10,5))
sns.countplot(x='stars', data=yelp) #keyword arguments are required in newer seaborn versions

[Plot: count of reviews per star rating]

#Group by star rating and average the numeric columns
yelp_grp = yelp.groupby('stars').mean(numeric_only=True) #numeric_only is required in newer pandas versions
yelp_grp
cool useful funny text length
stars
1 0.576769 1.604806 1.056075 826.515354
2 0.719525 1.563107 0.875944 842.256742
3 0.788501 1.306639 0.694730 758.498289
4 0.954623 1.395916 0.670448 712.923142
5 0.944261 1.381780 0.608631 624.999101
yelp_grp.corr() #check correlation
cool useful funny text length
cool 1.000000 -0.743329 -0.944939 -0.857664
useful -0.743329 1.000000 0.894506 0.699881
funny -0.944939 0.894506 1.000000 0.843461
text length -0.857664 0.699881 0.843461 1.000000
sns.heatmap(yelp_grp.corr(),cmap='coolwarm',annot=True)

[Plot: correlation heatmap of the numeric features]

NLP Classification

#working with only 1 or 5 star reviews
yelp_class = yelp[(yelp.stars == 1) | (yelp.stars == 5)]
X = yelp_class['text']
y = yelp_class['stars']
#import CountVectorizer to convert text into token-count vectors
from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer()
X = CV.fit_transform(X)
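A quick sanity check (not in the original notebook) on what the vectorizer produced: X is now a sparse document-term matrix with one row per review and one column per vocabulary term, and the two classes are far from balanced.

#inspect the vectorized data; the exact vocabulary size depends on the corpus
print(X.shape)          #(number of reviews, number of vocabulary terms)
print(y.value_counts()) #5-star reviews heavily outnumber 1-star reviews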
#splitting data into train test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
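Given that imbalance, an optional tweak (not used in the runs below, which keep the split above) is a stratified split, so both sets preserve the 1-star/5-star ratio:

#optional: keep class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101, stratify=y)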
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train,y_train)
MultinomialNB()
predict = nb.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,predict))
[[159  69]
 [ 22 976]]
print(classification_report(y_test,predict))
              precision    recall  f1-score   support

           1       0.88      0.70      0.78       228
           5       0.93      0.98      0.96       998

    accuracy                           0.93      1226
   macro avg       0.91      0.84      0.87      1226
weighted avg       0.92      0.93      0.92      1226

Applying TF-IDF text pre-processing

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bow', CountVectorizer()),
                     ('Tf-IDF weights', TfidfTransformer()),
                     ('Naive Bayes classifier', MultinomialNB())])
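To see what the TfidfTransformer step contributes, here is a toy illustration (the corpus is made up, not from the dataset): raw counts are reweighted so that words appearing in many documents count for less, and each row is L2-normalized.

#toy example of TF-IDF re-weighting on a three-document corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
docs = ['good food good service', 'bad food', 'good place']
counts = CountVectorizer().fit_transform(docs)
weights = TfidfTransformer().fit_transform(counts)
print(weights.toarray().round(2)) #rows are L2-normalized TF-IDF vectors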

Since all the pre-processing is now incorporated within the pipeline, it expects raw text as input, so we need to re-create the train/test split from the unvectorized reviews:

X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
pipeline.fit(X_train,y_train)
Pipeline(steps=[('bow', CountVectorizer()),
                ('Tf-IDF weights', TfidfTransformer()),
                ('Naive Bayes classifier', MultinomialNB())])
n_predict = pipeline.predict(X_test)
print(confusion_matrix(y_test,n_predict))
[[  0 228]
 [  0 998]]
print(classification_report(y_test,n_predict))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00       228
           5       0.81      1.00      0.90       998

    accuracy                           0.81      1226
   macro avg       0.41      0.50      0.45      1226
weighted avg       0.66      0.81      0.73      1226



/Users/vanya/opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Incorporating TF-IDF made things considerably worse: the confusion matrix shows the model now predicts 5 stars for every single review, which is why precision and recall for the 1-star class drop to zero (and why sklearn emits the UndefinedMetricWarning above). We can try a different classifier with TF-IDF, or drop TF-IDF entirely, as before.
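The second option is sketched below (not run in the original post): dropping the TfidfTransformer step turns the pipeline back into the plain bag-of-words Naive Bayes model, which should reproduce the earlier results since the split is identical.

#Naive Bayes on raw counts, wrapped in a pipeline for convenience
nb_pipeline = Pipeline([('bow', CountVectorizer()),
                        ('Naive Bayes classifier', MultinomialNB())])
nb_pipeline.fit(X_train, y_train)
print(classification_report(y_test, nb_pipeline.predict(X_test)))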

Replacing Naive Bayes with an SVM

from sklearn.svm import SVC
pipeline = Pipeline([('bow', CountVectorizer()),
                     ('Tf-IDF weights', TfidfTransformer()),
                     ('SVM classifier', SVC())])
pipeline.fit(X_train,y_train)
n_predict = pipeline.predict(X_test)
print(confusion_matrix(y_test,n_predict))
[[134  94]
 [  6 992]]
#Comparing SVM with TF-IDF against Naive Bayes without TF-IDF
print("SVM and TF-IDF")
print(classification_report(y_test,n_predict))
print('Naive Bayes \n')
print(classification_report(y_test,predict))
SVM and TF-IDF
              precision    recall  f1-score   support

           1       0.96      0.59      0.73       228
           5       0.91      0.99      0.95       998

    accuracy                           0.92      1226
   macro avg       0.94      0.79      0.84      1226
weighted avg       0.92      0.92      0.91      1226

Naive Bayes

              precision    recall  f1-score   support

           1       0.88      0.70      0.78       228
           5       0.93      0.98      0.96       998

    accuracy                           0.93      1226
   macro avg       0.91      0.84      0.87      1226
weighted avg       0.92      0.93      0.92      1226

Result: We were able to train models that categorize reviews into 1-star or 5-star ratings.

Naive Bayes combined with TF-IDF performed poorly on this dataset, while the SVM with TF-IDF achieved noticeably better precision on 1-star reviews (0.96 vs 0.88). Plain Naive Bayes on raw word counts, without TF-IDF, also gave strong results, with better 1-star recall (0.70 vs 0.59) than the SVM.

Which model to apply depends on which metric matters most for the use case!
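As a closing illustration (the review texts here are made up, not from the dataset), the fitted SVM pipeline can score unseen reviews directly, since vectorization and weighting happen inside it:

#hypothetical reviews; the pipeline vectorizes the raw text internally
new_reviews = ['Amazing food and friendly staff, will definitely come back!',
               'Terrible service and cold food, never again.']
print(pipeline.predict(new_reviews)) #expected output: [5 1]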