Glassdoor Company Review Analysis - Part 1 Data Cleaning & EDA

Dataset

This dataset is about a large company's review on Glassdoor. We web scrapped the data from Glassdoor with its written permission. There are 6675 reviews with numerical ratings and text comments. To make this analysis general, the company name has been replaced with "Anonymous".

Problem Statement

The goal of this analysis is to analyze anonymous employee reviews submitted on Glassdoor, with the hope to help employers gain real insights on their employee engagement. The analysis will answer these questions: what employees like and dislike about their company? Has the company’s reputation gotten better or worse in the recent year? Which job families have the highest and lowest satisfaction rates? What are the keywords that people say about this company? What can this company do to improve employee engagement?

Introduction

This is part 1 of Glassdoor company review analysis project. In this part, we are going to extract categorical information from existing data, make unstructured data into structured data, clean data, and explore numerical data with visualization.

Data Importing

In [1]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
In [2]:
# use the first column as index, parse 'Comment Datetime' to datetime object

df = pd.read_csv('company1_review.csv', index_col=[0], parse_dates=['Comment Datetime']) 
df.head()
Out[2]:
Con Career Opportunities Author Location Summary Senior Management Pro Recommendation Culture & Values Author Years Overall Rating Compensation and Benefits Author Title Work/Life Balance Comment Datetime
0 - potentially slow upward movement beyond the ... 5.0 Salt Lake City, UT "Awesome Culture" 4.0 - awesome introduction into finance (no prior ... Recommends,Positive Outlook,Approves of CEO 5.0 I have been working at Anonymous Investments f... 5.0 5.0 Current Employee - Investment Solutions Repres... 5.0 2020-06-18
1 Few jobs are off the phones. Call after call g... NaN NaN "Great job" NaN Great benefits, great people. Work from home f... Recommends,Positive Outlook,Approves of CEO NaN I have been working at Anonymous Investments f... 5.0 NaN Current Employee - Investment Solutions Repres... NaN 2020-06-16
2 Phone call center, must be good on the phones. 5.0 American Fork, UT "Great place to work" 4.0 Excellent benefits, plenty of opportunity to a... NaN 5.0 I worked at Anonymous Investments full-time fo... 5.0 5.0 Former Employee - High Net Worth Service Assoc... 5.0 2020-06-17
3 Easy to get fired. They don’t always give warn... 4.0 NaN "Good environment, not so stable." 3.0 Good environment \n\nNice scenery, cafeteria, ... Recommends 3.0 I worked at Anonymous Investments full-time 4.0 5.0 Former Employee - Financial Associate 3.0 2020-06-11
4 No cons in my opinion 5.0 Los Angeles, CA "Great company" 4.0 Good work environment in my opinion Recommends,Positive Outlook,Approves of CEO 5.0 I worked at Anonymous Investments full-time 5.0 4.0 Former Employee - Investment Consultant 5.0 2020-06-11
In [3]:
df.shape
Out[3]:
(6675, 14)
In [4]:
df.columns
Out[4]:
Index(['Con', 'Career Opportunities', 'Author Location', 'Summary',
       'Senior Management', 'Pro', 'Recommendation', 'Culture & Values',
       'Author Years', 'Overall Rating', 'Compensation and Benefits',
       'Author Title', 'Work/Life Balance', 'Comment Datetime'],
      dtype='object')
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6675 entries, 0 to 6674
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Con                        6675 non-null   object        
 1   Career Opportunities       5952 non-null   float64       
 2   Author Location            4054 non-null   object        
 3   Summary                    6675 non-null   object        
 4   Senior Management          5915 non-null   float64       
 5   Pro                        6675 non-null   object        
 6   Recommendation             5776 non-null   object        
 7   Culture & Values           5097 non-null   float64       
 8   Author Years               6675 non-null   object        
 9   Overall Rating             6675 non-null   float64       
 10  Compensation and Benefits  5961 non-null   float64       
 11  Author Title               5649 non-null   object        
 12  Work/Life Balance          5956 non-null   float64       
 13  Comment Datetime           6675 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(6), object(7)
memory usage: 782.2+ KB

1. Data Cleaning

In [6]:
# remove duplicate records

df.drop_duplicates(inplace = True)
len(df)
Out[6]:
6386
In [7]:
# check missing values

df.isnull().sum()
Out[7]:
Con                             0
Career Opportunities          697
Author Location              2507
Summary                         0
Senior Management             733
Pro                             0
Recommendation                857
Culture & Values             1515
Author Years                    0
Overall Rating                  0
Compensation and Benefits     689
Author Title                  980
Work/Life Balance             695
Comment Datetime                0
dtype: int64

Strategy of Handling Missing Values

Note: There are 8 columns with missing values. These columns include numerical rating columns and categorical string columns. We could have filled the missing ratings with each column's median value, however, in order to make this analysis more accurate, our strategy is to leave NaN numerical rating as is, and fill the missing categorical information with 'Unknown', so we can differentiate it from the originally existing data.

In [8]:
# sort the dataframe so all string columns are listed together

rating_columns = df.select_dtypes(include = ['float64'])
string_columns = df.select_dtypes(exclude = ['float64'])
df_column_sorted = pd.concat([string_columns, rating_columns], axis = 1)

df_column_sorted.head()
Out[8]:
Con Author Location Summary Pro Recommendation Author Years Author Title Comment Datetime Career Opportunities Senior Management Culture & Values Overall Rating Compensation and Benefits Work/Life Balance
0 - potentially slow upward movement beyond the ... Salt Lake City, UT "Awesome Culture" - awesome introduction into finance (no prior ... Recommends,Positive Outlook,Approves of CEO I have been working at Anonymous Investments f... Current Employee - Investment Solutions Repres... 2020-06-18 5.0 4.0 5.0 5.0 5.0 5.0
1 Few jobs are off the phones. Call after call g... NaN "Great job" Great benefits, great people. Work from home f... Recommends,Positive Outlook,Approves of CEO I have been working at Anonymous Investments f... Current Employee - Investment Solutions Repres... 2020-06-16 NaN NaN NaN 5.0 NaN NaN
2 Phone call center, must be good on the phones. American Fork, UT "Great place to work" Excellent benefits, plenty of opportunity to a... NaN I worked at Anonymous Investments full-time fo... Former Employee - High Net Worth Service Assoc... 2020-06-17 5.0 4.0 5.0 5.0 5.0 5.0
3 Easy to get fired. They don’t always give warn... NaN "Good environment, not so stable." Good environment \n\nNice scenery, cafeteria, ... Recommends I worked at Anonymous Investments full-time Former Employee - Financial Associate 2020-06-11 4.0 3.0 3.0 4.0 5.0 3.0
4 No cons in my opinion Los Angeles, CA "Great company" Good work environment in my opinion Recommends,Positive Outlook,Approves of CEO I worked at Anonymous Investments full-time Former Employee - Investment Consultant 2020-06-11 5.0 4.0 5.0 5.0 4.0 5.0

Let's take a look at a few text samples.

In [9]:
# randomly print out 3 records that have text

for i in range(4,7):
    print('Author Years:', df['Author Years'][i])
    print('Author Title:', df['Author Title'][i])
    print('Author Summary:', df['Summary'][i])
    print('Author Pro:', df['Pro'][i])
    print('Author Con:', df['Con'][i])
    print('====================')
Author Years: I worked at Anonymous Investments full-time
Author Title: Former Employee - Investment Consultant
Author Summary: "Great company"
Author Pro: Good work environment in my opinion
Author Con: No cons in my opinion
====================
Author Years: I have been working at Anonymous Investments full-time
Author Title: Current Employee - Marketing Associate
Author Summary: "Great place to work"
Author Pro: Great benefits, inclusive community, and highly collaborate people to work with.
Author Con: I don’t have any cons
====================
Author Years: I have been working at Anonymous Investments full-time for more than 8 years
Author Title: Current Employee - Senior Software Test Engineer
Author Summary: "Good company to work for"
Author Pro: Good culture and opportunities to grow
Author Con: Moving jobs to offshore for cost savings
====================

Extract Categorical Information

1. Extract information of Current/Former Employee from column Author Year.

Of note, there is a column "Author Title" which has "Current Employee" or "Former Employee" information, but there are more than 2000 null values. The other column "Author Years" column has no null value and the text contains relevant information about current/former employee flags, notice the string in column "Author Years" always starts with either "I have been working" or "I worked at".Hence, we can use "Author Year" to extract employee type data.

2. Extract employee Tenure from column Author Years.

Below are some sample data:

  • "I have been working at Anonymous Investments full-time for more than a year"
  • "I have been working at Anonymous Investments full-time for more than 3 years"
  • "I worked at Anonymous Investments full-time for more than 5 years"

Notice there are 3 different patterns.There could be a number before year(s)(i.e. 3 years) or a letter before year (i.e. 'a year') or 'more than/less than'. First, we will replace 'a year' as '1 year', then find all 'more than/less than'. If there is 'more than', we will add 0.5 year to the number of years, if there is 'less than', we will minus 0.5 year from the number of years. If no tenure is specified, we will set it to NaN.

3. Extract information of Full-time/Part-time Employee from column Author Year.

For full-time / part-time flags, we will get that information from the "Author Years" column as well. If this information is not specified, we will default it to NaN.

4. Extract Recommended, Positive Outlook, Approves of CEO from columnRecommendation.

Since the column 'Recommendation' consists of 3 pieces of information, we would need to separate them and make each feature as one column.

5. Extract employee location, job title from column Author Title.

As a national company, employees in different states can have different engagement, therefore location information is meaningful. However, we don't need to get the specific city, only the state information is sufficient. As of job titles, most job titles have this string format "Current Employee - Financial Associate", so we need to extract the job title after the "-". However, there are more than 1000 records missing this information, we will fill in the information with "Unknown Title". After that, we are going to remove "senior" and "principal" to get fewer job categories.

In [10]:
# define a function to extract categorical information - round 1

def extract_cat_data(row):
    
    # 1. extract current/former employee flags from'Author Years'
    if not pd.isna(row['Author Years']):
        if "have been working" in row['Author Years']:
            row['Current Employee'] = 1
        elif "I worked at" in row['Author Years']:
            row['Current Employee'] = 0
        else:
            row['Current Employee'] = Np.NaN          
    
    # 2. extract tenure from 'Author Years'
    string_to_number = row["Author Years"].replace("a year", "1 year")  # replace 'a year' with '1 year'
    tenure = re.findall(r'\d+', string_to_number)                       # find the digit in the string
    
    if tenure: 
        row['Tenure'] = int(tenure[0])                 # use the number in the list
        if 'more than' in row["Author Years"]:         
            row['Tenure'] += 0.5                       # add 0.5 year if there is 'more than'
        elif 'less than' in row["Author Years"]:       
            row['Tenure'] -=0.5                        # minus 0.5 year if there is 'less than'
    else:
        row['Tenure'] = np.NaN                         # if no tenure is specified, set to NaN
    
    
     # 3. extract full-time/part-time flags from 'Author Years'
    if 'full-time' in string_to_number or 'full time' in string_to_number:
        row['Full-time'] = 1
    elif 'part-time' in string_to_number or 'part time' in string_to_number:
        row['Full-time'] = 0
    else:
        row['Full-time'] = np.NaN                       # if not specified, set it NaN
 
    
    # 4. extract 'Recommended','Positive Outlook','Approves of CEO' from column'Recommendation' 
    row['Recommended'] = 0
    row['Positive Outlook'] = 0
    row['Approves of CEO'] = 0
    
    if not pd.isna(row['Recommendation']):
        if 'Recommends' in row['Recommendation']:
            row['Recommended'] = 1
        elif "Doesn't Recommend" in row['Recommendation']:
            row['Recommended'] = -1
        
        elif 'Positive Outlook' in row['Recommendation']:
            row['Positive Outlook'] = 1
        elif 'Negative Outlook' in row['Recommendation']:   
            row['Positive Outlook'] = -1
        elif 'Neutral Outlook' in row['Recommendation']: 
            row['Positive Outlook'] = 0
            
        elif 'Approves of CEO' in row['Recommendation']:
            row['Approves of CEO'] = 1
        elif 'Disapproves of CEO' in row['Recommendation']:
            row['Approves of CEO'] = -1
        elif 'No opinion of CEO' in row['Recommendation']:   
            row['Approves of CEO'] = 0

    return row

df_cat_extracted = df.apply(extract_cat_data, axis=1)
In [11]:
# define a function to extract categorical information - round 2

def extract_loc_job(row):
    
    # 1. extract location
    if not pd.isna(row['Author Location']):
        if re.search(r'[A-Z]{2}$',row['Author Location']): 
            # extract the last 2 captical letters as state
            row['State'] = re.search(r'[A-Z]{2}$',row['Author Location'])[0]
        else:
            row['State'] = np.NaN      
    else:
            row['State'] = np.NaN                               
    
    # 2. extract job title
    if pd.notnull(row['Author Title']) and row['Author Title']: 
        if '-'in row['Author Title']:  # author title usually starts like this: "Current Employee - Analyst" 
            row['Job Title'] = row['Author Title'].split("-")[1]  # get the 2nd element after the split 
        else:
            row['Job Title'] = row['Author Title']
    else:
         row['Job Title'] = 'Unknown Title'
    # remove "senior" and "principal" to get fewer job categories 
    # remove the beginning & end spaces
    row['Job Title'] = row['Job Title'].replace('Senior',"").replace('Principal',"").strip() 
    
    return row       
In [12]:
df_loc_job_filled = df_cat_extracted.apply(extract_loc_job,axis=1) 
In [13]:
# drop these columns since useful information is already extracted from them
df_cleaned = df_loc_job_filled.drop(columns = ['Recommendation', 'Author Title', 'Author Years', 'Author Location'])

Reorder Columns

Rearrange columns to make categorical columns together and rating columns together.

In [14]:
# reorder the columns

df_cleaned = df_cleaned[['Comment Datetime', 'State', 'Job Title','Tenure','Current Employee','Full-time',
                          'Summary','Pro','Con','Recommended', 'Positive Outlook','Approves of CEO',
                          'Overall Rating','Career Opportunities','Compensation and Benefits',
                          'Work/Life Balance','Senior Management','Culture & Values']]
In [15]:
# set 'Comment Datetime' as index so we can easily plot overall rating overtime later on

df_cleaned.set_index('Comment Datetime',inplace=True)
In [16]:
# sort dataframe by index

df_cleaned = df_cleaned.sort_index()
df_cleaned.head(1)
Out[16]:
State Job Title Tenure Current Employee Full-time Summary Pro Con Recommended Positive Outlook Approves of CEO Overall Rating Career Opportunities Compensation and Benefits Work/Life Balance Senior Management Culture & Values
Comment Datetime
2008-06-11 MA Director NaN 1 NaN "very political" privately owned company that can look at long ... risk averse management that does not know how ... -1 0 0 2.0 2.0 2.5 3.5 2.5 NaN
In [17]:
# check missing values one more time

df_cleaned.isnull().sum()
Out[17]:
State                        3389
Job Title                       0
Tenure                       2847
Current Employee                0
Full-time                    2006
Summary                         0
Pro                             0
Con                             0
Recommended                     0
Positive Outlook                0
Approves of CEO                 0
Overall Rating                  0
Career Opportunities          697
Compensation and Benefits     689
Work/Life Balance             695
Senior Management             733
Culture & Values             1515
dtype: int64

As mentioned earlier, we are going to leave the missing values without interpolation. Nex, we will do some exploratory analysis on numerical rating data.

2. Exploratory Data Analysis

In [18]:
df_cleaned.describe()
Out[18]:
Tenure Current Employee Full-time Recommended Positive Outlook Approves of CEO Overall Rating Career Opportunities Compensation and Benefits Work/Life Balance Senior Management Culture & Values
count 3539.000000 6386.00000 4380.000000 6386.000000 6386.000000 6386.000000 6386.000000 5689.000000 5697.000000 5691.000000 5653.000000 4871.000000
mean 4.409296 0.56248 0.990411 0.424679 0.003288 0.015346 3.712809 3.427140 3.832631 3.724653 3.226605 3.845001
std 3.450280 0.49612 0.097464 0.790138 0.081998 0.158545 1.110729 1.177508 1.013139 1.145637 1.259235 1.225665
min 0.500000 0.00000 0.000000 -1.000000 -1.000000 -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 1.500000 0.00000 1.000000 0.000000 0.000000 0.000000 3.000000 3.000000 3.000000 3.000000 2.000000 3.000000
50% 3.500000 1.00000 1.000000 1.000000 0.000000 0.000000 4.000000 3.500000 4.000000 4.000000 3.000000 4.000000
75% 5.500000 1.00000 1.000000 1.000000 0.000000 0.000000 5.000000 4.000000 5.000000 5.000000 4.000000 5.000000
max 10.500000 1.00000 1.000000 1.000000 1.000000 1.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

Boxplot of all ratings

In [19]:
# plot all ratings using boxplots

column_list = ['Overall Rating','Career Opportunities','Compensation and Benefits',
               'Work/Life Balance','Senior Management','Culture & Values']

figure, ax = plt.subplots(1,6,figsize=(12,5))  

for column, curr_ax in zip(column_list, ax.ravel()):  # use ax.ravel() to flatten ax(2 by 3) in order to zip
    curr_ax.boxplot(df_cleaned[column].dropna())      # drop those NaN values    
    curr_ax.set_title(f'{column}')

plt.tight_layout()
plt.show() 

On a scale of 1-5, this company's median overall rating is 4 which is pretty good. Breaking down to 5 categories, employees rated the highest in compensation & benefits, work/life balance and culture & values. Senior management rating's median value is 3, which is the lowest, and career opportunities'median rating is 3.5. Culture & Values has a median rating as 4. There are a few outliers in career opportunities and culture & values.

Lineplot of overall ratings over the years

In [20]:
# plot overall rating over the years

from datetime import datetime
fig,ax = plt.subplots(figsize=(10,5))

x = sorted(df_cleaned.index.year.unique())
y = df_cleaned.groupby(df_cleaned.index.year)['Overall Rating'].mean()

ax.plot(x, y, color='blue', marker = 'o',label='Overall Rating')
ax.set_title('Overall Rating vs. Year', fontsize=20)
ax.set_xlabel('Year')
ax.set_ylabel('Overall rating')
# plt.savefig('overall_rating_vs_year.png')

plt.show()                                    

There is a clear trend that the employee satisfaction has been increasing since 2008 with some small dips in 2011, 2013 and 2017.

Bar plot of overall recommendations

In [21]:
# plot overall recommendations

rec_count = df_cleaned['Recommended'].value_counts(normalize=True)

fig, ax = plt.subplots()
ax.bar(['Recommend', 'Unknown','Not Recommend'], rec_count, color=['tab:olive','tab:orange','tab:red'])
ax.set_title('Recommend Or Not?', fontsize=20)
ax.set_ylabel('Frenquency')
# plt.savefig('recommend_or_not.png')

plt.show()

Overall, 60% of employees recommend this company, 20% do not recommend it, another 20% do not answer this question.

Bar plots of sub-categories' ratings

In [22]:
# plot sub categories average ratings

column_list = ['Overall Rating','Career Opportunities','Compensation and Benefits',
               'Work/Life Balance','Senior Management','Culture & Values']

sub_ratings = df_cleaned[column_list].mean()
colors1=['tab:cyan','tab:orange','tab:red','tab:pink','tab:purple','tab:olive']

figure, ax = plt.subplots(figsize=(12,5))  
ax.bar(sub_ratings.index, sub_ratings, color=colors1)

ax.set_title ('Sub-categories Average Ratings', fontsize=20)
ax.set_ylabel ('Average Ratings')
ax.set_xticklabels(sub_ratings.index,rotation=45)
# figure.savefig('subcategory_rating.png', bbox_inches = 'tight')

plt.show() 

Bar plots of high/low rating states

In [23]:
# plot the 10 states with top ratings 

top_10 = df_cleaned.groupby('State')['Overall Rating'].mean().nlargest(10)
colors2 = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 
           'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan']

fig,ax = plt.subplots(figsize=(10,5))
ax.bar(top_10.index,top_10, color=colors2)
ax.set_title('Top 10 States with Highest Ratings', fontsize=20)
ax.set_xlabel('States')
ax.set_ylabel('Overall rating')
# plt.savefig('top10_states_high_rate.png')

plt.show()
In [24]:
# plot the 5 states with lowest ratings 

lowest_5 = df_cleaned.groupby('State')['Overall Rating'].mean().nsmallest(5)
colors3 = ['tab:blue','tab:pink','tab:cyan','tab:orange','tab:purple']

fig,ax = plt.subplots(figsize=(10,5))
ax.bar(lowest_5.index,lowest_5, color=colors3)
ax.set_title('Top 5 States with Lowest Ratings',fontsize=20)
ax.set_xlabel('States')
ax.set_ylabel('Overall rating')
# plt.savefig('top5_states_low_rate.png')

plt.show()
In [25]:
# check how many employees submitted feedback in NE since its rating is so low

len(df_cleaned.loc[df_cleaned.State=='NE'])
Out[25]:
1

There is actually only 1 employee in NE submitted feedback to Glassdoor.

Plot mean overall ratings by states

In [26]:
total_states = df_cleaned.State.unique()
rate_state = df_cleaned.groupby('State')['Overall Rating'].mean()
In [27]:
import plotly.graph_objects as go
import numpy as np

# make the size exponential on the rating number so it is easier to see the difference
rating_size = np.exp(rate_state.values)

# plot the data in a map
fig = go.Figure(data=go.Scattergeo(
        locationmode = 'USA-states',
        locations = rate_state.index,
        text = rating_size,
        marker = dict(
            size = rating_size * 5,   # multiple by 5 to augment the dots
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'), 
        hovertext= rate_state.round(2)))

fig.update_traces(texttemplate='%{text:.3s}')


fig.update_layout(
        title_text = 'Overall Mean Rating by State',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )

fig.show()

If the above figure does not show, you may go to this website to view the interactive figure: teresanan.github.io

Looks like employees in KS and WI have the highest employee satisfaction and employees in NE (only 1 submitted review) are the unhappiest. Of note, the location data indicates 'Author Location', we assumed these employees are based locally in the company's branches in that particular state. If there are substantial remote employees, this conclusion may not be accurate.

Bar plots of overall ratings by employee type

In [28]:
# plot overall rating by full-time/part-time employee

rate_by_fte = df_cleaned.groupby('Full-time')['Overall Rating'].mean()

fig, ax = plt.subplots()
ax.bar(['Part_Time', 'Full_Time'], rate_by_fte,color=['tab:pink','tab:cyan'])
ax.set_title('Overall Ratings by Full/Part-time Employee', fontsize=20)
ax.set_ylabel('Overall rating')
# plt.savefig('rating_by_fulltime_parttime.png')

plt.show()

Part-time employees have higher satisfaction than full-time employees.

In [29]:
# plot overall rating by current/former employee

rate_by_emp_type = df_cleaned.groupby('Current Employee')['Overall Rating'].mean()

fig, ax = plt.subplots()
ax.bar(['Former Employee', 'Current Employee'], rate_by_emp_type,color=['tab:pink','tab:cyan'])
ax.set_title('Overall Ratings by Current/Former Employee', fontsize=20)
ax.set_ylabel('Overall rating')
# plt.savefig('rating_by_current_former.png')

plt.show()

Former employees give relatively lower ratings than current employees, this is understandable because unhappy employees tend to leave, happy employees choose to stay.

Bar plots of most frequent reviewer job titles

In [30]:
# plot the most frenquest reviewer job titles

top_20_job = df_cleaned['Job Title'].value_counts().nlargest(20)

plt.figure(figsize=(12,7))
sns.countplot(y='Job Title',data=df_cleaned, order=top_20_job.index)
sns.set_context('talk')
plt.title('Most Frequest Employee Job Titles', fontsize=20)
# figure.savefig('most_freq_job_title.png',bbox_inches = 'tight')

plt.show()

Most employees choose not to disclose their job titles, for those who disclose their titles, they are Software Engineers and Financial Representatives.

Bar plots of top job families' overall rating

In [31]:
top_20job_review = df_cleaned.loc[df_cleaned['Job Title'].isin(top_20_job.index), ['Job Title','Overall Rating']]
top_20job_mean_review = top_20job_review.groupby('Job Title')['Overall Rating'].mean().sort_values(ascending = False)
In [32]:
# plot the reviewers in the top 20 job families' overall rating

plt.figure(figsize=(12,7))
sns.barplot(y=top_20job_mean_review.index, x=top_20job_mean_review, hue_order=top_20job_mean_review)
sns.set_context('talk')
plt.title('Overall Rating by Job Family', fontsize=20)
# figure.savefig('rating_by_job_family.png',bbox_inches = 'tight')

plt.show()

Among the top most frequent job titles, Financial Associate group has the lowest satisfaction and Relationship Managers give the highest ratings.

In [33]:
df_cleaned.to_csv('df_cleaned1.csv')

Next, we will conduct sentiment analysis and topic modeling in another 2 separate Jupyter Notebooks.