This dataset contains a large company's reviews on Glassdoor. We web-scraped the data from Glassdoor with its written permission. There are 6,675 reviews with numerical ratings and text comments. To keep this analysis general, the company name has been replaced with "Anonymous".
The goal of this analysis is to analyze anonymous employee reviews submitted on Glassdoor, in the hope of helping employers gain real insights into employee engagement. The analysis will answer these questions: What do employees like and dislike about their company? Has the company's reputation gotten better or worse in recent years? Which job families have the highest and lowest satisfaction rates? What keywords do people use when describing this company? What can this company do to improve employee engagement?
This is part 1 of the Glassdoor company review analysis project. In this part, we extract categorical information from the existing data, turn unstructured data into structured data, clean the data, and explore the numerical ratings with visualizations.
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
# use the first column as index, parse 'Comment Datetime' to datetime object
df = pd.read_csv('company1_review.csv', index_col=[0], parse_dates=['Comment Datetime'])
df.head()
df.shape
df.columns
df.info()
# remove duplicate records
df.drop_duplicates(inplace = True)
len(df)
# check missing values
df.isnull().sum()
Note: There are 8 columns with missing values, including numerical rating columns and categorical string columns. We could have filled the missing ratings with each column's median value; however, to keep this analysis accurate, our strategy is to leave NaN numerical ratings as they are and fill the missing categorical information with 'Unknown', so we can distinguish it from the originally existing data.
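As a minimal sketch of this strategy (the toy frame and its values are illustrative, not the actual dataset):

```python
import pandas as pd
import numpy as np

# toy frame standing in for the real review data (values are made up)
df_toy = pd.DataFrame({
    'Overall Rating': [5.0, np.nan, 3.0],
    'Author Location': ['Austin, TX', np.nan, 'Madison, WI'],
})

# leave NaN ratings untouched; fill missing categorical strings with 'Unknown'
cat_cols = df_toy.select_dtypes(exclude=['float64']).columns
df_toy[cat_cols] = df_toy[cat_cols].fillna('Unknown')
```

This keeps the NaN ratings distinguishable for later aggregation while making the categorical columns safe to group by.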
# sort the dataframe so all string columns are listed together
rating_columns = df.select_dtypes(include = ['float64'])
string_columns = df.select_dtypes(exclude = ['float64'])
df_column_sorted = pd.concat([string_columns, rating_columns], axis = 1)
df_column_sorted.head()
Let's take a look at a few text samples.
# print 3 sample records that contain text
for i in range(4, 7):
    print('Author Years:', df['Author Years'][i])
    print('Author Title:', df['Author Title'][i])
    print('Summary:', df['Summary'][i])
    print('Pro:', df['Pro'][i])
    print('Con:', df['Con'][i])
    print('====================')
1. Extract Current/Former Employee information from column 'Author Years'
Of note, the column "Author Title" contains "Current Employee" or "Former Employee" information, but it has more than 2,000 null values. The "Author Years" column has no null values, and its text carries the same current/former flag: the string always starts with either "I have been working" or "I worked at". Hence, we can use "Author Years" to extract the employee type.
2. Extract employee Tenure from column 'Author Years'
Below are some sample data:
Notice there are three different patterns: a number before "year(s)" (e.g. "3 years"), the article "a" before "year" (e.g. "a year"), or a "more than"/"less than" qualifier. First, we replace 'a year' with '1 year', then look for 'more than'/'less than': for 'more than' we add 0.5 to the number of years, and for 'less than' we subtract 0.5. If no tenure is specified, we set it to NaN.
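The rules above can be packaged into a small standalone parser (the sample strings are made up to match the three patterns described):

```python
import re
import numpy as np

def parse_tenure(text):
    """Parse tenure in years from an 'Author Years'-style string."""
    text = text.replace('a year', '1 year')   # pattern 2: 'a year' -> '1 year'
    digits = re.findall(r'\d+', text)         # pattern 1: a number before 'year(s)'
    if not digits:
        return np.nan                         # no tenure specified
    tenure = int(digits[0])
    if 'more than' in text:                   # pattern 3: qualifiers adjust by 0.5
        tenure += 0.5
    elif 'less than' in text:
        tenure -= 0.5
    return tenure

parse_tenure('I have been working at Anonymous full-time for more than 3 years')  # 3.5
parse_tenure('I worked at Anonymous part-time for less than a year')              # 0.5
```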
3. Extract Full-time/Part-time information from column 'Author Years'
For full-time / part-time flags, we will get that information from the "Author Years" column as well. If this information is not specified, we will default it to NaN.
4. Extract Recommended, Positive Outlook, and Approves of CEO from column 'Recommendation'
Since the column 'Recommendation' bundles three pieces of information, we need to separate them into three feature columns.
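A vectorized sketch of this separation, as an alternative to the row-wise approach used later in this notebook (the sample strings are made up, following the phrasings quoted in this section):

```python
import pandas as pd

# toy 'Recommendation' strings covering the phrasings described above
rec = pd.Series(["Recommends, Positive Outlook, Approves of CEO",
                 "Doesn't Recommend, Negative Outlook",
                 None])

def flag(s, pos, neg):
    # +1 where the positive phrase occurs, -1 for the negative phrase, else 0
    return (s.str.contains(pos, regex=False).fillna(False).astype(int)
            - s.str.contains(neg, regex=False).fillna(False).astype(int))

recommended = flag(rec, 'Recommends', "Doesn't Recommend")
outlook = flag(rec, 'Positive Outlook', 'Negative Outlook')
```

Because each flag is checked independently, a single string like "Recommends, Positive Outlook, Approves of CEO" correctly populates all three columns.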
5. Extract employee location and job title from columns 'Author Location' and 'Author Title'
As a national company, employees in different states can have different engagement, so location information is meaningful. However, we don't need the specific city; the state information is sufficient. As for job titles, most follow the string format "Current Employee - Financial Associate", so we extract the job title after the "-". However, more than 1,000 records are missing this information, so we fill those in with "Unknown Title". After that, we remove "Senior" and "Principal" to get fewer job categories.
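In miniature, the two extractions look like this (the sample location and title are hypothetical, following the formats described above):

```python
import re

loc = 'Milwaukee, WI'
title = 'Current Employee - Senior Financial Associate'

# the state abbreviation is the last two capital letters of the location
state = re.search(r'[A-Z]{2}$', loc)[0]

# take the part after the dash, then drop 'Senior'/'Principal' and extra spaces
job = title.split('-')[1].replace('Senior', '').replace('Principal', '').strip()
```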
# define a function to extract categorical information - round 1
def extract_cat_data(row):
    # 1. extract current/former employee flag from 'Author Years'
    if not pd.isna(row['Author Years']):
        if 'have been working' in row['Author Years']:
            row['Current Employee'] = 1
        elif 'I worked at' in row['Author Years']:
            row['Current Employee'] = 0
        else:
            row['Current Employee'] = np.nan

    # 2. extract tenure from 'Author Years'
    string_to_number = row['Author Years'].replace('a year', '1 year')  # replace 'a year' with '1 year'
    tenure = re.findall(r'\d+', string_to_number)  # find the digits in the string
    if tenure:
        row['Tenure'] = int(tenure[0])  # use the first number found
        if 'more than' in row['Author Years']:
            row['Tenure'] += 0.5  # add 0.5 year for 'more than'
        elif 'less than' in row['Author Years']:
            row['Tenure'] -= 0.5  # subtract 0.5 year for 'less than'
    else:
        row['Tenure'] = np.nan  # if no tenure is specified, set to NaN

    # 3. extract full-time/part-time flag from 'Author Years'
    if 'full-time' in string_to_number or 'full time' in string_to_number:
        row['Full-time'] = 1
    elif 'part-time' in string_to_number or 'part time' in string_to_number:
        row['Full-time'] = 0
    else:
        row['Full-time'] = np.nan  # if not specified, set to NaN

    # 4. extract 'Recommended', 'Positive Outlook', 'Approves of CEO' from 'Recommendation'
    # use three independent if/elif groups (not one chain), because a single
    # 'Recommendation' string can carry all three pieces of information;
    # 'Neutral Outlook' and 'No opinion of CEO' map to the default 0
    row['Recommended'] = 0
    row['Positive Outlook'] = 0
    row['Approves of CEO'] = 0
    if not pd.isna(row['Recommendation']):
        if 'Recommends' in row['Recommendation']:
            row['Recommended'] = 1
        elif "Doesn't Recommend" in row['Recommendation']:
            row['Recommended'] = -1
        if 'Positive Outlook' in row['Recommendation']:
            row['Positive Outlook'] = 1
        elif 'Negative Outlook' in row['Recommendation']:
            row['Positive Outlook'] = -1
        if 'Approves of CEO' in row['Recommendation']:
            row['Approves of CEO'] = 1
        elif 'Disapproves of CEO' in row['Recommendation']:
            row['Approves of CEO'] = -1
    return row

df_cat_extracted = df.apply(extract_cat_data, axis=1)
# define a function to extract categorical information - round 2
def extract_loc_job(row):
    # 1. extract location: the last 2 capital letters are the state abbreviation
    if not pd.isna(row['Author Location']):
        state_match = re.search(r'[A-Z]{2}$', row['Author Location'])
        row['State'] = state_match[0] if state_match else np.nan
    else:
        row['State'] = np.nan

    # 2. extract job title
    if pd.notnull(row['Author Title']) and row['Author Title']:
        if '-' in row['Author Title']:  # author title usually looks like "Current Employee - Analyst"
            row['Job Title'] = row['Author Title'].split('-')[1]  # take the part after the dash
        else:
            row['Job Title'] = row['Author Title']
    else:
        row['Job Title'] = 'Unknown Title'

    # remove "Senior" and "Principal" to get fewer job categories,
    # then strip leading/trailing spaces
    row['Job Title'] = row['Job Title'].replace('Senior', '').replace('Principal', '').strip()
    return row

df_loc_job_filled = df_cat_extracted.apply(extract_loc_job, axis=1)
# drop these columns since useful information is already extracted from them
df_cleaned = df_loc_job_filled.drop(columns = ['Recommendation', 'Author Title', 'Author Years', 'Author Location'])
Rearrange the columns so that categorical columns and rating columns are grouped together.
# reorder the columns
df_cleaned = df_cleaned[['Comment Datetime', 'State', 'Job Title','Tenure','Current Employee','Full-time',
'Summary','Pro','Con','Recommended', 'Positive Outlook','Approves of CEO',
'Overall Rating','Career Opportunities','Compensation and Benefits',
'Work/Life Balance','Senior Management','Culture & Values']]
# set 'Comment Datetime' as index so we can easily plot overall rating over time later on
df_cleaned.set_index('Comment Datetime',inplace=True)
# sort dataframe by index
df_cleaned = df_cleaned.sort_index()
df_cleaned.head(1)
# check missing values one more time
df_cleaned.isnull().sum()
As mentioned earlier, we leave the missing values without interpolation. Next, we will do some exploratory analysis on the numerical rating data.
df_cleaned.describe()
# plot all ratings using boxplots
column_list = ['Overall Rating','Career Opportunities','Compensation and Benefits',
'Work/Life Balance','Senior Management','Culture & Values']
figure, ax = plt.subplots(1, 6, figsize=(12,5))
for column, curr_ax in zip(column_list, ax.ravel()):  # ax is already 1-D for a 1x6 grid; ravel() keeps this robust if the grid changes
curr_ax.boxplot(df_cleaned[column].dropna()) # drop those NaN values
curr_ax.set_title(f'{column}')
plt.tight_layout()
plt.show()
On a scale of 1-5, this company's median overall rating is 4, which is pretty good. Breaking it down into five categories, employees rate compensation & benefits, work/life balance, and culture & values the highest. Senior management has the lowest median rating at 3, career opportunities' median rating is 3.5, and culture & values has a median rating of 4. There are a few outliers in career opportunities and culture & values.
# plot overall rating over the years
fig,ax = plt.subplots(figsize=(10,5))
x = sorted(df_cleaned.index.year.unique())
y = df_cleaned.groupby(df_cleaned.index.year)['Overall Rating'].mean()
ax.plot(x, y, color='blue', marker = 'o',label='Overall Rating')
ax.set_title('Overall Rating vs. Year', fontsize=20)
ax.set_xlabel('Year')
ax.set_ylabel('Overall rating')
# plt.savefig('overall_rating_vs_year.png')
plt.show()
There is a clear trend: employee satisfaction has been increasing since 2008, with small dips in 2011, 2013 and 2017.
# plot overall recommendations
# reindex to pin the category order, since value_counts sorts by frequency
rec_count = df_cleaned['Recommended'].value_counts(normalize=True).reindex([1, 0, -1])
fig, ax = plt.subplots()
ax.bar(['Recommend', 'Unknown', 'Not Recommend'], rec_count, color=['tab:olive','tab:orange','tab:red'])
ax.set_title('Recommend Or Not?', fontsize=20)
ax.set_ylabel('Frequency')
# plt.savefig('recommend_or_not.png')
plt.show()
Overall, 60% of employees recommend this company, 20% do not recommend it, another 20% do not answer this question.
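One caveat worth flagging: `value_counts` sorts by frequency, so pairing its output with hard-coded bar labels can silently mismatch when the frequency order changes. Reindexing pins the order (toy data below):

```python
import pandas as pd

# toy flags: 1 = recommend, 0 = no answer, -1 = not recommend
rec = pd.Series([1, 1, 1, 0, -1])

# pin the category order before pairing with labels like
# ['Recommend', 'Unknown', 'Not Recommend']
rec_count = rec.value_counts(normalize=True).reindex([1, 0, -1], fill_value=0)
```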
# plot sub categories average ratings
column_list = ['Overall Rating','Career Opportunities','Compensation and Benefits',
'Work/Life Balance','Senior Management','Culture & Values']
sub_ratings = df_cleaned[column_list].mean()
colors1=['tab:cyan','tab:orange','tab:red','tab:pink','tab:purple','tab:olive']
figure, ax = plt.subplots(figsize=(12,5))
ax.bar(sub_ratings.index, sub_ratings, color=colors1)
ax.set_title ('Sub-categories Average Ratings', fontsize=20)
ax.set_ylabel ('Average Ratings')
ax.tick_params(axis='x', rotation=45)
# figure.savefig('subcategory_rating.png', bbox_inches = 'tight')
plt.show()
# plot the 10 states with top ratings
top_10 = df_cleaned.groupby('State')['Overall Rating'].mean().nlargest(10)
colors2 = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple',
'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan']
fig,ax = plt.subplots(figsize=(10,5))
ax.bar(top_10.index,top_10, color=colors2)
ax.set_title('Top 10 States with Highest Ratings', fontsize=20)
ax.set_xlabel('States')
ax.set_ylabel('Overall rating')
# plt.savefig('top10_states_high_rate.png')
plt.show()
# plot the 5 states with lowest ratings
lowest_5 = df_cleaned.groupby('State')['Overall Rating'].mean().nsmallest(5)
colors3 = ['tab:blue','tab:pink','tab:cyan','tab:orange','tab:purple']
fig,ax = plt.subplots(figsize=(10,5))
ax.bar(lowest_5.index,lowest_5, color=colors3)
ax.set_title('Top 5 States with Lowest Ratings',fontsize=20)
ax.set_xlabel('States')
ax.set_ylabel('Overall rating')
# plt.savefig('top5_states_low_rate.png')
plt.show()
# check how many employees submitted feedback in NE since its rating is so low
len(df_cleaned.loc[df_cleaned.State=='NE'])
Only one employee in NE actually submitted feedback on Glassdoor.
total_states = df_cleaned.State.unique()
rate_state = df_cleaned.groupby('State')['Overall Rating'].mean()
import plotly.graph_objects as go
# exponentiate the mean ratings so small differences become easier to see on the map
rating_size = np.exp(rate_state.values)
# plot the data in a map
fig = go.Figure(data=go.Scattergeo(
locationmode = 'USA-states',
locations = rate_state.index,
text = rating_size,
marker = dict(
size = rating_size * 5, # multiply by 5 to enlarge the dots
line_color='rgb(40,40,40)',
line_width=0.5,
sizemode = 'area'),
hovertext= rate_state.round(2)))
fig.update_traces(texttemplate='%{text:.3s}')
fig.update_layout(
title_text = 'Overall Mean Rating by State',
showlegend = True,
geo = dict(
scope = 'usa',
landcolor = 'rgb(217, 217, 217)',
)
)
fig.show()
If the figure above does not render, you can view the interactive version at teresanan.github.io
It looks like employees in KS and WI have the highest satisfaction, and employees in NE (only 1 submitted review) are the unhappiest. Of note, the location data reflects 'Author Location'; we assume these employees are based locally in the company's branches in that state. If there are substantial remote employees, this conclusion may not be accurate.
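On the marker sizing used for the map: state means cluster in a narrow band, so linear sizing makes the dots nearly indistinguishable. Exponentiating stretches the differences (illustrative numbers):

```python
import numpy as np

ratings = np.array([3.0, 4.0])           # illustrative state mean ratings
sizes = np.exp(ratings)                  # e^3 ~ 20.1, e^4 ~ 54.6

linear_ratio = ratings[1] / ratings[0]   # ~1.33, hard to see as marker area
exp_ratio = sizes[1] / sizes[0]          # e ~ 2.72, much more visible
```

The trade-off is that marker area no longer scales linearly with the rating, which is why the hover text still shows the raw rounded means.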
# plot overall rating by full-time/part-time employee
rate_by_fte = df_cleaned.groupby('Full-time')['Overall Rating'].mean()
fig, ax = plt.subplots()
ax.bar(['Part-time', 'Full-time'], rate_by_fte, color=['tab:pink','tab:cyan'])
ax.set_title('Overall Ratings by Full/Part-time Employee', fontsize=20)
ax.set_ylabel('Overall rating')
# plt.savefig('rating_by_fulltime_parttime.png')
plt.show()
Part-time employees have higher satisfaction than full-time employees.
# plot overall rating by current/former employee
rate_by_emp_type = df_cleaned.groupby('Current Employee')['Overall Rating'].mean()
fig, ax = plt.subplots()
ax.bar(['Former Employee', 'Current Employee'], rate_by_emp_type,color=['tab:pink','tab:cyan'])
ax.set_title('Overall Ratings by Current/Former Employee', fontsize=20)
ax.set_ylabel('Overall rating')
# plt.savefig('rating_by_current_former.png')
plt.show()
Former employees give relatively lower ratings than current employees. This is understandable: unhappy employees tend to leave, while happy employees choose to stay.
# plot the most frequent reviewer job titles
top_20_job = df_cleaned['Job Title'].value_counts().nlargest(20)
plt.figure(figsize=(12,7))
sns.countplot(y='Job Title',data=df_cleaned, order=top_20_job.index)
sns.set_context('talk')
plt.title('Most Frequent Employee Job Titles', fontsize=20)
# figure.savefig('most_freq_job_title.png',bbox_inches = 'tight')
plt.show()
Most employees choose not to disclose their job titles; among those who do, Software Engineers and Financial Representatives are the most common.
top_20job_review = df_cleaned.loc[df_cleaned['Job Title'].isin(top_20_job.index), ['Job Title','Overall Rating']]
top_20job_mean_review = top_20job_review.groupby('Job Title')['Overall Rating'].mean().sort_values(ascending = False)
# plot the reviewers in the top 20 job families' overall rating
plt.figure(figsize=(12,7))
sns.barplot(y=top_20job_mean_review.index, x=top_20job_mean_review)
sns.set_context('talk')
plt.title('Overall Rating by Job Family', fontsize=20)
# figure.savefig('rating_by_job_family.png',bbox_inches = 'tight')
plt.show()
Among the most frequent job titles, the Financial Associate group has the lowest satisfaction and Relationship Managers give the highest ratings.
df_cleaned.to_csv('df_cleaned1.csv')
Next, we will conduct sentiment analysis and topic modeling in two separate Jupyter Notebooks.