Job Vs Salary Research Project

Categories

news, analysis, knn, machine learning, statistics, mathematics, research
Author

Kevin Swenson

Published

April 16, 2022

Intro

The last few years of politics in the US have given a lot of airtime to the discussion of fair wages and who is “overpaid in the workplace”. Recently, a discussion on UAF Confessions argued that professors need better pay and that admins are overpaid. What started as a Facebook meme war made me wonder whether there is an issue with pay in the UA system and whether we can use statistics to detect it. This is an exploratory report on UA pay data and on whether there is a meaningful link between job category and salary.

Data

# Read the data in
data <- read.csv("fy22_salary_data.csv", header = TRUE)
head(data)
  Campus              Last     First                 School...Department
1    UAA              KING     HOLLY   BEI Business Enterprise Institute
2    UAA BRENNAN TOLLEFSON  KATHLEEN             UAA School of Education
3    UAA           KEMPKER    LOUISE             UAA School of Education
4    UAA            REDMON     SARAH    COH Div of Justice & Social Work
5    UAA           ROBERTS      JOEL          VCSA Auxiliary Enterprises
6    UAA             ADAME ELIZABETH COH Div of Clinical Health Sciences
  Job.Category Position Pay.scale  Salary Salary.Type Part.or.Full Leave
1        ADMIN        4     00_75 18397.6          NR           PT 0.512
2        ADMIN        5     00_75 37211.2          NR           FT 0.512
3        ADMIN       15     00_75 41059.2          NR           FT 0.512
4        ADMIN        9     00_75 38667.2          NR           FT 0.512
5        ADMIN       28     00_75 46696.0          NR           FT 0.512
6        ADMIN       17     00_75 41849.6          NR           FT 0.512

I obtained a dataset of current salaries from Lily Chohen’s website and exchanged a few emails with the UA system office to ask some clarifying questions. The initial dataset was ~6200 rows with 11 columns. One of the first things I decided to do was throw out some of the columns, such as B and C, since they are names. Something that merits more investigation is whether pay is fair between campuses (column A), and perhaps whether pay is fair between schools (column D); neither of those questions will be answered in this report.

The next step was to label the columns so I had some context for them. I threw out columns A (Campus Abbreviation), B (Last Name), C (First Name), D (School), F (Position), and K (Leave Accrual). The trimmed data looks like:

# Trim names and other unneeded columns out of the data set
data <- data[,5:10]  # keep Job.Category through Part.or.Full; drop names, campus, school, leave
data <- data[,-2]    # drop Position
head(data)
  Job.Category Pay.scale  Salary Salary.Type Part.or.Full
1        ADMIN     00_75 18397.6          NR           PT
2        ADMIN     00_75 37211.2          NR           FT
3        ADMIN     00_75 41059.2          NR           FT
4        ADMIN     00_75 38667.2          NR           FT
5        ADMIN     00_75 46696.0          NR           FT
6        ADMIN     00_75 41849.6          NR           FT
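As a side note, the same trim can be done by column name instead of position, which is a bit more readable; this is an equivalent alternative using the column names shown above, not what I originally ran:

# Equivalent trim: select the kept columns by name instead of index
keep <- c("Job.Category", "Pay.scale", "Salary", "Salary.Type", "Part.or.Full")
data <- data[, keep]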

Next was the hardest part: cleaning up the dataset. I had a script read through all the job titles and pull them out. I then had to clean up spelling errors and consolidate graded titles (HVAC Tech 1, Tech 2, etc. were collapsed to HVAC Tech). The next step was to reduce the job titles to a few broad categories: Student, Faculty, Admin, and Service. I did have to look up what some of these jobs did, since I didn’t know off the top of my head. The general definitions were: Student jobs are jobs done by students and paid a student wage; Faculty are any jobs that teach; Admins are managers and management support staff; Service are any jobs that provide a service to the other jobs (janitorial, coaches, broadcast media). The title consolidation looked roughly like the sketch below; after that, I turned everything into factors.
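Here is a minimal sketch of that consolidation step; the title strings below are hypothetical examples, not rows from the real data:

# Strip trailing grade numbers so "HVAC TECH 1" and "HVAC TECH 2" collapse
# to "HVAC TECH" (illustrative sketch with made-up titles)
titles <- c("HVAC TECH 1", "HVAC TECH 2", "JANITOR 2", "PROFESSOR")
unique(sub("\\s+\\d+$", "", titles))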

# Encode the categorical columns as numeric factor levels
# (levels are assigned alphabetically, so ADMIN = 1, FACULTY = 2, etc.)
data$Job.Category <- as.numeric(as.factor(data$Job.Category))
data$Pay.scale <- as.numeric(as.factor(data$Pay.scale))
data$Salary.Type <- as.numeric(as.factor(data$Salary.Type))
data$Part.or.Full <- as.numeric(as.factor(data$Part.or.Full))
head(data)
  Job.Category Pay.scale  Salary Salary.Type Part.or.Full
1            1        10 18397.6          11            2
2            1        10 37211.2          11            1
3            1        10 41059.2          11            1
4            1        10 38667.2          11            1
5            1        10 46696.0          11            1
6            1        10 41849.6          11            1
The numeric codes for Job.Category are:

  1. Admin: Managers and any department role equivalent to manager
  2. Faculty: Anyone who is teaching
  3. Service: Anyone who services another job (HVAC tech, janitorial, police, retail)
  4. Student: Any job done by a student (Wood Center front desk, student firefighter)
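For more readable tables later on, the numeric codes can be mapped back to their names. This is a small convenience I am sketching here, not something the original pipeline did:

# Named lookup from numeric code back to category label
job_labels <- c("1" = "Admin", "2" = "Faculty", "3" = "Service", "4" = "Student")
table(job_labels[as.character(data$Job.Category)])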

Methods and Exploration

One of the first things I like to do when exploring data is look at a histogram to see if there is any obvious trend. Since this dataset is a snapshot, jobs that pay little or $0 are heavily represented, and the outlier salaries are hard to see because of how the vertical axis is scaled. Even if I adjust the bin size, the extreme high end still doesn’t show up well: those jobs are highly paid, but there aren’t many of them above the $200k mark.

hist(data$Salary, breaks = 30)
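One way to make that long right tail visible (an extra view, not one of the original plots) is to put salary on a log scale, dropping the zero-pay rows first since log of zero is undefined:

# Log-scale view of salaries; zero-salary rows are excluded
hist(log10(data$Salary[data$Salary > 0]), breaks = 30,
     xlab = "log10(Salary)", main = "Salaries on a log scale")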

Here is a pairs plot to see if we can gain any insight.

pairs(data)

This is perhaps the most interesting plot in the set. I encoded job category as a numeric factor so I could use it in various models; the numeric code for each job type is listed above.

plot(data$Job.Category ~ data$Salary)

There does seem to be some clustering in how salaries differ across job categories. Maybe we can use a clustering model to find the relationship. I will use k-means and a neural net and see whether one is better than the other at classifying job type from this dataset. There are some more graphs in the R code section at the end of this report; I have only included the useful ones here.
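A boxplot of salary by category (again an addition of mine, not in the original graph set) shows that clustering more directly; at this point Job.Category holds the numeric codes 1 through 4 listed above:

# Salary spread per job category (1 = Admin, 2 = Faculty, 3 = Service, 4 = Student)
boxplot(Salary ~ Job.Category, data = data,
        xlab = "Job category (numeric code)", ylab = "Salary")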

K-means vs Neural Net

# K-means
library(stats)
R2s <- rep(NA, 10)

# Scale the data so the Euclidean distance isn't dominated by the salary column
data_scale <- scale(data)
# Elbow method: compute R^2 (between SS / total SS) for k = 1..10
for(i in 1:10) {
  tmp_kmeans <- kmeans(data_scale[,2:5], centers = i, nstart = 1)
  R2s[i] <- tmp_kmeans$betweenss/tmp_kmeans$totss
}
plot(R2s, type = "l", lwd = 2, xlab = "Number of Centers", ylab = "Between SS / Total SS")

# 4 appears to be the optimal number of centers (the elbow in the plot above),
# so refit k-means with 4 clusters
tmp_kmeans_final <- kmeans(data_scale[,2:5],centers = 4, nstart = 1)
table(data$Job.Category, tmp_kmeans_final$cluster)
   
       1    2    3    4
  1  269  146  655   78
  2    6 1057    0  782
  3  721  196  480  779
  4   60    0    0  988

So k-means did a really bad job of clustering and classifying this data, and I would not accept this model. I understand that groups 2 and 3 are close enough that misclassifying them is fairly understandable, but k-means completely failed to isolate the student category.
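One quick way to put a number on that mismatch (my addition, not part of the original analysis) is cluster purity: give each cluster its majority class, then compute the fraction of rows whose label matches. From the table above this works out to roughly 0.55, which is poor for a four-class problem.

# Cluster purity: fraction of rows whose cluster's majority class matches their label
tab <- table(data$Job.Category, tmp_kmeans_final$cluster)
sum(apply(tab, 2, max)) / sum(tab)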

# Neural Net
library(nnet)
library(caret)
Loading required package: ggplot2
Loading required package: lattice
# Split the data into 80% training / 20% test sets
split <- sample(c(rep(0,0.8 * nrow(data_scale)), rep(1,0.2 * nrow(data))))
test_data <- data[split == 1,]
train_data <- data[split == 0,]
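# An alternative I did not use here: caret's createDataPartition does a
# stratified split that preserves the class proportions (seed is arbitrary).
# set.seed(42)
# idx <- createDataPartition(as.factor(data$Job.Category), p = 0.8, list = FALSE)
# train_data <- data[idx, ]; test_data <- data[-idx, ]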

# tmp_nn <- nnet(Job.Category ~ Pay.scale + Salary.Type + Salary + Part.or.Full, data = data, size = c(10))
# preds <- as.numeric(tmp_nn$fitted.values)
# Multinomial log-linear model via nnet (a single-layer network)
tmp_nn <- multinom(Job.Category ~ ., data = train_data)
# weights:  24 (15 variable)
initial  value 6895.428152 
iter  10 value 5336.135362
iter  20 value 3527.323774
iter  30 value 3437.618370
iter  40 value 3435.329873
final  value 3435.024052 
converged
tmp_nn_preds <- predict(tmp_nn, newdata = test_data)
confusionMatrix(tmp_nn_preds, as.factor(test_data$Job.Category))
Confusion Matrix and Statistics

          Reference
Prediction   1   2   3   4
         1 132   2  51   0
         2  18 355  36   0
         3  95   0 296  29
         4   1   0  65 163

Overall Statistics
                                          
               Accuracy : 0.7611          
                 95% CI : (0.7364, 0.7845)
    No Information Rate : 0.3604          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6708          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.5366   0.9944   0.6607   0.8490
Specificity            0.9468   0.9391   0.8440   0.9372
Pos Pred Value         0.7135   0.8680   0.7048   0.7118
Neg Pred Value         0.8922   0.9976   0.8153   0.9714
Prevalence             0.1979   0.2872   0.3604   0.1545
Detection Rate         0.1062   0.2856   0.2381   0.1311
Detection Prevalence   0.1488   0.3290   0.3379   0.1842
Balanced Accuracy      0.7417   0.9667   0.7524   0.8931

The neural net did a great job with this data. I am less concerned about the Admin and Service jobs (categories 1 and 3) that get misclassified as each other, as those two groups had a lot of overlap.
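Worth noting: multinom fits a multinomial log-linear model, which is effectively a network with no hidden layer. A genuine hidden-layer net via nnet would look like the sketch below; the size and decay values are untested guesses, so treat this as a starting point rather than a result:

# Hedged sketch: an actual hidden-layer network (hyperparameters untuned)
tmp_nn2 <- nnet(as.factor(Job.Category) ~ ., data = train_data,
                size = 10, decay = 0.01, maxit = 500, trace = FALSE)
nn2_preds <- predict(tmp_nn2, newdata = test_data, type = "class")
confusionMatrix(factor(nn2_preds, levels = 1:4), as.factor(test_data$Job.Category))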

Conclusions

It seems that the neural net classified well and k-means didn’t do as well as I’d like. One of my concerns is that when sorting jobs into broader categories, I could have placed some in the admin category that were not administrative, based on bias about their titles. I also could have seen their pay and labeled them as admin because of my internal belief that admins get paid more. I think the most important takeaway from this exploration is that there does seem to be some relationship between pay and job category, but that we need to clearly define what an admin is. If we want to talk about unfair pay, we need to be explicit with our definitions.

I think a good follow-up to this exploration would be to look at the links between campus location and pay, and between school and pay.