library(palmerpenguins)
library(tidymodels)
library(tidyverse)
library(ggplot2)
When learning new things I always like to use Palmer Penguins! It’s a fun dataset, and after doing so many projects with iris data or car data it’s nice to change things up.
Exploring Data
This is a data set of observations of penguins; here are the first few rows so we can get an overview of what we know about them.
penguins %>%
  head()
# A tibble: 6 × 7
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 1 more variable: sex <fct>
I want to see the counts for a few variables to get a feel for how many NAs I have, since the tibble above shows that some entries are NA.
penguins %>%
  count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins %>%
  count(sex)
# A tibble: 3 × 2
sex n
<fct> <int>
1 female 165
2 male 168
3 <NA> 11
penguins %>%
  count(island, species)
# A tibble: 5 × 3
island species n
<fct> <fct> <int>
1 Biscoe Adelie 44
2 Biscoe Gentoo 124
3 Dream Adelie 56
4 Dream Chinstrap 68
5 Torgersen Adelie 52
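For a more direct tally, here is a quick dplyr sketch that counts the NAs in every column:

penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))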
Okay, so I have ~11 entries with NA. I think for this model I’ll try to build a classification model to see if I can correctly predict the sex of a penguin based on its characteristics. I am going to do a little more exploration to see how feasible this is.
penguins %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = .7) +
  facet_wrap(~species)
So notice that for all 3 species of penguins, female penguins have smaller bill lengths, smaller flipper lengths, and generally less body mass. There is some overlap, but nothing very messy. This is a good indication that for all 3 species we can use these features to build our classifier. It’s also interesting that the smallest Gentoos are about as large as the biggest Adelies.
Building A Model
Let’s grab our data and store it in a data frame so we can start to build a model.
penguins_df <- penguins %>%
  filter(!is.na(sex)) %>%
  select(-island)
Now to split the data into testing and training sets.
set.seed(777)
penguin_split <- initial_split(penguins_df, strata = sex)
penguin_training <- training(penguin_split)
penguin_testing <- testing(penguin_split)
The default for initial_split is a 75% split, and while this isn’t a large data set, I can create resamples of the training set to estimate the performance of the models that I train. Using bootstraps, I will create a set of 25 resamples, each with 249 analysis rows (the training set sampled with replacement) and roughly 90 out-of-bag assessment rows, to get more reliable performance estimates out of my training data.
set.seed(2134)
penguin_boot <- bootstraps(penguin_training)
penguin_boot
# Bootstrap sampling
# A tibble: 25 × 2
splits id
<list> <chr>
1 <split [249/91]> Bootstrap01
2 <split [249/89]> Bootstrap02
3 <split [249/89]> Bootstrap03
4 <split [249/84]> Bootstrap04
5 <split [249/93]> Bootstrap05
6 <split [249/82]> Bootstrap06
7 <split [249/89]> Bootstrap07
8 <split [249/94]> Bootstrap08
9 <split [249/92]> Bootstrap09
10 <split [249/96]> Bootstrap10
# ℹ 15 more rows
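Each resample holds a bootstrap sample of the training data plus its out-of-bag rows; rsample’s analysis() and assessment() pull them out. A quick peek at the first one:

first_split <- penguin_boot$splits[[1]]
dim(analysis(first_split))    # 249 rows, sampled with replacement
dim(assessment(first_split))  # the ~91 out-of-bag rows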
Now to make a basic logistic regression model and a second model. What I am doing with these commands is setting up what kind of model I want to train; I haven’t trained anything yet. The second will be a random forest, because they usually perform well with their defaults. I am not worried about tuning parameters, I just want to try out tidymodels.
library(ranger)
glm_spec <- logistic_reg() %>%
  set_engine('glm')

rf_spec <- rand_forest() %>%
  set_mode('classification') %>%
  set_engine('ranger')
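These specs are just descriptions of models, not fitted models. If you want to see how a spec maps onto the underlying engine call, parsnip’s translate() shows you without fitting anything:

glm_spec %>%
  translate()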
Now it’s time to get this all working together in tidymodels. I am going to use workflow for this. A workflow bundles a pre-processor and a model.
penguin_wf <- workflow() %>%
  add_formula(sex ~ .)
penguin_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
sex ~ .
So notice that when we print this we have a preprocessor and no model. There is no model yet because I am going to call this workflow twice, once for each model spec. I can reuse this workflow as many times as I want for different models, which is neat!
glm_results <- penguin_wf %>%
  add_model(glm_spec) %>%
  # fit(data = penguins_df) would fit the model a single time, but I want to
  # use all those resamples I made, so I use fit_resamples() instead
  fit_resamples(resamples = penguin_boot,
                control = control_resamples(save_pred = TRUE, verbose = TRUE))
glm_results
# Resampling results
# Bootstrap sampling
# A tibble: 25 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [249/91]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
2 <split [249/89]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
3 <split [249/89]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
4 <split [249/84]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
5 <split [249/93]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
6 <split [249/82]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
7 <split [249/89]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
8 <split [249/94]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
9 <split [249/92]> Bootstrap09 <tibble [2 × 4]> <tibble [1 × 3]> <tibble>
10 <split [249/96]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
# ℹ 15 more rows
There were issues with some computations:
- Warning(s) x2: glm.fit: fitted probabilities numerically 0 or 1 occurred
Run `show_notes(.Last.tune.result)` for more information.
rf_results <- penguin_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(resamples = penguin_boot,
                control = control_resamples(save_pred = TRUE, verbose = TRUE))
rf_results
# Resampling results
# Bootstrap sampling
# A tibble: 25 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [249/91]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
2 <split [249/89]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
3 <split [249/89]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
4 <split [249/84]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
5 <split [249/93]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
6 <split [249/82]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
7 <split [249/89]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
8 <split [249/94]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
9 <split [249/92]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
10 <split [249/96]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]> <tibble>
# ℹ 15 more rows
Now to compare how they did.
collect_metrics(rf_results)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.901 25 0.00426 Preprocessor1_Model1
2 roc_auc binary 0.964 25 0.00300 Preprocessor1_Model1
# vs
collect_metrics(glm_results)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.910 25 0.00522 Preprocessor1_Model1
2 roc_auc binary 0.975 25 0.00239 Preprocessor1_Model1
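To line the two models up side by side with their standard errors, a quick sketch (the model column is just my own label):

bind_rows(
  collect_metrics(glm_results) %>% mutate(model = 'glm'),
  collect_metrics(rf_results) %>% mutate(model = 'rf')
) %>%
  select(model, .metric, mean, std_err) %>%
  arrange(.metric, model)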
These are very close, within a standard error or two of each other. In this case I’d pick the logistic regression model, as it’s easier to interpret and faster to deploy. Since I saved the predictions, we can start to look at them in more depth for the logistic regression model.
glm_results %>%
  conf_mat_resampled()
# A tibble: 4 × 3
Prediction Truth Freq
<fct> <fct> <dbl>
1 female female 40.7
2 female male 4.2
3 male female 4.08
4 male male 42.2
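Reading off that averaged confusion matrix (each row pairs a prediction with a truth), the per-class rates work out roughly the same for both sexes:

40.7 / (40.7 + 4.08)  # ~0.91 of true females predicted correctly
42.2 / (42.2 + 4.2)   # ~0.91 of true males predicted correctly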
glm_results %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_curve(sex, .pred_female) %>%
  ggplot(aes(1 - specificity, sensitivity, color = id)) +
  geom_abline(lty = 2, color = 'gray80', size = 1.5) +
  geom_path(show.legend = FALSE, alpha = 0.9) +
  coord_equal()
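If you’d rather have numbers than curves, the same grouped predictions feed straight into yardstick’s roc_auc() to get one AUC per resample (a quick sketch):

glm_results %>%
  collect_predictions() %>%
  group_by(id) %>%
  roc_auc(sex, .pred_female)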
Having more data would make these curves less jagged, but this gives us a sense of the variation from resample to resample. Things look pretty good; everything is fairly tightly clustered. It passes the smell test. Now it’s finally time to come back to the testing data to estimate the performance on new data. I’ll use the last_fit function, which fits the model one final time on the training data and evaluates it on the testing data.
penguin_final <- penguin_wf %>%
  add_model(glm_spec) %>%
  last_fit(penguin_split)
collect_metrics(penguin_final)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.893 Preprocessor1_Model1
2 roc_auc binary 0.967 Preprocessor1_Model1
collect_predictions(penguin_final) %>%
  conf_mat(sex, .pred_class)
Truth
Prediction female male
female 40 7
male 2 35
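As a quick check, the test-set accuracy reported above falls straight out of this confusion matrix:

(40 + 35) / (40 + 7 + 2 + 35)  # 75/84, about 0.893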
Not too bad! I can see why some of the females would be classified as males, given the overlap I noted in the exploration section. Overall this took less than an hour, using packages I’ve never used before and reading the documentation for these packages’ workflow as I went.
I want to take a look at the fitted workflow to see what appears in it now, compared to when we last printed a workflow, and we can look at the odds ratios to see how to interpret this model.
penguin_final$.workflow[[1]]
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
sex ~ .
── Model ───────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) speciesChinstrap speciesGentoo bill_length_mm
-91.797559 -8.434696 -11.983135 0.616936
bill_depth_mm flipper_length_mm body_mass_g
1.462226 0.085003 0.006811
Degrees of Freedom: 248 Total (i.e. Null); 242 Residual
Null Deviance: 345.2
Residual Deviance: 90.2 AIC: 104.2
penguin_final$.workflow[[1]] %>%
  tidy(exponentiate = TRUE) %>%
  arrange(estimate)
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.36e-40 16.0 -5.75 0.00000000900
2 speciesGentoo 6.25e- 6 3.27 -3.66 0.000253
3 speciesChinstrap 2.17e- 4 2.05 -4.12 0.0000373
4 body_mass_g 1.01e+ 0 0.00139 4.89 0.00000101
5 flipper_length_mm 1.09e+ 0 0.0546 1.56 0.120
6 bill_length_mm 1.85e+ 0 0.153 4.02 0.0000578
7 bill_depth_mm 4.32e+ 0 0.375 3.90 0.0000948
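As a sanity check, the estimate column here is just exp() applied to the raw coefficients in the model printout above:

exp(1.462226)  # ~4.32, the bill_depth_mm odds ratio
exp(0.006811)  # ~1.007 per gram for body_mass_g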
So for every 1 mm increase in bill depth, the odds of a penguin being male go up by a factor of about 4.3. Bill depth is clearly important for predicting that a penguin is male; I didn’t know it would be such a good predictor. Let’s look at those plots from earlier now that we have the odds ratios.
penguins %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = .7) +
  facet_wrap(~species)
The sexes overlap quite a bit here, so now that we know the odds ratio is much higher for bill_depth_mm, let’s put that on the x-axis and see if the groups separate more cleanly.
penguins %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(bill_depth_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = .7) +
  facet_wrap(~species)
This does seem to separate them out more dramatically than flipper_length_mm. Very cool.