Why use R?
- R has long been the quintessential language for statistics
- Comes batteries included – tons of datasets and data visualization tools
- RStudio & Shiny
(minor) Caveats
- Multiple R versions
- Bazillion packages to do the same thing
- Solution: Use packages from the tidyverse universe if possible
- Installing packages on HPC clusters can sometimes be non-trivial
- Conda 😭
Resources for learning ML with R
Setting up R for ML on HPC
- R and RStudio are already installed on HPC.
- Highly recommend reading the HPC docs to troubleshoot R package installation issues.
Select Interactive Apps, and then from the drop-down menu select RStudio Server.

Fill in the details in the form that opens up, and select Launch.

After the session becomes available, select Connect to RStudio Server.

For today’s examples, install the palmerpenguins and naivebayes packages.
install.packages(c("palmerpenguins", "naivebayes"))
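If the install fails because the system-wide library is read-only, one common workaround is to install into a personal library. The sketch below is only an assumption about a typical setup: the helper name `use_personal_library` is hypothetical, and the HPC docs remain the authoritative source for the recommended location.

```r
# Hypothetical helper: create a personal library directory and put it
# first on R's library search path for the current session.
use_personal_library <- function(path) {
  path <- path.expand(path)                      # expand a leading "~"
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(path, .libPaths()))                # new path takes priority
  invisible(.libPaths()[1])
}

# Usage (not run here): R_LIBS_USER is R's default per-user library location.
# use_personal_library(Sys.getenv("R_LIBS_USER"))
# install.packages(c("palmerpenguins", "naivebayes"))
```

The change made by `.libPaths()` lasts only for the current R session, so the call has to be repeated (or placed in a startup file) in later sessions.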
Incomplete datasets
Realistic datasets, like R’s airquality dataset, often come with missing values.
- Remove observations with missing entries
- Fill the missing entries
- Use models / algorithms that can account for missing entries (semi-supervised learning)
Download R script: data_prelim.R
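The first two options can be sketched with base R on the airquality dataset. This is a minimal illustration, not necessarily what data_prelim.R does; in particular, filling with the column mean is just one assumed imputation strategy.

```r
data(airquality)

# Count the missing entries in each column
print(colSums(is.na(airquality)))

# Option 1: remove observations (rows) with any missing entry
complete <- na.omit(airquality)

# Option 2: fill missing entries, here with the column mean (an assumed choice)
imputed <- airquality
for (col in names(imputed)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- mean(imputed[[col]], na.rm = TRUE)
}

print(nrow(airquality))
print(nrow(complete))
print(anyNA(imputed))
```

Removing rows is simple but discards data, while mean imputation keeps every row at the cost of shrinking the variance of the filled columns.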
Clustering 🐧
Cluster penguins into groups based on their bill features

Artwork by @allison_horst
- We will use $k$-means clustering to cluster the penguins
- $k$-means clustering partitions $n$ observations into $k$ clusters
- Each observation belongs to the cluster with the nearest mean (centroid)
- You have to specify the number of clusters
- Penguin data comes from the palmerpenguins dataset
Download R script: penguins_kmeans.R
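The points above can be sketched with base R's `kmeans()`. This is an illustration that may differ from penguins_kmeans.R: the helper name `cluster_by_features`, the choice of `k = 3`, and the decision to standardize the features are all assumptions.

```r
# Hypothetical helper: k-means on selected numeric columns of a data frame.
cluster_by_features <- function(df, features, k, seed = 1) {
  x <- as.data.frame(na.omit(df[, features, drop = FALSE]))  # drop incomplete rows
  set.seed(seed)                           # k-means starts from random centroids
  fit <- kmeans(scale(x), centers = k, nstart = 25)  # standardize, 25 restarts
  x$cluster <- factor(fit$cluster)         # cluster label for each observation
  x
}

# Usage: runs only if the palmerpenguins package is installed
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  penguins <- palmerpenguins::penguins
  clustered <- cluster_by_features(
    penguins, c("bill_length_mm", "bill_depth_mm"), k = 3
  )
  print(table(clustered$cluster))
}
```

Standardizing with `scale()` keeps bill length, which spans a wider numeric range, from dominating the distance computation; `nstart = 25` reruns the algorithm from several random starts and keeps the best result.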
🍄 classification
Classify mushrooms as edible or poisonous based on their physical features
Full dataset: Mushroom
Download R script with Naive Bayes classifier: mushroom_naivebayes.R
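As a rough sketch of the workflow, the naivebayes package can fit a classifier on categorical features and predict a class for unseen rows. The tiny data frame below is invented purely for illustration and is not taken from the Mushroom dataset; the column names are hypothetical, and mushroom_naivebayes.R may be organized differently.

```r
# Toy data made up for illustration only; NOT rows from the Mushroom dataset.
toy <- data.frame(
  odor      = factor(c("none", "none", "foul", "foul",
                       "almond", "foul", "none", "almond")),
  cap_color = factor(c("brown", "white", "brown", "white",
                       "brown", "brown", "white", "white")),
  class     = factor(c("edible", "edible", "poisonous", "poisonous",
                       "edible", "poisonous", "edible", "edible"))
)

# Runs only if the naivebayes package is installed
if (requireNamespace("naivebayes", quietly = TRUE)) {
  # laplace = 1 adds Laplace smoothing so unseen feature values
  # do not force a zero probability
  model <- naivebayes::naive_bayes(class ~ odor + cap_color,
                                   data = toy, laplace = 1)

  # New observations contain the feature columns only
  new_mushrooms <- data.frame(
    odor      = factor(c("foul", "none"), levels = levels(toy$odor)),
    cap_color = factor(c("brown", "white"), levels = levels(toy$cap_color))
  )
  pred <- predict(model, newdata = new_mushrooms, type = "class")
  print(pred)
}
```

With the real dataset, a held-out test split would be needed to judge how well the classifier separates edible from poisonous mushrooms.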