ML with R on HPC

Why use R?

Multiple R versions
Bazillion packages to do the same thing
- Solution: Use packages from the tidyverse where possible
Installing packages on HPC clusters can sometimes be non-trivial
- Solution: Read the HPC docs
Conda 😭

R and RStudio are already installed on HPC.
We highly recommend reading the HPC docs to troubleshoot R package installation issues.

Navigate to https://ood.hpc.arizona.edu/. After login, you will see the Open OnDemand dashboard.

Select Interactive Apps, and then from the drop-down menu select RStudio Server.

Fill in the details in the form that opens up, and select Launch.

After the session becomes available, select Connect to RStudio Server.

For today’s examples, install the palmerpenguins and naivebayes packages.

install.packages(c("palmerpenguins", "naivebayes"))

Real-world datasets, such as R’s built-in airquality dataset, often contain missing values.

Remove observations with missing entries
Fill the missing entries
Use models / algorithms that can account for missing entries (semi-supervised learning)
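
The first two options can be sketched in base R on the airquality dataset. This is a minimal illustration, not the contents of data_prelim.R; filling with the column mean is an assumption made here for simplicity, and other fill strategies (median, model-based imputation) may suit your data better.

```r
# airquality ships with base R (datasets package)
data(airquality)

# Count missing values per column to see where the gaps are
colSums(is.na(airquality))

# Option 1: remove observations (rows) that have any missing entry
aq_complete <- na.omit(airquality)

# Option 2: fill missing entries, here with the column mean (an assumption)
aq_filled <- airquality
for (col in names(aq_filled)) {
  missing <- is.na(aq_filled[[col]])
  if (any(missing)) {
    aq_filled[[col]][missing] <- mean(aq_filled[[col]], na.rm = TRUE)
  }
}

# Compare how many rows each approach keeps
nrow(airquality)
nrow(aq_complete)
nrow(aq_filled)
```

Removing rows is simple but discards data; filling keeps every row at the cost of introducing values that were never observed.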

Download R script: data_prelim.R

Cluster penguins into groups based on their bill features

Download R script: penguins_kmeans.R
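
As a rough idea of what clustering on bill features can look like, here is a minimal sketch using base R's kmeans() on the palmerpenguins data. It is not the contents of penguins_kmeans.R. The choice of three clusters, the standardization step, and the seed are assumptions made for this illustration.

```r
# Refer to the package's dataset explicitly to avoid any name clash
penguins <- palmerpenguins::penguins

# Keep the two bill measurements and drop rows with missing values
bills <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])

# Standardize so both features contribute comparably to the distance
bills_scaled <- scale(bills)

# k-means starts from random centers, so fix the seed for reproducibility
set.seed(42)

# centers = 3 is an assumption (one cluster per species is a natural guess)
fit <- kmeans(bills_scaled, centers = 3, nstart = 25)

# Cluster sizes and cluster centers (in standardized units)
table(fit$cluster)
fit$centers
```

Using nstart = 25 reruns the algorithm from several random starts and keeps the best result, which makes the clustering less sensitive to an unlucky initialization.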

Classify mushrooms as edible or poisonous based on their physical features

Full dataset: Mushroom

Download R script with Naive Bayes classifier: mushroom_naivebayes.R
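
To show the general shape of a Naive Bayes workflow with the naivebayes package, here is a minimal sketch. It is not the contents of mushroom_naivebayes.R. The small table below is hypothetical, made up purely for illustration, and is not the Mushroom dataset; the column names, the train/test split, and laplace = 1 are likewise assumptions.

```r
library(naivebayes)

# Hypothetical toy table, invented for illustration; NOT the Mushroom dataset
mushrooms <- data.frame(
  odor      = c("none", "foul", "none", "almond", "foul", "none", "almond", "foul"),
  cap_color = c("brown", "white", "white", "brown", "brown", "brown", "white", "white"),
  class     = c("edible", "poisonous", "edible", "edible",
                "poisonous", "edible", "edible", "poisonous"),
  stringsAsFactors = TRUE
)

# Split into training and test rows (both keep the same factor levels)
train <- mushrooms[1:6, ]
test  <- mushrooms[7:8, ]

# Fit the classifier; laplace = 1 adds smoothing so that unseen
# feature/class combinations do not get zero probability
model <- naive_bayes(class ~ odor + cap_color, data = train, laplace = 1)

# Predict classes for the held-out rows, passing only the feature columns
pred <- predict(model, newdata = test[, c("odor", "cap_color")], type = "class")

# Confusion table of predicted vs. actual labels
table(predicted = pred, actual = test$class)
```

With the real dataset, the same pattern applies: read the data with categorical columns as factors, split it, fit on the training part, and evaluate on the held-out part.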