# Guide to Principal Component Analysis in Data Science

## Simplifying Data with Principal Component Analysis: A Beginner's Guide

**Origin of PCA****Approach of PCA****How to do PCA****When is PCA used****What PCA is and is not****Advantages and Disadvantages of PCA****Metric of Evaluation****After PCA, what Next?****Case Study of the Cocktail recipe Dataset**

**Origin of PCA**

PCA was first developed by Karl Pearson in 1901 and later developed by Hotelling in 1933.Pearson was a statistician and what he was working on when he developed PCA isn't entirely documented but we can suggest two things that could have prompted the development of the mathematical technique which are:

Biometry(The application of Statistical methods to Biology)

Factor Analysis

PCA applies an Orthogonal linear transformation to reframe the original correlated data in a new coordinate framework called Principal Components which are linearly uncorrelated. The largest variance(at least 80%) is usually explained in the First Component i.e. PC1 and the least component has the least Variance

## Approach of PCA

There are two main approach of the PCA which are:

Eigenvalue decomposition of the data covariance matrix

Singular-Value Decomposition of the centered Data matrix

PCA is commonly used when many of the variables are highly correlated with each other and it's desirable to reduce their number to an independent set, the idea of PCA is to reduce the number of variables of a dataset while preserving as much information as possible

## How to do PCA

This steps are done in the backend of the tool used to create the Principal Components

Standardize the range of continuous initial variables

Compute the covariance matrix to identify correlation

Compute the Eigenvalues and Eigenvectors of the covariance matrix to identify the Principal Components

Create a feature vector to decide which Principal Components to keep

Replot the data along the PC axes

## When is PCA used?

PCA is used when we wish to reduce the dimensionality of a large dataset to a small and explains important information in the data

## What PCA is and isn't

PCA is:

Is a Mathematical Technique and an Unsupervised Learning method(i.e. we use unlabeled data to process our data)

Is a Dimensionality Reduction Technique

PCA is not:

Not a data Cleaning method

Not a data Visualization technique

Not a data Transformation Technique

Not a feature selection method

Not a Model or algorithm

Not a perfect solution to all cases

## Advantages and Disadvantages of PCA

### Advantages

Removes Noise from Data

Removes Multicollinearity

Reduces Model Parameter

Improves Model Performance

Reduces Computational Cost

### Disadvantages

No Feature Interpretability

Only offers Linear Dimensionality Reduction

Affected by Outliers

Loss of Information

High Run-time

## Metrics of Evaluation

There isn't a single Evaluation Metric that is generally accepted for PCA, as it measures the proportion of the total variance in the data that is captured by the Principal Component.

Explained Variance

Scree Plot

Application-Specific Metric

## After PCA, What Next?

After PCA has been done on a large dataset, the components can be used for the following

Visualization

Scatter Plot

Parallel Coordinates

Machine Learning - If you plan to use the data for ML task, PCA may be a good preprocessing task as using the transformed data can potentially improve model performance

Anomaly Detection - PCA can be used to identify data points that are significantly different from the lower-dimensional representation captured by the data

## Case Study

### Boston Cocktail Recipe Dataset

```
# Loading the Necessary Libraries
library(tidyverse) # A Powerful package used for Data Antasks
library(tidymodels) # A Powerful package used for Building Models
library(janitor)
# Loading the Dataset
boston <- read_csv("Machine Learning Dataset/boston_cocktails.csv")
glimpse(boston)
```

The glimpse() function helps us to understand the structure of our data, we can also use the skim(), skim_without_chart() function from the "skimr" package and many more

We have 4 character variables and 2 double variables, we still need to clean the data and put it in a format that we can run our Unsupervised learning method on

```
boston_parsed <-
boston |>
mutate(
ingredient = str_to_lower(ingredient),
ingredient = str_replace_all(ingredient, "-", " "),
ingredient = str_remove(ingredient, "liqueur|(if desired)"),
ingredient = case_when(
str_detect(ingredient, "bitters") ~ "bitters",
str_detect(ingredient, "orange") ~ "orange juice",
str_detect(ingredient, "lemon") ~ "lemon juice",
str_detect(ingredient, "lime") ~ "lime juice",
str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
TRUE ~ ingredient
),
measure = case_when(
str_detect(ingredient, "bitters") ~ str_replace(measure, "oz$", "dash"),
TRUE ~ measure
),
measure = str_replace(measure, " ?1/2",".5"),
measure = str_replace(measure, " ?3/4", ".75"),
measure = str_replace(measure, " ?1/4", ".25"),
measure_number = parse_number(measure),
measure_number = ifelse(str_detect(measure, "dash$"),measure_number / 50, measure_number)
) |>
add_count(ingredient) |>
filter(n > 15) |>
select(-n) |>
distinct(row_id, ingredient, .keep_all = TRUE)
boston_parsed
```

```
boston_df <-
boston_parsed |>
select(-row_id, -ingredient_number, -measure) |>
pivot_wider(names_from = ingredient,values_from = measure_number,
values_fill = 0) |>
clean_names() |>
na.omit()
boston_df
```

pivot_wider() is used to put our data in a wider format and the clean_names() is a function from the lubridate package

```
boston_pca <-
boston_df |>
recipe( ~.) |>
update_role(name, category,new_role = "id") |>
step_normalize(all_predictors()) |>
step_pca(all_predictors())
prep_pca <- prep(boston_pca)
prep_pca
```

The recipe() function is from the recipe package and offers numerous data preprocessing functions and steps we used the step_normalize() to center and scale our data and use the the update_role() to tell our algorithm that the columns selected are nor required in the learning and should be used only for "id" which can be seen in the image below

```
tidied_pca <- tidy(prep_pca,2)
tidied_pca
```

The tidy() function from the tidyr package can be used to extract our Principal Components

```
tidied_pca |>
filter(component %in% paste0("PC", 1:5)) |>
mutate(component = fct_inorder(component)) |>
ggplot(aes(value,terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(component), nrow = 1) +
labs(Y = NULL)
```

The final chart is showing the first 5 Principal Components and as stated earlier, the first 2 Principal Components i.e. PC1 and PC2 has more variance and explains more than 90% of the data dimensions