
Principal Component Analysis (PCA) Introduction

In this post we begin a short series on Principal Component Analysis (PCA). This first post walks through the basic steps of PCA: we start by getting our data into the proper format and end by computing the k-dimensional Principal component representation of any observation in the data. We obtained the data from Kaggle at https://www.kaggle.com/danofer/india-census, whose original source comes in part from https://github.com/nishusharma1608/India-Census-2011-Analysis. For a slightly more in-depth introduction to PCA in Python we recommend this article.

The objective of PCA is to reduce the number of variables needed to characterize a collection of observations while losing as little important information as possible. For example, if we have the variables Population, Male Population, and Female Population for a given district (our observation), we can remove one of the three columns from the table, since Male Population plus Female Population equals Population. PCA does a similar but more sophisticated version of this: it finds linear combinations of the columns and keeps the combinations that contain the most information, our so-called Principal components, where in our case information is measured by variance.

# packages used
library(here)
library(dplyr)

Our data comes from the 2011 census of India, and our goal is to find its most important Principal components. After reading in the data below, we examine some of its features.

india_census_data <- here("csv", "kaggle_2020_dec_project",
                          "india-districts-census-2011.csv") %>%
    read.csv()

We start our examination of the data by looking at the dimensions.

dim(india_census_data)
## [1] 640 118

This tells us that we have 640 observations with 118 variables. We print out the names of the first 12 variables below.

india_census_data %>% colnames() %>% head(12)
##  [1] "District.code"   "State.name"      "District.name"   "Population"     
##  [5] "Male"            "Female"          "Literate"        "Male_Literate"  
##  [9] "Female_Literate" "SC"              "Male_SC"         "Female_SC"

Upon closer inspection we can see that District.code is just the numbers 1 to 640 and District.name is similarly unique to each row, so neither will be useful for our analysis. In the next chunk of code we remove these two columns, designate State.name as our response variable, and keep only the remaining numeric columns to ensure our PCA works properly. We then print out the first 6 rows and columns of our cleaned-up matrix X, which now contains 115 numeric variables.

# use "State.name" as response variable
y = india_census_data["State.name"]
# 
X <- dplyr::select_if(india_census_data, is.numeric)
X <- subset(X, select = -1)
X[1:6, 1:6]
##   Population   Male Female Literate Male_Literate Female_Literate
## 1     870354 474190 396164   439654        282823          156831
## 2     753745 398041 355704   335649        207741          127908
## 3     133487  78971  54516    93770         62834           30936
## 4     140802  77785  63017    86236         56301           29935
## 5     476835 251899 224936   261724        163333           98391
## 6     642415 345351 297064   364109        224469          139640

Notice that X should contain some highly correlated columns, since the Population total should be the sum of the Male and Female totals. The same should hold for the Literate column and many other columns that we do not list here.
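
As a quick sanity check of this claim we can compare a few of these columns directly; if the totals line up as expected, all.equal should return TRUE and the correlation should be essentially 1.

# sanity check: Population should be the sum of Male and Female, so these
# columns should be (almost) perfectly correlated
all.equal(X$Population, X$Male + X$Female)
cor(X$Population, X$Male + X$Female)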

Now that our data is cleaned up we need to standardize the columns of X to have mean 0 and standard deviation 1.

# step 1: standardize columns
X_std <- scale(X)
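
As a quick check that scale() behaved as expected, the column means should now be (numerically) zero and the standard deviations should be one.

# check a few standardized columns: means of ~0 and standard deviations of 1
round(colMeans(X_std)[1:3], 12)
apply(X_std[, 1:3], 2, sd)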

With our data standardized we compute the covariance matrix of X_std. Printing out its first 6 rows and columns, we can see that our remark about highly correlated columns holds true.

# step 2: calculate covariance matrix
covariance_matrix <- cov(X_std)
print(round(covariance_matrix[1:6,1:6], 3))
##                 Population  Male Female Literate Male_Literate Female_Literate
## Population           1.000 0.999  0.999    0.973         0.983           0.952
## Male                 0.999 1.000  0.996    0.971         0.983           0.948
## Female               0.999 0.996  1.000    0.973         0.981           0.954
## Literate             0.973 0.971  0.973    1.000         0.997           0.995
## Male_Literate        0.983 0.983  0.981    0.997         1.000           0.985
## Female_Literate      0.952 0.948  0.954    0.995         0.985           1.000
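
Note that because the columns were standardized first, this covariance matrix is the same as the correlation matrix of the original data X; a quick check of that equivalence is below.

# the covariance of the standardized data equals the correlation of the original data
all.equal(covariance_matrix, cor(X))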

We now calculate the Eigenvalues and corresponding Eigenvectors.

# step 3: calculate Eigenvalues and Eigenvectors
# note that eigen sorts Eigenvalues in descending order and corresponding
# Eigenvectors are the columns of a p by p matrix and have unit length
pairs <- eigen(covariance_matrix)

With the Eigenvalues ordered from largest to smallest (which the eigen function does automatically), we can use the Eigenvectors corresponding to the largest Eigenvalues to create a projection matrix. This matrix transforms an observation expressed in our original variables into one expressed in k new variables, where k is at most our total number of variables; in our case, k is at most 115.
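
We can quickly check both claims about eigen, namely that the Eigenvalues come back in descending order and that the Eigenvectors (the columns) have unit length.

# Eigenvalues should be sorted from largest to smallest and each Eigenvector
# column should have a norm of 1
all(diff(pairs$values) <= 0)
sqrt(colSums(pairs$vectors^2))[1:6]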

So how do we choose k? This can be done in a variety of ways. In some cases you may simply have a specific number of variables you need to work with. More likely, you will want to keep a certain percentage of the information, measured by the variance retained. You may also choose to keep adding components only up to the point where the gain in information becomes minimal. We will consider these last two situations.

Below we determine how many Principal components we need to keep in order to retain 90% of the information in the original 115 variables. We calculate this by summing our Eigenvalues from largest to smallest until the ratio of this running sum to the total of all Eigenvalues reaches 0.9, or 90%.

# step 4: determine how many PCA variables we keep given
# required amount of information (variance) kept
# variance explained by each component and the total variance
individual_var <- pairs$values
total_var <- sum(individual_var)

proportion_info_req <- 0.9

kept_var <- 0
proportion_var <- 0
num_kept_vars <- 0
for (value in pairs$values){
    num_kept_vars <- num_kept_vars + 1
    kept_var <- kept_var + value
    proportion_var <- kept_var / total_var
    if (proportion_var >= proportion_info_req){
        break
    }
}
print(num_kept_vars)
## [1] 12

Our result is that we need 12 Principal components to keep 90% of the variance from our original 115 variables.
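
For a more compact alternative to the loop above, the same number can be computed with cumsum; the line below should also return 12.

# equivalent one-liner: first index where the cumulative proportion reaches 90%
which(cumsum(pairs$values) / sum(pairs$values) >= proportion_info_req)[1]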

Before we move on, let's examine graphically how the retained information grows as we add Eigenvalues, starting with the largest and ending with the smallest. We also draw lines indicating the 90% threshold and the 12 Principal components we selected.

cumsum_var <- cumsum(individual_var)
cumsum_info <- cumsum_var / total_var
plot(cumsum_info[1:30], pch=21, bg="lightgray", col="black",
     ylab="Proportion of Information",
     xlab="Number of Principal Components",
     main="Information kept based on number of Principal Components")
abline(h=0.9, col="blue")
abline(v=12, col="red", lty=2)
legend(x = "bottomright",
       legend = c("Kept Information", "Kept Principal Components"),
       lty = c(1, 2),
       col = c("blue", "red"),
       lwd = 1) 

As we can see in the graph, we keep just a little over 90% of the information when we include 12 Principal components. Indeed, with just 30 Principal components we keep nearly all of the information.
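
The other selection strategy mentioned earlier, stopping once the gain in information becomes minimal, could be formalized as in the sketch below, where the 1% threshold is an arbitrary choice made only for illustration.

# keep only components that each contribute at least 1% of the total variance;
# since the Eigenvalues are sorted, this acts as a simple "elbow"-style cutoff
prop_each <- individual_var / total_var
sum(prop_each >= 0.01)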

We finish by calculating the projection matrix which we use to transform our first observation from the original 115 variables to our 12 Principal components.

# step 5: calculate projection matrix
proj_matrix <- pairs$vectors[,1:num_kept_vars]
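
A quick dimension check confirms the shape of the projection matrix: one row per original variable and one column per kept Principal component.

# expect 115 rows (original variables) and 12 columns (kept components)
dim(proj_matrix)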

With our projection matrix ready we now calculate and print out our first observation in the new variables.

obs_1 <- X_std[1,]
pca_obs_1 <- t(proj_matrix) %*% obs_1
print(pca_obs_1)
##              [,1]
##  [1,]  6.10359217
##  [2,]  0.80440950
##  [3,] -0.43622664
##  [4,] -0.34573825
##  [5,]  0.51587120
##  [6,]  2.46998160
##  [7,]  2.15887194
##  [8,]  2.70138695
##  [9,] -1.25259529
## [10,]  0.08438091
## [11,]  0.09110779
## [12,]  1.70265280

Notice that these new variables are no longer labeled. Indeed, this method does not allow us to directly interpret which of the original variables are most significant. The big advantage is that we can retain 90% of the information about the variance of the data, which originally required 115 variables, using only 12 Principal components. This can make analyses of data with many variables much quicker.
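
As a final sketch, the same projection can be applied to all observations at once; the first row of the result should match the observation transformed above, and the scores should agree with R's built-in prcomp up to the sign of each component.

# project all 640 standardized observations onto the 12 Principal components;
# the first row should match pca_obs_1 computed above
pca_scores <- X_std %*% proj_matrix
dim(pca_scores)

# for comparison, prcomp(X, scale. = TRUE)$x[, 1:num_kept_vars] should give
# the same scores, possibly with the sign of some columns flipped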

In future posts in this PCA series we plan to: create an R function to automate some of the PCA steps done above; explore how PCA affects computational performance and prediction when Principal components are used in a machine learning algorithm; and apply PCA to image processing and storage.