Principal Component Analysis (PCA) Function

In this post we create a function to perform PCA (Principal Component Analysis). We modify code from our PCA introduction post to create this function. As in our previous post we will use India census data from Kaggle at https://www.kaggle.com/danofer/india-census which originated from https://github.com/nishusharma1608/India-Census-2011-Analysis.

Our first step is loading up the packages here and dplyr.

# packages used
library(here)
library(dplyr)

Next we load our data.

filename <- "india-districts-census-2011.csv"
india_census_data <- here("csv",
                          "kaggle_2020_dec_project",
                          filename) %>%
    read.csv()

We now clean our data so that we have an entirely numeric covariate matrix X. We also extract State.name as a response variable though we do not use it in this post.

# use "State.name" as response variable
y = india_census_data["State.name"]
# extract numeric columns of census data
X <- dplyr::select_if(india_census_data, is.numeric)
X <- subset(X, select = -1)

Our next step is to create a function which applies PCA to numeric covariate matrices. The function takes a numeric covariate matrix for the data parameter. We allow the user to perform PCA in two different ways specified by the model parameter. The default setting for model is ‘info’ which means we need to specify how much information due to variance we want to keep after applying PCA. This is done by specifying the parameter input as a number between 0 and 1 for the proportion of information kept. If model is set to fixed we need to specify the number of dimensions (or columns) to keep after applying PCA. This is done by specifying the parameter input as a number between 0 and the total number of dimensions in the original covariate matrix. The parameter full determines what info we return. If full is FALSE we only return a projection matrix for the given covariate matrix. If full is TRUE we return a list containing in order: the projection matrix, the number of dimensions kept, and the true proportion of information kept.

pca_function <- function(data, input, model="info", full=FALSE){
    # the data parameter should be a p by k 
    #   covariate matrix with p observations
    #   and k dimensions
    #
    # use model to change between input types:
    #   model = "fixed" returns a fixed number of 
    #       variables where input must be integer less
    #       than or equal total original variables
    #   model = "info" returns a number of variables
    #       need to keep a certain amount of information
    #       where info must be between 0 and 1
    #       note: if input is 1 then we keep all variables
    #       even if rounding could result in fewer.
    # default is model="info"
    # 
    # use full = TRUE to include dimension and true
    #   proportion of information kept along with
    #   projection matrix
    # use full = FALSE to only include projection matrix
    
    if (model == "info"){
        proportion_info_req = input
    } else if (model=="fixed"){
        if (((input*2) %% 2 == 0) && (input >= 0)){
            kept_var = input
        }
        else {
            return()
        }
    } else {
        return()
    }
    
    # step 1: standardize columns
    std_data <- scale(data)
    
    # step 2: calculate covariance matrix
    covariance_matrix <- cov(std_data)
    
    # step 3: calculate Eigenvalues and Eigenvectors
    # note that eigen sorts Eigenvalues in descending
    # order and corresponding Eigenvectors are the 
    # columns of a p by p matrix and have unit length
    pairs <- eigen(covariance_matrix)
    
    # step 4: determine how many PCA variables we keep
    # given required amount of information (variance) kept
    p <- dim(covariance_matrix)[1]
    
    individual_var = rep(0, p)
    total_var = 0
    for (index in 1:p){
        individual_var[index] = pairs$values[index]
    }
    total_var = sum(individual_var)
        
    if (model == "info"){    
        kept_var = 0
        proportion_var = 0
        k = 0  # number of kept variables
        for (value in pairs$values){
            k = k + 1
            kept_var = kept_var + value
            proportion_var = kept_var / total_var
            if (proportion_var >= proportion_info_req){
                break
            }
        }
    } else if (model == "fixed"){
        k = min(kept_var, p)
    }
    
    # step 5: calculate projection matrix
    proj_matrix <- pairs$vectors[,1:k]
    
    # step 6: return solution
    if (!full){
        return(proj_matrix)
    } else {
        retained_info <- sum(individual_var[1:k]) / total_var
        return(list(proj_matrix, k, retained_info))
    }
}

We now test our function. In our previous post on PCA we wanted to keep 90% of the information contained in the variance. This resulted in us going from 115 to 12 dimensions of data. We check that the results are the same when using our function.

# Using function with information criteria
proj_matrix <- pca_function(X, 0.9)
X_std <- scale(X)
obs_1 <- X_std[1,]
method1 <- t(proj_matrix) %*% obs_1
method1 %>% print()

##              [,1]
##  [1,]  6.10359217
##  [2,]  0.80440950
##  [3,] -0.43622664
##  [4,] -0.34573825
##  [5,]  0.51587120
##  [6,]  2.46998160
##  [7,]  2.15887194
##  [8,]  2.70138695
##  [9,] -1.25259529
## [10,]  0.08438091
## [11,]  0.09110779
## [12,]  1.70265280

# Using function with number of columns kept criteria
proj_matrix <- pca_function(X, 12, model="fixed")
X_std <- scale(X)
obs_1 <- X_std[1,]
method2 <- t(proj_matrix) %*% obs_1
identical(method1, method2) %>% print()

## [1] TRUE

Above we print our results from requiring our function to keep at least 90% of the information. We next confirm that requiring 12 dimensions results in the same list of values using the identical function. Additionally, we can check that this result is the same as in our previous PCA post.

Now that we are confident that our function works properly, we want to see what happens when full is set equal TRUE. Trying this while requiring that we keep at least 90% of the information in the variance we look at the structure of our output. We can see that we have a list containing 3 objects. Our first object is a 115 by 12 matrix. This matrix allows us to transform our data from 115 dimensions to only 12 dimensions. The next object contains the number of dimensions we will keep after transforming. Our final object contains the true proportion of information that we keep. In this case we actually keep a proportion of 0.903 of our information, just slightly higher than 0.9.

output <- pca_function(X, .9, full=TRUE)
str(output) %>% print()

## List of 3
##  $ : num [1:115, 1:12] -0.121 -0.121 -0.121 -0.123 -0.123 ...
##  $ : num 12
##  $ : num 0.903
## NULL

Now that we have a working PCA function our next posts on PCA will focus on the applications of PCA.