Principal Component Analysis - PCA
PCA is a dimensionality reduction algorithm: it finds the directions of largest variance in high-dimensional data and projects the data onto a new subspace with as many or fewer dimensions than the original. PCA is also used as a visualization tool, since high-dimensional data can be projected into a two- or three-dimensional space for plotting.
Algorithm
The algorithm is as follows:
- Standardize the data.
- Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
- Sort eigenvalues in descending order and choose the \(k\) eigenvectors that correspond to the \(k\) largest eigenvalues where \(k\) is the number of dimensions of the new feature subspace (\(k\leq d\)).
- Construct the projection matrix \(W\) from the selected \(k\) eigenvectors.
- Transform the original dataset \(X\) via \(W\) to obtain a \(k\)-dimensional feature subspace \(Y\).
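The steps above can be sketched end to end with NumPy. This is a minimal illustration on randomly generated data (the dataset, its shape, and the choice of k are assumptions for the example, not part of the original):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical dataset: 100 samples, 5 features
k = 2                          # target dimensionality

# 1. Standardize the data (zero mean, unit variance per feature).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigendecomposition of the covariance matrix.
cov = np.cov(X_std.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order

# 3. Sort eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# 4. Projection matrix W: the k leading eigenvectors as columns.
W = eigenvectors[:, :k]

# 5. Transform X to obtain the k-dimensional feature subspace Y.
Y = X_std @ W
print(Y.shape)  # (100, 2)
```

Note that `np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.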
import numpy as np
import scipy.linalg as la


def get_eigenvectors(data_matrix, k):
    """
    Computes the eigenvectors of the covariance matrix of data_matrix
    and returns the k eigenvectors with the largest eigenvalues,
    sorted in descending order of eigenvalue.
    """
    # Rows of data_matrix are samples, so transpose for np.cov.
    cov_matrix = np.cov(data_matrix.T)
    n = cov_matrix.shape[0]
    # eigh returns eigenvalues in ascending order; request only the top k.
    _, eigenvectors = la.eigh(cov_matrix, subset_by_index=[n - k, n - 1])
    # Reverse the column order so the eigenvectors are sorted descending.
    return np.flip(eigenvectors, axis=1)


def get_mean(data_matrix):
    """
    Calculates the column-wise mean of the data_matrix.
    """
    return np.mean(data_matrix, axis=0)


def compress_pca_vector(vector, mean, eigenvectors):
    """
    Projects a centered vector onto the eigenvector subspace and
    reconstructs it in the original space.
    """
    p = (vector - mean).reshape(-1, 1)
    return (eigenvectors @ (eigenvectors.T @ p)) + mean.reshape(-1, 1)


if __name__ == '__main__':
    import load_data as ld

    data = ld.load_data()
    print(compress_pca_vector(data['vecTest'], data['mTest'], data['vEigenTest1']))
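As a sanity check on the projection logic in `compress_pca_vector`: when all d eigenvectors are kept, the eigenvector matrix is orthogonal, so the projection followed by reconstruction recovers the original vector exactly. A minimal self-contained check, using random data in place of `load_data` (which is not available here):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4))  # hypothetical stand-in for the loaded dataset
mean = data.mean(axis=0)

# Full eigendecomposition of the covariance matrix (all 4 eigenvectors).
_, eigenvectors = np.linalg.eigh(np.cov(data.T))

vector = data[0]
p = (vector - mean).reshape(-1, 1)
# With the full eigenbasis E, the projection is lossless: E @ E.T = I.
reconstructed = eigenvectors @ (eigenvectors.T @ p) + mean.reshape(-1, 1)
print(np.allclose(reconstructed.ravel(), vector))  # True
```

With fewer than d eigenvectors the result is instead the closest approximation of the vector within the chosen subspace, which is what makes the projection a (lossy) compression.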