Principal component analysis
In this chapter we will discuss how to use the PCA method implemented in mdatools. Besides that, we will use the PCA examples to introduce several principles that are common for most of the other methods available in this package (e.g. PLS, DD-SIMCA, PLS-DA, etc.). This includes model and result objects, computing and visualizing performance statistics, validation, the use of different kinds of plots, and so on.
Principal component analysis is one of the methods that decompose a data matrix \(\mathbf{X}\) into a combination of three matrices: \(\mathbf{X} = \mathbf{TP}^\mathrm{T} + \mathbf{E}\). Here \(\mathbf{P}\) is a matrix of unit vectors defined in the original variable space. The unit vectors, also known as loadings, form a new basis: the principal components. The components are mutually orthogonal and oriented in the variable space so that they capture the directions of maximum variation of the data points.
The data points are projected onto the principal components. The coordinates of these projections in the principal component space, known as scores, form the matrix \(\mathbf{T} = \mathbf{XP}\). The product of scores and loadings, \(\mathbf{TP}^\mathrm{T}\), gives the coordinates of the projections in the original variable space. Matrix \(\mathbf{E}\) contains the residuals: the differences between the positions of the projected data points and their original locations. These differences form the part of the data variation that the PCA model does not capture, hence the name.
If the original data matrix has \(I\) rows (observations, objects) and \(J\) columns (variables), and the PCA decomposition is made with \(A\) components, then matrix \(\mathbf{P}\) has dimension \(J\times A\), matrix \(\mathbf{T}\) has dimension \(I\times A\), and both \(\mathbf{TP}^\mathrm{T}\) and \(\mathbf{E}\) have the same dimension as the original data matrix \(\mathbf{X}\).
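To make the decomposition more concrete, here is a minimal sketch in base R (not using mdatools itself, whose `pca()` function is introduced below) that computes a two-component decomposition via SVD; the `iris` data and the number of components are arbitrary choices for illustration:

```r
# mean center the data (PCA is normally applied to centered data)
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
A <- 2                      # number of components

s <- svd(X)
P <- s$v[, 1:A]             # loadings, J x A
T <- X %*% P                # scores, I x A (masks R's shorthand T for TRUE)
E <- X - T %*% t(P)         # residuals, I x J

dim(P); dim(T); dim(E)      # J x A, I x A, I x J
```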
The relationship between the data objects and the principal component space (the PCA model) can be described using two distances (in some literature they are also called residual distances): the orthogonal distance, OD, and the score distance, SD. The orthogonal distance is a squared Euclidean distance between the position of an object and its projection, measured in the original variable space. It can be computed by taking the sum of squared values of matrix \(\mathbf{E} = \{e_{ij}\}\) along every row:
\[q_i = \sum_{j=1}^{J} e_{ij}^2\]
This distance is usually denoted as \(Q\) or \(q\). It can be considered a lack-of-fit measure for a particular object.
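Continuing the base R sketch from above, the orthogonal distances are simply the row sums of the squared residuals:

```r
# orthogonal distance q for every object: sum of squared residuals per row
q <- rowSums(E^2)
head(q)
```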
The score distance is a squared Mahalanobis distance between the projection of an object and the origin of the principal component space. It is a measure of the extremeness of an object and is usually denoted as \(h\) or \(T^2\). The latter notation is used because the Hotelling \(T^2\) distribution is often employed to describe the distribution of the score distance values, so in many software packages the distance is called the Hotelling \(T^2\) distance. The distance can be computed using standardized scores (so that score values for every component have unit variance):
\[h_i = \sum_{a = 1}^{A} \frac{t_{ia}^2}{\lambda_a} \]
Here \(\lambda_a\) is the eigenvalue for the \(a\)-th principal component, which equals the variance of the scores along that component.
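In the same sketch, the eigenvalues can be obtained as variances of the score columns (the scores are centered because \(\mathbf{X}\) is centered), and the score distances follow directly:

```r
# eigenvalues: variances of the score columns
lambda <- colSums(T^2) / (nrow(X) - 1)

# score distance h for every object: squared scores scaled by the
# eigenvalues and summed along every row
h <- rowSums(sweep(T^2, 2, lambda, "/"))
head(h)
```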
Both the score and the orthogonal distances are important statistics that allow one to assess how well objects are described by a PCA model. They can be assessed visually using the distance plot: a scatter plot where the orthogonal distance is plotted against the score distance for a particular number of components.
Both distances can be described using theoretical distributions, which helps to identify regular and extreme objects as well as outliers. All details will be shown later in this tutorial.
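In mdatools this plot is produced by the `plotResiduals()` method for PCA models, which is discussed in detail later in this chapter; here is a quick example using the `people` dataset shipped with the package (the number of components is an arbitrary choice):

```r
library(mdatools)

# calibrate a PCA model and show the distance plot for two components
data(people)
m <- pca(people, ncomp = 4, scale = TRUE)
plotResiduals(m, ncomp = 2)
```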
There are several methods for computing PCA loadings, including Singular Value Decomposition (SVD) and Non-linear Iterative Partial Least Squares (NIPALS). Both methods are implemented in this package and can be selected using the `method` argument of the main function, `pca()`. By default, SVD is used. In addition, one can use a randomized version of the two methods, which can be efficient if the data matrix contains a large number of rows. This is explained in the last part of this chapter.
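As a brief preview, the method can be selected when calibrating the model (the `people` dataset from the previous example is reused here; the arguments for the randomized version are covered in the corresponding section):

```r
m.svd <- pca(people, ncomp = 5)                       # SVD (the default)
m.nipals <- pca(people, ncomp = 5, method = "nipals") # NIPALS
```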