Understanding PLS: A Computational Walkthrough
I used PLS some time ago and recently wanted to improve on that work, so I first needed to understand it better. Here is the reading roadmap I put together:
PLS Mathematical Understanding Roadmap
1. Core PLS Algorithm (NIPALS)
- Article: “The NIPALS algorithm for PLS” by Wold et al. (about 5-7 pages)
- Focus: Step-by-step walkthrough of the iterative algorithm
2. Loading and Score Vectors
- Chapter 2 in “PLS Path Modeling with R” by Sanchez (10-15 pages)
- Focus: Mathematical interpretation of loadings and scores
3. Deflation Procedures
- Section 3 in “PLS Regression Methods” by Abdi (4-6 pages)
- Focus: Matrix deflation and orthogonalization techniques
4. Prediction Equations
- Article: “A mathematical view of PLS regression formulas” by Mevik & Wehrens (7-8 pages)
- Focus: Derivation of the final prediction equations
5. Various PLS Variants
- Paper: “A survey of partial least squares (PLS) methods, with emphasis on the two-block case” by Rosipal & Krämer (sections 2.1-2.3, about 10 pages)
- Focus: Mathematical differences between PLSR, SIMPLS, and other algorithms
6. Optimization Perspective
- Article: “PLS viewed through the lens of convex optimization” in Journal of Machine Learning Research (sections 2-3, about 8 pages)
- Focus: Understanding PLS as an optimization problem
Based on those readings, the rest of this post briefly explains how I think PLS works from a computational standpoint.
What is PLS and Why Do We Need It?
PLS (Partial Least Squares) addresses a common problem in data science, namely how to handle situations where:
- You have many predictor variables (X) that are highly correlated
- You want to predict one or more response variables (Y)
- The number of observations is potentially smaller than the number of variables
Traditional regression methods like OLS (Ordinary Least Squares) break down in these scenarios due to multicollinearity. PLS solves this by finding a low-dimensional representation that captures the essential relationships between X and Y.
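To make this concrete, here is a minimal sketch (a toy construction of my own with NumPy and scikit-learn, not taken from the readings above) of a setting with more correlated predictors than observations: the OLS normal equations are singular, yet a two-component PLS model fits well.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p = 20, 50                          # fewer observations than predictors
latent = rng.normal(size=(n, 2))       # two hidden factors drive both X and y
X = latent @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = latent @ np.array([1.5, -2.0]) + 0.05 * rng.normal(size=n)

# X.T @ X has rank at most n (= 20), far below p (= 50), so the OLS normal
# equations are singular; PLS instead summarizes X with 2 latent components.
pls = PLSRegression(n_components=2).fit(X, y)
print(pls.score(X, y))                 # in-sample R^2, expected to be close to 1
```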
Key Components of PLS
Let’s break down the key variables and components involved in PLS modeling:
Input Matrices
- X: Matrix of predictor variables (n × p), where n is the number of observations and p is the number of predictor variables
- Y: Matrix of response variables (n × q), where q is the number of response variables
Latent Variables
- T: X-scores (n × a), where a is the number of components
- U: Y-scores (n × a)
Loadings
- P: X-loadings (p × a)
- Q: Y-loadings (q × a)
Weights
- W: X-weights (p × a), used to calculate the scores
Inner Relationship
- B: Diagonal matrix of regression coefficients between U and T
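For orientation, the same objects show up under scikit-learn's own attribute names on a fitted PLSRegression model. A quick shape check on random data of my own (the mapping to the notation above is in the comments):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                   # n = 30 observations, p = 6 predictors
Y = rng.normal(size=(30, 2))                   # q = 2 responses

pls = PLSRegression(n_components=3).fit(X, Y)  # a = 3 components
print(pls.x_weights_.shape)                    # W: (p, a) -> (6, 3)
print(pls.x_loadings_.shape)                   # P: (p, a) -> (6, 3)
print(pls.y_loadings_.shape)                   # Q: (q, a) -> (2, 3)
print(pls.x_scores_.shape)                     # T: (n, a) -> (30, 3)
print(pls.y_scores_.shape)                     # U: (n, a) -> (30, 3)
```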
The Computational Process of PLS
Now, let’s walk through how PLS works computationally:
1. Initialization
We start by standardizing both X and Y matrices (mean-centering and often scaling to unit variance):
\[X_0 = X - \mathbf{1}\bar{X}^T\] \[Y_0 = Y - \mathbf{1}\bar{Y}^T\]
Where \(\mathbf{1}\) is a vector of ones and \(\bar{X}\) and \(\bar{Y}\) are the means of each column.
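In code this preprocessing is a one-liner per matrix. A minimal NumPy sketch with toy shapes of my own (scaling to unit variance would additionally divide by the column standard deviations):

```python
import numpy as np

def center(M):
    """Column-wise mean-centering; returns the centered matrix and the column means."""
    mean = M.mean(axis=0)
    return M - mean, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # n = 5 observations, p = 3 predictors
Y = rng.normal(size=(5, 2))      # q = 2 responses
X0, X_mean = center(X)           # X_0 = X - 1 * mean(X)^T
Y0, Y_mean = center(Y)           # Y_0 = Y - 1 * mean(Y)^T
```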
2. The NIPALS Algorithm (for each component a = 1, 2, …, A)
Step 1: Initialize u
We start by selecting an initial value for the Y-score vector u, typically the first column of Y:
\[\mathbf{u}_a = \mathbf{Y}_{a-1}[:,1]\]
Step 2: Calculate X-weights
The weights represent the direction in X-space that has maximum covariance with u:
\[\mathbf{w}_a = \frac{\mathbf{X}_{a-1}^T \mathbf{u}_a}{||\mathbf{X}_{a-1}^T \mathbf{u}_a||}\]
The normalization ensures that the weights have unit length.
Step 3: Calculate X-scores
The X-scores are a projection of X onto the weights:
\[\mathbf{t}_a = \mathbf{X}_{a-1} \mathbf{w}_a\]
Step 4: Calculate Y-weights
The Y-weights give the direction in Y-space most strongly associated with t:
\[\mathbf{q}_a = \frac{\mathbf{Y}_{a-1}^T \mathbf{t}_a}{||\mathbf{Y}_{a-1}^T \mathbf{t}_a||}\]
Step 5: Update Y-scores
The Y-scores are a projection of Y onto the Y-weights:
\[\mathbf{u}_a^{new} = \mathbf{Y}_{a-1} \mathbf{q}_a\]
Step 6: Check convergence
If \(\mathbf{u}_a^{new} \approx \mathbf{u}_a\) (within some tolerance), continue to step 7. Otherwise, set \(\mathbf{u}_a = \mathbf{u}_a^{new}\) and return to step 2.
Step 7: Calculate X-loadings
The X-loadings represent how the original X variables load onto the component:
\[\mathbf{p}_a = \frac{\mathbf{X}_{a-1}^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}\]
Step 8: Calculate inner relationship
The inner relationship links the X-scores to the Y-scores:
\[b_a = \frac{\mathbf{u}_a^T \mathbf{t}_a}{\mathbf{t}_a^T \mathbf{t}_a}\]
Step 9: Deflate X and Y
We remove the explained variance from both X and Y:
\[\mathbf{X}_a = \mathbf{X}_{a-1} - \mathbf{t}_a \mathbf{p}_a^T\] \[\mathbf{Y}_a = \mathbf{Y}_{a-1} - b_a \mathbf{t}_a \mathbf{q}_a^T\]
Step 10: Store results and iterate
Store the vectors \(\mathbf{w}_a\), \(\mathbf{t}_a\), \(\mathbf{p}_a\), \(\mathbf{q}_a\), \(\mathbf{u}_a\), and \(b_a\) and continue to the next component.
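Putting Steps 1 through 10 together, here is how I would transcribe the loop into NumPy. This is a sketch of my reading of the steps above (function and variable names are mine), not a reference implementation:

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """NIPALS PLS on mean-centered X (n x p) and Y (n x q).
    Returns the weights W, scores T, loadings P, Y-loadings Q,
    Y-scores U, and the inner-relation coefficients b."""
    Xa, Ya = X.copy(), Y.copy()
    n, p = X.shape
    q = Y.shape[1]
    W = np.zeros((p, n_components))
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    Q = np.zeros((q, n_components))
    U = np.zeros((n, n_components))
    b = np.zeros(n_components)
    for a in range(n_components):
        u = Ya[:, 0]                                 # Step 1: initialize u
        for _ in range(max_iter):
            w = Xa.T @ u
            w /= np.linalg.norm(w)                   # Step 2: X-weights (unit length)
            t = Xa @ w                               # Step 3: X-scores
            qa = Ya.T @ t
            qa /= np.linalg.norm(qa)                 # Step 4: Y-weights (unit length)
            u_new = Ya @ qa                          # Step 5: updated Y-scores
            if np.linalg.norm(u_new - u) < tol:      # Step 6: convergence check
                u = u_new
                break
            u = u_new
        pa = Xa.T @ t / (t @ t)                      # Step 7: X-loadings
        b[a] = (u @ t) / (t @ t)                     # Step 8: inner relationship
        Xa = Xa - np.outer(t, pa)                    # Step 9: deflate X
        Ya = Ya - b[a] * np.outer(t, qa)             #          deflate Y
        W[:, a], T[:, a], P[:, a] = w, t, pa         # Step 10: store results
        Q[:, a], U[:, a] = qa, u
    return W, T, P, Q, U, b
```

On the centered matrices from the initialization step, something like `nipals_pls(X0, Y0, n_components=2)` would return everything that is needed later for prediction.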
The Meaning Behind Each Component
Let’s understand what each calculated variable actually means:
X-weights (w)
- What they are: Directions in the X-space that maximize covariance with Y-scores
- Why they matter: They determine how the original variables are combined to form the PLS components
- Similarity to other models: Unlike PCA weights, which maximize variance in X alone, PLS weights take Y into account (a short comparison sketch follows this section)
X-scores (t)
- What they are: Projections of X onto the weight vectors
- Why they matter: They represent the new, compressed variables that capture the most relevant information in X for predicting Y
- Similarity to other models: Similar to principal components in PCA, but optimized for prediction rather than just explaining variance
X-loadings (p)
- What they are: Coefficients that express how the original X variables contribute to each component
- Why they matter: Help interpret what each PLS component represents
- Similarity to other models: Analogous to loadings in PCA but oriented toward prediction capability
Y-weights (q)
- What they are: Directions in Y-space that have maximum covariance with X-scores
- Why they matter: They show how response variables relate to the latent components
- Similarity to other models: No direct analog in PCA; unique to two-block methods like PLS
Y-scores (u)
- What they are: Projections of Y onto the Y-weight vectors
- Why they matter: They represent a compressed version of Y that is maximally correlated with the X-scores
- Similarity to other models: No direct analog in simpler models
Inner Relationship (b)
- What it is: Regression coefficient between X-scores and Y-scores
- Why it matters: Quantifies how strongly each X-component predicts its corresponding Y-component
- Similarity to other models: Conceptually similar to regression coefficients, but operates in the latent space
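To illustrate the X-weights point about PCA versus PLS, here is a small comparison of my own construction: the highest-variance direction in X is deliberately unrelated to y, so the first PCA direction chases it while the first PLS weight vector largely ignores it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n = 200
noise_dir = 5.0 * rng.normal(size=n)    # high-variance column, unrelated to y
signal = rng.normal(size=n)             # low-variance columns that actually drive y
X = np.column_stack([noise_dir, signal, signal + 0.1 * rng.normal(size=n)])
y = 3.0 * signal + 0.1 * rng.normal(size=n)

pca_dir = PCA(n_components=1).fit(X).components_[0]                   # maximizes variance of X
pls_dir = PLSRegression(n_components=1).fit(X, y).x_weights_[:, 0]    # maximizes covariance with y
print(np.round(pca_dir, 2))   # dominated by the first (noisy) column
print(np.round(pls_dir, 2))   # dominated by the two signal columns
```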
What the Flow Chart Looks Like
Inference in PLS: How New Data Flows Through the Model
When we want to use a trained PLS model to make predictions on new data, the data follows a specific path through the model. Let’s walk through this process step by step:
Step 1: Data Preparation
First, any new data must be preprocessed in exactly the same way as the training data:
- Mean-centering using the training data means
- Scaling (if applied during training) using the training data standard deviations
This ensures that the new data exists in the same mathematical space as the training data.
Step 2: Projection onto Latent Space
The centered (and possibly scaled) new data is projected onto the PLS components:
- We calculate the scores for the new observations using the weight matrices from training
- This projection transforms the original variables into the latent variables
The key equation is: \[\mathbf{T}_{new} = \mathbf{X}_{new,centered} \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}\]
Where:
- \(\mathbf{X}_{new,centered}\) is the centered new data
- \(\mathbf{W}\) is the weight matrix from training
- \(\mathbf{P}\) is the loading matrix from training
- The term \((\mathbf{P}^T\mathbf{W})^{-1}\) adjusts for the non-orthogonality of the weights
Step 3: Prediction in Latent Space
Once we have the scores, we use the inner relationship to predict the response:
- The scores are multiplied by the diagonal matrix of regression coefficients \(\mathbf{B}\)
- This gives us the predicted scores in Y-space
Step 4: Transformation Back to Original Space
Finally, we transform the predictions back to the original Y-space:
- The predicted scores are multiplied by the Y-loadings
- We add back the Y means to get the final predictions
The complete prediction equation is: \[\mathbf{Y}_{pred} = \mathbf{1}\bar{\mathbf{Y}}^T + \mathbf{T}_{new}\mathbf{B}\mathbf{Q}^T\]
Where:
- \(\bar{\mathbf{Y}}\) is the mean vector of the response variables
- \(\mathbf{T}_{new}\) are the scores for the new data
- \(\mathbf{B}\) is the diagonal matrix of regression coefficients
- \(\mathbf{Q}\) is the matrix of Y-loadings
This entire process can be combined into a single regression-like equation: \[\mathbf{Y}_{pred} = \mathbf{1}\bar{\mathbf{Y}}^T + \mathbf{X}_{new,centered} \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}\mathbf{B}\mathbf{Q}^T\]
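In code, the whole inference path collapses to a few matrix products. Here is a NumPy sketch assuming the training-time quantities (W, P, Q, the inner coefficients b, and the column means) were kept around, for instance from the NIPALS sketch earlier; the function name is mine:

```python
import numpy as np

def pls_predict(X_new, X_mean, Y_mean, W, P, Q, b):
    """Predict Y for new observations from a fitted PLS model."""
    Xc = X_new - X_mean                     # Step 1: center with the training means
    R = W @ np.linalg.inv(P.T @ W)          # adjust for non-orthogonal weights
    T_new = Xc @ R                          # Step 2: scores of the new observations
    B = np.diag(b)                          # Step 3: inner relationship (diagonal)
    return Y_mean + T_new @ B @ Q.T         # Step 4: back to the original Y-space
```

If scaling was applied during training, the new data would also be divided by the training standard deviations before the projection, and the predictions rescaled accordingly.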
The beauty of PLS is that this seemingly complex transformation actually simplifies the prediction process by focusing on the most relevant aspects of the data.
Walk-Through with a Concrete Example
Let’s trace the flow of data through a PLS model using a simple example with actual numbers. We’ll use a small dataset with:
- 5 observations
- 3 predictor variables (X)
- 2 response variables (Y)
Step 0: Our Raw Data
Here’s our raw data:
X matrix (5×3):
X = [
[4.0, 2.0, 0.0],
[2.0, 5.0, 1.0],
[7.0, 3.0, 2.0],
[3.0, 4.0, 1.5],
[6.0, 1.0, 0.5]
]
Y matrix (5×2):
Y = [
[9.0, 5.0],
[7.0, 6.5],
[15.0, 9.0],
[8.5, 7.0],
[11.0, 4.5]
]
Step 1: Mean-Centering the Data
First, we calculate the means:
X_means = [4.4, 3.0, 1.0]
Y_means = [10.1, 6.4]
Then we center the data:
X₀ (centered X):
X₀ = [
[-0.4, -1.0, -1.0],
[-2.4, 2.0, 0.0],
[2.6, 0.0, 1.0],
[-1.4, 1.0, 0.5],
[1.6, -2.0, -0.5]
]
Y₀ (centered Y):
Y₀ = [
[-1.1, -1.4],
[-3.1, 0.1],
[4.9, 2.6],
[-1.6, 0.6],
[0.9, -1.9]
]
Step 2: Extract the First Component (a=1)
Initialize u₁
We use the first column of Y₀:
u₁ = [-1.1, -3.1, 4.9, -1.6, 0.9]
Calculate X-weights (w₁)
w₁ = [0.733, -0.456, 0.504]
Calculate X-scores (t₁)
t₁ = [-0.3412, -2.6712, 2.4098, -1.2302, 1.8328]
Calculate Y-weights (q₁)
q₁ = [0.952, 0.307]
Calculate new Y-scores (u₁)
u₁ = [-1.454, -2.900, 5.622, -1.337, 0.069]
Calculate X-loadings (p₁)
p₁ = [0.974, -0.552, 0.068]
Calculate inner relationship (b₁)
b₁ = 1.317
Deflate X and Y
X₁ = [
[-0.068, -1.188, -0.977],
[0.202, 0.526, 0.182],
[0.253, 1.330, 0.836],
[-0.202, 0.321, 0.584],
[-0.185, -0.988, -0.625]
]
Y₁ = [
[-0.672, -1.262],
[0.250, 1.180],
[1.878, 1.626],
[-0.057, 1.097],
[-1.398, -2.641]
]
Step 3: Extract the Second Component (a=2)
We repeat the same process with the deflated matrices X₁ and Y₁. For brevity, I’ll just provide the final results:
w₂ = [0.106, 0.813, 0.581]
t₂ = [-1.515, 0.564, 1.587, 0.545, -1.180]
p₂ = [0.106, 0.813, 0.581]
q₂ = [0.544, 0.839]
b₂ = 1.551
Step 4: Making Predictions
Now, let’s say we have a new observation:
X_new = [5.0, 2.5, 1.2]
Center the new data:
X_new_centered = [0.6, -0.5, 0.2]
Calculate scores:
T_new = X_new_centered × W(PᵀW)⁻¹
where:
W = [
[0.733, 0.106],
[-0.456, 0.813],
[0.504, 0.581]
]
P = [
[0.974, 0.106],
[-0.552, 0.813],
[0.068, 0.581]
]
Computing this:
T_new = [0.7686, -0.0074]
Predict Y:
Y_pred = Y_means + T_new × B × Qᵀ
where:
B = [
[1.317, 0],
[0, 1.551]
]
Q = [
[0.952, 0.307],
[0.544, 0.839]
]
Computing:
Y_pred = [11.058, 6.701]
So our PLS model predicts Y values of [11.058, 6.701] for the new X observation [5.0, 2.5, 1.2].
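As a sanity check on a hand-worked example like this, the same numbers can be pushed through an off-the-shelf implementation. A scikit-learn sketch is below; scale=False keeps the preprocessing to mean-centering only, as in this walkthrough, though signs and rounding of intermediate quantities can still differ from the hand calculations.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

X = np.array([[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [7.0, 3.0, 2.0],
              [3.0, 4.0, 1.5], [6.0, 1.0, 0.5]])
Y = np.array([[9.0, 5.0], [7.0, 6.5], [15.0, 9.0], [8.5, 7.0], [11.0, 4.5]])

pls = PLSRegression(n_components=2, scale=False).fit(X, Y)
print(pls.predict(np.array([[5.0, 2.5, 1.2]])))   # compare with the prediction above
```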
Interpretation of Our Example
In this example:
- The first component (accounting for the largest covariance between X and Y):
- Is strongly positively influenced by X variable 1 (weight 0.733)
- Is negatively influenced by X variable 2 (weight -0.456)
- Is positively influenced by X variable 3 (weight 0.504)
- Strongly predicts the first Y variable (loading 0.952) and moderately predicts the second Y variable (loading 0.307)
- The second component (capturing remaining covariance):
- Is dominated by X variables 2 and 3 (weights 0.813 and 0.581)
- Has a smaller contribution from X variable 1 (weight 0.106)
- Predicts both Y variables (loadings 0.544 and 0.839)
- The prediction process:
- New observations are projected into the PLS latent space
- These projections are used to predict Y through the inner relationship
- The predictions are then transformed back to the original Y-space
This example demonstrates how PLS creates a low-dimensional representation (we reduced from 3 X-variables to 2 components) that effectively captures the relationship between X and Y, even in this small dataset.
Summary of the Data Flow
The key insight from following this numerical example is that PLS:
- Iteratively finds directions (weights) in X-space that are most predictive of Y
- Projects data onto these directions to create scores
- Establishes relationships between these projections
- Uses these relationships to make predictions for new data
This step-by-step walkthrough with actual numbers shows how data flows through the PLS model, transforming from raw observations to predictions via a series of mathematically optimized projections.