1. Behavioral PLS¶
Running a behavioral PLS using pyls
is as simple as:
>>> import pyls
>>> out = pyls.behavioral_pls(X, Y)
What we call behavioral PLS in the pyls
package is actually the more
traditional form of PLS (and is generally not prefixed with “behavioral”). This
form of PLS, at its core, attempts to find shared information between two sets
of features derived from a common set of samples. However, as with all things,
there are a number of ever-so-slightly different kinds of PLS that exist in the
wild, so to be thorough we’re going to briefly explain the exact flavor
implemented here before diving into a more illustrative example.
1.1. What exactly do we mean by “behavioral PLS”?¶
Technical answer: pyls.behavioral_pls()
employs a symmetrical,
singular value decomposition (SVD) based form of PLS, and is sometimes referred
to as PLS-correlation (PLS-C), PLS-SVD, or, infrequently, EZ-PLS. Notably, it
is not the same as PLS regression (PLS-R).
Less technical answer: pyls.behavioral_pls()
is like performing a
principal components analysis (PCA) but when you have two related datasets,
each with multiple features.
1.1.1. Differences from PLS regression (PLS-R)¶
You can think of the differences between PLS-C and PLS-R similar to how you might consider the differences between a Pearson correlation and a simple linear regression. Though this analogy is an over-simplification, the primary difference to take away is that behavioral PLS (PLS-C) does not assess directional relationships between sets of data (e.g., X → Y), but rather looks at how the two sets generally covary (e.g., X ↔ Y).
To understand this a bit more we can walk through a detailed example.
1.2. An exercise in calisthenics¶
Note
Descriptions of PLS are almost always accompanied by a litany of equations, and for good reason: understanding how to interpret the results of a PLS requires at least a cursory understanding of the math behind it. As such, this example is going to rely on these equations, but will always do so in the context of real data. The hope is that this approach will help make the more abstract mathematical concepts a bit more concrete (and easier to apply to new data sets!).
We’ll start by loading the example dataset 1:
>>> from pyls.examples import load_dataset
>>> data = load_dataset('linnerud')
This is the same dataset as in sklearn.datasets.load_linnerud()
; the
formatting has just been lightly modified to better suit our purposes.
Our data
object can be treated as a dictionary, containing all the
information necessary to run a PLS analysis. The keys can be accessed as
attributes, so we can take a quick look at our input matrices
\(\textbf{X}\) and \(\textbf{Y}\):
>>> sorted(data.keys())
['X', 'Y', 'n_boot', 'n_perm']
>>> data.X.shape
(20, 3)
>>> data.X.head()
Chins Situps Jumps
0 5.0 162.0 60.0
1 2.0 110.0 60.0
2 12.0 101.0 101.0
3 12.0 105.0 37.0
4 13.0 155.0 58.0
The rows of our \(\textbf{X}_{n \times p}\) matrix here represent n subjects, and the columns indicate p different types of exercises these subjects were able to perform. So the first subject was able to do 5 chin-ups, 162 situps, and 60 jumping jacks.
>>> data.Y.shape
(20, 3)
>>> data.Y.head()
Weight Waist Pulse
0 191.0 36.0 50.0
1 189.0 37.0 52.0
2 193.0 38.0 58.0
3 162.0 35.0 62.0
4 189.0 35.0 46.0
The rows of our \(\textbf{Y}_{n \times q}\) matrix also represent n subjects (critically, the same subjects as in \(\textbf{X}\)), and the columns indicate q physiological measurements taken for each subject. That same subject referenced above thus has a weight of 191 pounds, a 36 inch waist, and a pulse of 50 beats per minute.
Behavioral PLS will attempt to establish whether a relationship exists between the exercises performed and these physiological variables. If we wanted to run the full analysis right away, we could do so with:
>>> from pyls import behavioral_pls
>>> results = behavioral_pls(**data)
If you’re comfortable with the down-and-dirty of PLS and want to go ahead and
start understanding the results
object, feel free to jump ahead to
PLS Results. Otherwise, read on for more about what’s happening behind
the scenes of behavioral_pls()
1.3. The cross-covariance matrix¶
Behavioral PLS works by decomposing the cross-covariance matrix
\(\textbf{R}_{q \times p}\) generated from the input matrices, where
\(\textbf{R} = \textbf{Y}^{T} \textbf{X}\). The results of PLS are a
bit easier to interpret when \(\textbf{R}\) is the cross-correlation matrix
instead of the cross-covariance matrix, which means that we should z-score each
feature in \(\textbf{X}\) and \(\textbf{Y}\) before multiplying them;
this is done automatically by the behavioral_pls()
function.
In our example, \(\textbf{R}\) ends up being a 3 x 3 matrix:
>>> from pyls.compute import xcorr
>>> R = xcorr(data.X, data.Y)
>>> R
Chins Situps Jumps
Weight -0.389694 -0.493084 -0.226296
Waist -0.552232 -0.645598 -0.191499
Pulse 0.150648 0.225038 0.034933
The \(q\) rows of this matrix correspond to the physiological measurements
and the \(p\) columns to the exercises. Examining the first row, we can see
that -0.389694
is the correlation between Weight
and Chins
across
all the subjects, -0.493084
the correlation between Weight
and
Situps
, and so on.
1.4. Singular value decomposition¶
Once we have generated our correlation matrix \(\textbf{R}\) we subject it to a singular value decomposition, where \(\textbf{R} = \textbf{USV}^{T}\):
>>> from pyls.compute import svd
>>> U, S, V = svd(R)
>>> U.shape, S.shape, V.shape
((3, 3), (3, 3), (3, 3))
The outputs of this decomposition are two arrays of left and right singular vectors (\(\textbf{U}_{p \times l}\) and \(\textbf{V}_{q \times l}\)) and a diagonal matrix of singular values (\(\textbf{S}_{l \times l}\)). The rows of \(\textbf{U}\) correspond to the exercises from our input matrix \(\textbf{X}\), and the rows of \(\textbf{V}\) correspond to the physiological measurements from our input matrix \(\textbf{Y}\). The columns of \(\textbf{U}\) and \(\textbf{V}\), on the other hand, represent new dimensions or components that have been “discovered” in the data.
The \(i^{th}\) columns of \(\textbf{U}\) and \(\textbf{V}\) weigh the contributions of these exercises and physiological measurements, respectively. Taken together, the \(i^{th}\) left and right singular vectors and singular value represent a latent variable, a multivariate pattern that weighs the original exercise and physiological measurements such that they maximally covary with each other.
The \(i^{th}\) singular value is proportional to the total exercise-physiology covariance accounted for by the latent variable. The effect size (\(\eta\)) associated with a particular latent variable can be estimated as the ratio of the squared singular value (\(\sigma\)) to the sum of all the squared singular values:
We can use the helper function pyls.compute.varexp()
to calculate this
for us:
>>> from pyls.compute import varexp
>>> pctvar = varexp(S)[0, 0]
>>> print('{:.4f}'.format(pctvar))
0.9947
Taking a look at the variance explained, we see that a whopping ~99.5% of the covariance between the exercises and physiological measurements in \(\textbf{X}\) and \(\textbf{Y}\) are explained by this latent variable, suggesting that the relationship between these variable can be effectively explained by a single dimension.
Examining the weights from the singular vectors:
>>> U[:, 0]
array([0.61330742, 0.7469717 , 0.25668519])
>>> V[:, 0]
array([-0.58989118, -0.77134059, 0.23887675])
we see that all the exercises (U[:, 0]
) are positively weighted, but that
the physiological measurements (V[:, 0]
) are split, with Weight
and
Waist
measurements negatively weighted and Pulse
positively weighted.
(Note that the order of the weights is the same as the order of the original
columns in our \(\textbf{X}\) and \(\textbf{Y}\) matrices.) Taken
together this suggests that, for the subjects in this dataset, individuals who
completed more of a given exercise tended to:
Complete more of the other exercises, and
Have a lower weight, smaller waist, and higher heart rate.
It is also worth examining how correlated the projections of the original variables on this latent variable are. To do that, we can multiply the original data matrices by the relevant singular vectors and then correlate the results:
>>> from scipy.stats import pearsonr
>>> XU = np.dot(data.X, U)
>>> YV = np.dot(data.Y, V)
>>> r, p = pearsonr(XU[:, 0], YV[:, 0])
>>> print('r = {:.4f}, p = {:.4f}'.format(r, p))
r = 0.4900, p = 0.0283
The correlation value of this latent variable (~ 0.49
) suggests that our
interpretation of the singular vectors weights, above, is only somewhat
accurate. We can think of this correlation (ranging from -1 to 1) as a proxy
for the question: “how often is this interpretation of the singular vectors
true?” Correlations closer to -1 indicate that the interpretation is largely
inaccurate across subjects, whereas correlations closer to 1 indicate the
interpretation is largely accurate across subjects.
1.5. Latent variable significance testing¶
Scientists love null-hypothesis significance testing, so there’s a strong urge for researchers doing these sorts of analyses to want to find a way to determine whether observed latent variables are significant(ly different from a specified null model). The issue comes in determining what aspect of the latent variables to test!
With behavioral PLS we assess whether the variance explained by a given latent variable is significantly different than would be expected by a null. Importantly, that null is generated by re-computing the latent variables from random permutations of the original data, generating a non-parametric distribution of explained variances by which to measure “significance.”