This note is for those familiar with principal components analysis (PCA) on a multivariate set of continuous variables. The terms loadings, scores, eigenvalues, and screeplot should be familiar.
The concept of PCA can be extended to nominal or ordinal categorical variables: that is, transforming a large set of variables to another set, ordered by decreasing amount of variability explained. The original variables must first be transformed to a numerical scale for this to work. This is referred to as Multivariate Analysis with Optimal Scaling (MVAOS) (De Leeuw et al., 2016).
MVAOS techniques quantify factors, replacing categorical values (in R terms, levels of factors) by real numbers, so that PCA can then be applied. Crucially, the transformations are computed along with the PCA, in order to maximize the variance explained by each component in turn, as explained in the references.
There are two kinds of categorical variables (R factors):
Ordinal variables: the order of the levels is meaningful, but the intervals between them are not necessarily equal. This is typical of the Likert scale, in which respondents to a survey question indicate their level of agreement or disagreement with a statement on a symmetric scale. In R, is.ordered(...) returns TRUE for these.
Nominal variables: the levels are not ordered. The nonlinear PCA procedure finds a transformation that best represents an ordering, based on the stepwise PCA; see the references for details. In R, is.ordered(...) returns FALSE for these. Logical variables, i.e., those for which is.logical(...) returns TRUE, are a special (binary) case of nominal variables. A short R sketch after this list illustrates the three cases.
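A minimal R sketch (with made-up data) illustrating the three cases:
# ordinal: an ordered factor, e.g. a 5-point Likert item
lik <- factor(c("1", "3", "5"), levels = as.character(1:5), ordered = TRUE)
is.ordered(lik)   # TRUE
# nominal: an unordered factor
eth <- factor(c("A", "B", "A"))
is.ordered(eth)   # FALSE
# logical: a binary special case of a nominal variable
own <- c(TRUE, FALSE, TRUE)
is.logical(own)   # TRUE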
In this technical note we use the Gifi R package for MVAOS. There are several other packages with somewhat different approaches; see the References.
“MVAOS is a linear multivariate technique in the sense that it makes linear combinations of transformed variables, and it is a nonlinear multivariate technique in the sense that these transformations are generally nonlinear.” (De Leeuw et al., 2016)
The aim is to minimize a loss function:
\[ \sigma(H, A) = \sum_{\ell = 1}^L \mathrm{SSQ} (\sum_{j=1}^m H_j A_{j \ell}) \] where \(\mathrm{SSQ}\) is the unweighted sum-of-squares for each of the \(L\) equation blocks, \(H\) is the matrix of transformed variables, and \(A\) is the pattern, containing the coefficients of the linear combinations.
The Gifi (“Multivariate Analysis with Optimal Scaling”) package follows the standard book (Gifi, 1990), re-implementing the Gifi FORTRAN programs for R.
require("Gifi")
## Loading required package: Gifi
The sample dataset is the “ABC Customer Satisfaction” survey from Kenett & Salini (2012). All questions are ordinal on a Likert scale.
Load the example data and display the variable names:
data(ABC)
names(ABC)
## [1] "satis" "satis2010" "best" "recom" "product" "equipment"
## [7] "sales" "technical" "training" "purchase" "pricing"
str(ABC)
## 'data.frame': 208 obs. of 11 variables:
## $ satis : Factor w/ 5 levels "1","2","3","4",..: 1 3 4 3 5 4 4 3 3 4 ...
## $ satis2010: Factor w/ 5 levels "1","2","3","4",..: 1 3 1 3 3 3 3 4 3 3 ...
## $ best : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ recom : Factor w/ 5 levels "1","2","3","4",..: 1 4 3 4 4 4 4 4 2 4 ...
## $ product : Factor w/ 5 levels "1","2","3","4",..: 1 4 2 4 4 4 3 4 2 3 ...
## $ equipment: Factor w/ 5 levels "1","2","3","4",..: 1 4 4 3 4 4 4 3 3 1 ...
## $ sales : Factor w/ 5 levels "1","2","3","4",..: 1 3 3 4 5 4 3 4 3 4 ...
## $ technical: Factor w/ 5 levels "1","2","3","4",..: 1 3 4 2 4 5 4 3 4 3 ...
## $ training : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 3 4 3 3 4 ...
## $ purchase : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 4 4 5 5 4 4 4 ...
## $ pricing : Factor w/ 5 levels "1","2","3","4",..: 1 3 3 2 3 4 3 3 2 2 ...
Notice all are R unordered factors, i.e. nominal. These are in fact ordinal, since they come from Likert scales. We will see how to treat them as both ordinal and nominal.
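(If we wanted the data frame itself to carry the ordering, a one-liner such as the following sketch would convert every column to an ordered factor; this is not required for princals, which has its own ordinal argument.)
# optional: convert all columns to ordered factors -- not needed for princals
ABC.ord <- as.data.frame(lapply(ABC, as.ordered))
is.ordered(ABC.ord$satis)   # TRUE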
Make a subset with fewer variables, for demonstration:
ABC6 <- ABC[,6:11]
We compute the MVAOS PCA using Gifi::princals. The default is two components, and the factors are treated as ordinal no matter what the R data type, as is typical of a Likert scale.
We perform the MVAOS PCA asking for two components.
(fitord <- princals(ABC6)) ## default is ordinal=TRUE
## Call:
## princals(data = ABC6)
##
## Loss value: 0.683
## Number of iterations: 64
##
## Eigenvalues: 2.943 0.857
summary(fitord)
##
## Loadings (cutoff = 0.1):
## Comp1 Comp2
## equipment 0.784
## sales 0.639 0.611
## technical 0.674 -0.294
## training 0.673 -0.464
## purchase 0.681 -0.234
## pricing 0.741 0.355
##
## Importance (Variance Accounted For):
## Comp1 Comp2
## Eigenvalues 2.9434 0.8566
## VAF 49.0565 14.2772
## Cumulative VAF 49.0600 63.3300
Here we see the proportion of transformed variance explained by each PC, in this case about 49% for PC1, 14% for PC2. About 37% is still unexplained.
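As a quick check (a sketch, using only the loadings stored in the fitted object): each eigenvalue is the column sum of squared loadings, and the VAF is the eigenvalue as a percentage of the number of variables:
# eigenvalues as column sums of squared loadings; VAF = eigenvalue / m
evals <- colSums(fitord$loadings^2)
round(100 * evals / ncol(ABC6), 2)   # about 49.06 and 14.28, as in the summary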
The transformation plots show how each original variable is transformed for use in the PCA.
plot(fitord, plot.type = "transplot")
These are all step functions, because the default is ordinal=TRUE.
We can see the transformations for the first few objects:
ABC6[1:6,]
## equipment sales technical training purchase pricing
## 1 1 1 1 2 2 1
## 2 4 3 3 3 3 3
## 3 4 3 4 4 4 3
## 4 3 4 2 3 4 2
## 5 4 5 4 3 4 3
## 6 4 4 5 3 5 4
fitord$transform[1:6,]
## equipment sales technical training purchase pricing
## 1 -0.17733412 -0.19817402 -0.17400949 -0.09857960 -0.12797516 -0.17009979
## 2 0.03820579 0.01201815 -0.01075632 -0.01962519 0.01464401 0.01936495
## 3 0.03820579 0.01201815 0.01295460 0.04640791 0.03538021 0.01936495
## 4 -0.02320855 0.03235367 -0.11656669 -0.01962519 0.03538021 -0.08020186
## 5 0.03820579 0.07356952 0.01295460 -0.01962519 0.03538021 0.01936495
## 6 0.03820579 0.03235367 0.07087356 -0.01962519 0.08741712 0.05786234
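Each level of a variable is mapped to a single quantified value. For example (a sketch), the step values for equipment can be recovered by grouping the transformed scores by the original levels:
# the (monotone) quantification of each observed level of "equipment"
tapply(fitord$transform[, "equipment"], ABC6$equipment, unique)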
We can see all the loadings, including those \(< 0.1\) not shown in the model summary:
round(fitord$loadings,3)
## D1 D2
## equipment 0.784 0.020
## sales 0.639 0.611
## technical 0.674 -0.294
## training 0.673 -0.464
## purchase 0.681 -0.234
## pricing 0.741 0.355
And then plot them; these loading plots are similar to those we would see for continuous variables.
plot(fitord, "loadplot", main = "Loadings Plot ABC Data") ## aspect ratio = 1
Here, PC1 is a “size” component, i.e., more or less satisfaction with all measures. Since the arrows point the same way in the loadings plot, satisfactions on all items are positively correlated.
Here PC2 contrasts sales/pricing (positive PC2) with training/technical (negative PC2). That is, after accounting for overall satisfaction, these two aspects of satisfaction are opposed. Satisfaction with “equipment” has essentially no loading on PC2; it is not part of this contrast.
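These relations can also be checked numerically; a sketch, using the transformed variables stored in the fitted object:
# correlations among the optimally-scaled variables; all positive, as the
# common direction of the loading arrows suggests
round(cor(fitord$transform), 2)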
The biplot shows the loading vectors along with the PC scores in the space of the first two PCs. This helps find unusual observations that do not fit the overall pattern of inter-variable relations.
plot(fitord, "biplot",
main = "Biplot ABC Data")
This shows where the individual cases fall in the PC space. For example, respondent 135 is overall quite dissatisfied (a far negative score on PC1), and relatively more satisfied with sales/pricing than with training/technical.
For example:
ABC6[135,]
## equipment sales technical training purchase pricing
## 135 2 3 2 1 1 2
This respondent gave low scores to all variables except sales, which received a neutral score.
ABC6[51,]
## equipment sales technical training purchase pricing
## 51 3 1 5 4 4 1
This respondent was quite satisfied with technical, training, and purchase, which go together in PC1/PC2 space, but not at all satisfied with sales and pricing, which also go together but are opposed to the other three in PC2 space.
The screeplot shows the relative amount of variance explained by each component.
plot(fitord, "screeplot")
This shows that 2 components are probably all that is needed.
We can check the association between two factors with cross-classification tables. Here we pick two that seem highly correlated in PC1/PC2 space.
(t.actual <- table(ABC6$purchase, ABC6$technical))
##
## 1 2 3 4 5
## 1 1 3 2 1 1
## 2 5 8 4 14 2
## 3 1 8 10 29 14
## 4 3 7 12 35 31
## 5 1 0 1 4 11
t.purchase <- table(ABC6[,"purchase"])
t.technical <- table(ABC6[,"technical"])
# expected counts under independence: product of the margins, divided by n
(t.expected <- round(outer(t.purchase, t.technical) / sum(t.actual)))
##
## 1 2 3 4 5
## 1 0 1 1 3 2
## 2 2 4 5 13 9
## 3 3 8 9 25 18
## 4 5 11 12 35 25
## 5 1 2 2 7 5
Difference between actual and expected:
print(t.actual - t.expected)
##
## 1 2 3 4 5
## 1 1 2 1 -2 -1
## 2 3 4 -1 1 -7
## 3 -2 0 1 4 -4
## 4 -2 -4 0 0 6
## 5 0 -2 -1 -3 6
Very close agreement, although with some discrepancy in class 5, “highly satisfied”.
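A formal check of independence (a sketch; some cells have small expected counts, so R will warn that the approximation may be inaccurate):
# Pearson chi-squared test of independence between purchase and technical
chisq.test(t.actual)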
We now repeat the analysis treating the variables as nominal, i.e., as if we cannot assume an ordering of the levels.
(fitnom <- princals(ABC6, ordinal=FALSE)) ## nominal PCA
## Call:
## princals(data = ABC6, ordinal = FALSE)
##
## Loss value: 0.683
## Number of iterations: 218
##
## Eigenvalues: 2.944 0.856
summary(fitnom)
##
## Loadings (cutoff = 0.1):
## Comp1 Comp2
## equipment 0.786
## sales 0.637 -0.619
## technical -0.675 -0.283
## training 0.673 0.466
## purchase 0.682 0.221
## pricing 0.740 -0.356
##
## Importance (Variance Accounted For):
## Comp1 Comp2
## Eigenvalues 2.9442 0.8563
## VAF 49.0703 14.2713
## Cumulative VAF 49.0700 63.3400
summary(fitord)
##
## Loadings (cutoff = 0.1):
## Comp1 Comp2
## equipment 0.784
## sales 0.639 0.611
## technical 0.674 -0.294
## training 0.673 -0.464
## purchase 0.681 -0.234
## pricing 0.741 0.355
##
## Importance (Variance Accounted For):
## Comp1 Comp2
## Eigenvalues 2.9434 0.8566
## VAF 49.0565 14.2772
## Cumulative VAF 49.0600 63.3300
The loadings, eigenvalues and proportion of variance accounted for (VAF) are not the same as for the ordinal case, although very close. Notice that the signs of the loadings for PC2 are reversed. The loss value is the same.
Notice that many more iterations were required to find a solution than were required for the ordinal case.
plot(fitnom, plot.type = "transplot", stepvec=rep(FALSE,6))
These are no longer step functions, and not necessarily monotonic (see “equipment”).
Look at the transformations of two of the variables:
“Technical”: “Overall satisfaction level from technical support. (1 … very low, 5 … very high)”. Here we treat it as nominal, although it really is ordinal. The transformation is now reversed in sign, but the sequence of the levels remains the same.
“Equipment”: “Overall satisfaction level from the equipment. (1 … very low, 5 … very high)”. Curiously, the transformation rates “low” worse than “very low”.
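We can compare the two quantifications of technical directly (a sketch, grouping the transformed scores by level as before); this makes the sign reversal obvious:
# level-by-level quantifications of "technical": ordinal vs. nominal fit
cbind(ordinal = tapply(fitord$transform[, "technical"], ABC6$technical, unique),
      nominal = tapply(fitnom$transform[, "technical"], ABC6$technical, unique))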
We can see all the loadings, including those \(< 0.1\) not shown in the model summary:
round(fitnom$loadings,3)
## D1 D2
## equipment 0.786 0.003
## sales 0.637 -0.619
## technical -0.675 -0.283
## training 0.673 0.466
## purchase 0.682 0.221
## pricing 0.740 -0.356
And plot them:
plot(fitnom, "loadplot", main = "Loadings Plot ABC Data") ## aspect ratio = 1
PC1 is a “size” component (overall satisfaction), but “technical” is opposite to this because its transformation no longer preserves the order; PC2 contrasts sales/pricing with training/purchase.
plot(fitnom, "biplot", main = "Biplot ABC Data")
Again we see the anomalies, such as observations 51 and 135.
plot(fitnom, "screeplot")
This is very similar to the ordinal case.
This is a messier dataset: a random sample from a real study, of which only a few variables are reported here. We do not say where they came from, and sensitive information has been recoded so that it cannot be related to the original values. The variables are five demographic indicators: age group, education level, employment type, marital status, and self-reported ethnicity; plus car ownership status.
load("NonlinearPCA_example.Rdata", verbose=TRUE)
## Loading objects:
## ds.test
dim(ds.test)
## [1] 320 6
names(ds.test)
## [1] "AgeGroup" "Ethnicity" "Education" "Mari" "Employment"
## [6] "Car"
summary(ds.test)
## AgeGroup Ethnicity Education Mari Employment Car
## A1: 32 Eth1: 68 E1:162 Married: 97 Full_time:171 No :170
## A2:174 Eth2: 38 E2:108 Other : 16 Student : 23 Yes1: 17
## A3: 72 Eth3:131 E3: 50 Single :207 Other : 31 Yes2:133
## A4: 39 Eth4: 42 Part_time: 90
## A5: 3 Eth0: 41 Retired : 5
There are 320 survey responses and 6 variables. We have arranged AgeGroup and Education to have a logical order, but not the others, which are just nominal categories.
We use nonlinear PCA to see how much redundancy there is in these variables, and their inter-relation.
With so many variables we choose to start with three components.
We specify a Boolean vector for the ordinal argument to princals, with the two factors AgeGroup and Education as ordinal and the others as nominal.
names(ds.test)
## [1] "AgeGroup" "Ethnicity" "Education" "Mari" "Employment"
## [6] "Car"
(fitreal <- princals(ds.test, ndim=3, ordinal=c(T,F,T,F,F,F)))
## Call:
## princals(data = ds.test, ndim = 3, ordinal = c(T, F, T, F, F,
## F))
##
## Loss value: 0.766
## Number of iterations: 57
##
## Eigenvalues: 1.985 1.175 1.053
summary(fitreal)
##
## Loadings (cutoff = 0.1):
## Comp1 Comp2 Comp3
## AgeGroup 0.781 -0.220 0.337
## Education 0.620 0.495
## Employment 0.801 0.298
## Ethnicity 0.885
## Mari -0.425 0.290 0.588
## Car 0.408 -0.707
##
## Importance (Variance Accounted For):
## Comp1 Comp2 Comp3
## Eigenvalues 1.9853 1.1748 1.053
## VAF 33.0887 19.5800 17.550
## Cumulative VAF 33.0900 52.6700 70.220
About 70% of the variance (of the transformed variables) is explained. The loadings show which variables are represented by each component. For the two ordinal variables, we can interpret the loadings as “more” or “less”, although the sign is arbitrary. For the nominal variables, we cannot interpret them as “more” or “less”, just that the variables (after transformation) go together in this PC space.
So in PC1, AgeGroup, Education and Employment go together, opposite to the transformed Mari.
PC2 is dominated by Ethnicity, which is absent from PC1; this has some correlation with Education in this component.
PC3 is dominated by Car, which is negatively related to Employment and AgeGroup.
Examine the details of the transformations and PCA.
plot(fitreal, plot.type = "transplot", stepvec=c(T,F,T,F,F,F))
Note the major jump in the transformation of AgeGroup after A1 (the youngest group). Note also the large discrepancy between the transformed value of Student and the other categories of Employment.
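The quantified level values can be listed directly (a sketch, as for the ABC data above):
# quantification of each AgeGroup level; note the jump after A1
tapply(fitreal$transform[, "AgeGroup"], ds.test$AgeGroup, unique)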
plot(fitreal, plot.type = "screeplot")
It is clear that more components would be needed to account for the large majority of the variance; this shows that these variables are fairly independent, which would be good if they are used as predictors in a multivariate regression. However, with three PCs we have reached the “elbow” of the screeplot.
We can see all the loadings, including those \(< 0.1\) not shown in the model summary:
round(fitreal$loadings,3)
## D1 D2 D3
## AgeGroup 0.781 -0.220 0.337
## Ethnicity 0.029 0.885 -0.041
## Education 0.620 0.495 0.061
## Mari -0.425 0.290 0.588
## Employment 0.801 -0.090 0.298
## Car 0.408 0.084 -0.707
Then plot them:
plot(fitreal, "loadplot", main = "Loadings Plot") ## aspect ratio = 1
plot(fitreal, "loadplot", main = "Loadings Plot", plot.dim=2:3) ## aspect ratio = 1
These show graphically the numeric values of the loadings from the model summary.
plot(fitreal, "biplot", main = "Biplot transit data")
plot(fitreal, "biplot", main = "Biplot transit data", plot.dim=c(2,3))
The numbers are individuals; there are too many to see all of them clearly.
Some are “unusual” in the PC space, for example:
ds.test[92,]
## AgeGroup Ethnicity Education Mari Employment Car
## 92 A1 Eth1 E1 Single Student No
A young (group A1) single student with the least education and no car.
We can examine details of the transformation for some interesting individuals. The original values:
cases <- c(92, 208, 37, 207, 205, 126)
fitreal$data[cases,]
## AgeGroup Ethnicity Education Mari Employment Car
## 92 A1 Eth1 E1 Single Student No
## 208 A2 Eth1 E1 Married Student No
## 37 A2 Eth1 E3 Married Full_time Yes2
## 207 A3 Eth2 E2 Single Student Yes1
## 205 A3 Eth2 E3 Single Full_time Yes1
## 126 A2 Eth3 E1 Single Other Yes2
Converted to numbers (R factor levels):
fitreal$datanum[cases,]
## AgeGroup Ethnicity Education Mari Employment Car
## [1,] 1 1 1 3 2 1
## [2,] 2 1 1 1 2 1
## [3,] 2 1 3 1 1 3
## [4,] 3 2 2 3 2 2
## [5,] 3 2 3 3 1 2
## [6,] 2 3 1 3 3 3
The transformed values for these individuals:
fitreal$transform[cases,]
## AgeGroup Ethnicity Education Mari Employment Car
## 92 -0.16747360 -0.09710945 -0.05439992 0.03035302 -0.18398372 -0.05210407
## 208 0.01610293 -0.09710945 -0.05439992 -0.08151339 -0.18398372 -0.05210407
## 37 0.01610293 -0.09710945 0.07569963 -0.08151339 0.03339244 0.05542639
## 207 0.02243198 0.07968197 0.04655376 0.03035302 -0.18398372 0.08741075
## 205 0.02243198 0.07968197 0.07569963 0.03035302 0.03339244 0.08741075
## 126 0.01610293 0.02967712 -0.05439992 0.03035302 -0.02973029 0.05542639
The scores of each object in PC space:
fitreal$objectscores[cases,]
## D1 D2 D3
## 92 -3.1451703 -0.8350931 -0.9437769
## 208 -1.4247147 -1.9421143 -1.0066304
## 37 1.2674147 -1.1236661 -1.0671522
## 207 -0.6841850 1.8515626 -1.5617919
## 205 1.0477315 1.7677667 -0.4325850
## 126 -0.3094668 0.1831522 -0.4982306
These scores could be used as transformed variables in a regression analysis:
ds.test$D1 <- fitreal$objectscores[,"D1"]
ds.test$D2 <- fitreal$objectscores[,"D2"]
ds.test$D3 <- fitreal$objectscores[,"D3"]
Note that by design these are uncorrelated:
round(cor(ds.test$D1, ds.test$D2), 10)
## [1] 0
round(cor(ds.test$D1, ds.test$D3), 10)
## [1] 0
round(cor(ds.test$D2, ds.test$D3), 10)
## [1] 0
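For illustration only (a sketch; note the circularity, since Car itself entered the PCA), the scores could serve as predictors of, say, a yes/no car-ownership indicator:
# sketch: logistic regression of car ownership (Yes1/Yes2 collapsed to TRUE)
# on the three uncorrelated component scores -- illustration only
ds.test$has.car <- (ds.test$Car != "No")
summary(glm(has.car ~ D1 + D2 + D3, data = ds.test, family = binomial))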
To see which levels of the predictors are associated in bivariate space, we use a cross-classification matrix.
For example: ethnicity vs. education level:
# actual
(t.ee <- table(ds.test[,"Ethnicity"], ds.test[,"Education"]))
##
## E1 E2 E3
## Eth1 46 16 6
## Eth2 16 13 9
## Eth3 50 51 30
## Eth4 22 18 2
## Eth0 28 10 3
The expected values are computed from the marginal totals:
# expected
t.eth <- table(ds.test[,"Ethnicity"])
t.edu <- table(ds.test[,"Education"])
(t.ee.expected <- round(outer(t.eth, t.edu)/sum(t.edu)))
##
## E1 E2 E3
## Eth1 34 23 11
## Eth2 19 13 6
## Eth3 66 44 20
## Eth4 21 14 7
## Eth0 21 14 6
Difference between actual and expected:
(t.ee - t.ee.expected)
##
## E1 E2 E3
## Eth1 12 -7 -5
## Eth2 -3 0 3
## Eth3 -16 7 10
## Eth4 1 4 -5
## Eth0 7 -4 -3
There is quite some difference: the actual values show a lack of independence. Specifically, Eth1 has more respondents than expected with education status E1, whereas Eth3 has fewer. So these two variables are not independent. We saw in the loadings plots that they went together in PC2, but not the reason for it.
Test whether the expected is different from the actual by a \(\chi^2\) test:
# different?
chisq.test(t.ee)
##
## Pearson's Chi-squared test
##
## data: t.ee
## X-squared = 29.266, df = 8, p-value = 0.0002848
As expected, we reject the null hypothesis of independence, with a very small probability (p ≈ 0.0003) that this rejection is an error.
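To see which cells drive the dependence, we can inspect the Pearson residuals of the test (a sketch):
# Pearson residuals: large positive values mean more observations than
# expected under independence (e.g. Eth1 x E1)
round(chisq.test(t.ee)$residuals, 1)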
Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley.
De Leeuw, J., Mair, P., & Groenen, P. (2016, September 26). Multivariate Analysis with Optimal Scaling. https://bookdown.org/jandeleeuw6/gifi/
Kenett, R. S., & Salini, S. (2012). Modern Analysis of Customer Surveys with Applications in R. New York: Wiley.
Linting, M., Meulman, J. J., Groenen, P. J. F., & van der Kooij, A. J. (2007). Nonlinear principal components analysis: Introduction and application. Psychological Methods, 12(3), 336–358. https://doi.org/10.1037/1082-989X.12.3.336
Some packages not used here:
Le, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1), 1–18.
Mair, P., & de Leeuw, J. (2010). A general framework for multivariate analysis with optimal scaling: The R package aspect. Journal of Statistical Software, 32(9), 1–23.
de Leeuw, J., & Mair, P. (2009). Gifi methods for optimal scaling in R: The package homals. Journal of Statistical Software, 31(4), 1–21.