Multivariate analysis and correlation iconography
Introduction
Correlation iconography is a little-known method for studying multivariate data. It was developed long ago by Michel Lesty:
Lesty M (1999) Une nouvelle approche dans le choix des régresseurs de la régression multiple en présence d'intéractions et de colinearités. La revue de Modulad 22:41-77
It is also well described in a French Wikipedia page:
After checking the possibilities of this method, I think that it deserves more attention.
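Let's take the example from the Wikipedia page. The column names are kept in French: Élève = student, Poids = weight, Âge = age, Assiduité = attendance, Note = grade.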
dta <- read.table(text=gsub(",", ".", "Élève Poids Âge Assiduité Note
e1 52 12 12 5
e2 59 12,5 9 5
e3 55 13 15 9
e4 58 14,5 5 5
e5 66 15,5 11 13,5
e6 62 16 15 18
e7 63 17 12 18
e8 69 18 9 18"), header=TRUE)
> dta
  Élève Poids  Âge Assiduité Note
1    e1    52 12.0        12  5.0
2    e2    59 12.5         9  5.0
3    e3    55 13.0        15  9.0
4    e4    58 14.5         5  5.0
5    e5    66 15.5        11 13.5
6    e6    62 16.0        15 18.0
7    e7    63 17.0        12 18.0
8    e8    69 18.0         9 18.0
Principal Component Analysis
First, let's do a classical PCA:
library(FactoMineR)
par(mar=c(4, 4, 2, 2)+0.4)
res.pca <- PCA(dta, quali.sup = 1)
What can we say from this plot?
Âge, Poids and Note are strongly correlated and explain 68% of the variance, while Assiduité explains 29% of the variance.
Great... but what should we conclude? It is not clear.
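The percentages above can be cross-checked without FactoMineR. The following is only a sketch using base R's prcomp(): the data frame d and its ASCII column names are my own stand-ins for dta, and scale. = TRUE is used because PCA() standardises the variables by default.

```r
# Same data as in the post, rebuilt with ASCII column names
# (an assumption made only to keep the sketch portable).
d <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# PCA on centred and standardised variables
pca <- prcomp(d, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each axis
prop <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * prop, 1))

# Loadings of the variables on the first two axes
print(round(pca$rotation[, 1:2], 2))
```

The first two values of prop are the ones to compare with the percentages printed on the axes of the PCA plot.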
General Linear Model
It is possible to fit a GLM, but it only describes the relationship between one response variable and the others. That is too restrictive in my opinion.
Anyway, it gives some interesting information:
> g <- glm(formula = Note ~ Poids + Âge + Assiduité, data=dta)
> summary(g)
Call:
glm(formula = Note ~ Poids + Âge + Assiduité, data = dta)
Deviance Residuals:
1 2 3 4 5 6 7 8
0.3322 0.5840 -0.8702 -0.3329 -0.3567 0.3574 0.4874 -0.2013
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -40.36219    3.53877 -11.406 0.000337 ***
Poids         0.16377    0.10067   1.627 0.179116
Âge           2.20868    0.25639   8.614 0.000998 ***
Assiduité     0.83417    0.07934  10.514 0.000463 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.4631679)
Null deviance: 263.7188 on 7 degrees of freedom
Residual deviance: 1.8527 on 4 degrees of freedom
AIC: 21.001
Number of Fisher Scoring iterations: 2
> library(MASS)
> stepAIC(g)
Start: AIC=21
Note ~ Poids + Âge + Assiduité
            Df Deviance    AIC
<none>            1.853 21.001
- Poids      1    3.078 23.063
- Âge        1   36.224 42.785
- Assiduité  1   53.049 45.837
The p-value suggests that Poids has no significant effect (p = 0.18), yet when Poids is removed the model gets worse (the AIC rises from 21.0 to 23.1). It is not easy to conclude anything about the effect of Poids! This is a classic side effect of AIC when the number of observations is of the same order as the number of effects.
Note also that you are very limited in terms of the interactions that can be set up in the model. There are only 8 students (Élèves), so you cannot estimate more than 8 coefficients, interactions included.
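Both points can be checked with a few lines of base R. This is a sketch under my own assumptions: the data frame d uses ASCII column names instead of those of dta, and the object names full, reduced and X are only illustrative.

```r
# Same data as in the post, with ASCII column names (assumption).
d <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Full model and the same model without Poids
full    <- glm(Note ~ Poids + Age + Assiduite, data = d)
reduced <- update(full, . ~ . - Poids)

# p-value of Poids in the full model, and AIC of both models
p_poids <- coef(summary(full))["Poids", "Pr(>|t|)"]
aic <- c(full = AIC(full), reduced = AIC(reduced))
print(p_poids)
print(aic)

# Design matrix with all interactions between the three predictors:
# intercept + 3 main effects + 3 two-way + 1 three-way interaction
X <- model.matrix(~ Poids * Age * Assiduite, data = d)
print(c(observations = nrow(d), coefficients = ncol(X)))
```

With three predictors and all their interactions, the design matrix already has as many columns as there are students, so no degree of freedom is left for the residuals.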
Correlation Iconography
The correlation iconography method is based on comparing the correlation matrix with the partial correlation matrices: a link between two variables is drawn only if it is not explained away by a third variable.
library("HelpersMG")
df <- IC_clean_data(dta, debug = TRUE)
cor_matrix <- IC_threshold_matrix(data=df, threshold = NULL, progress=FALSE)
cor_threshold <- IC_threshold_matrix(data=cor_matrix, threshold = 0.3)
par(mar=c(1,1,1,1))
set.seed(4)
plot(cor_threshold)
Let's interpret the graph:
Note is strongly linked to Âge and, less strongly, to Assiduité.
Poids is linked to Âge but not to Note.
Rather easy to read!
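To see where these links come from, here is a minimal base R sketch of the selection rule as I understand it from the Wikipedia description: a link is kept if the total correlation reaches the threshold and every first-order partial correlation, given each other variable, also reaches it with the same sign. The helpers pcor1() and keep_link() are hypothetical names of mine, the ASCII column names are an assumption, and this is not the HelpersMG implementation.

```r
# Same data as in the post, with ASCII column names (assumption).
d <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)
r <- cor(d)

# First-order partial correlation of x and y given z (hypothetical helper)
pcor1 <- function(r, x, y, z) {
  (r[x, y] - r[x, z] * r[y, z]) /
    sqrt((1 - r[x, z]^2) * (1 - r[y, z]^2))
}

# Keep the link x-y only if the total correlation and all partial
# correlations are above the threshold and share the same sign.
keep_link <- function(r, x, y, threshold = 0.3) {
  if (abs(r[x, y]) < threshold) return(FALSE)
  others <- setdiff(colnames(r), c(x, y))
  p <- vapply(others, function(z) pcor1(r, x, y, z), numeric(1))
  all(abs(p) >= threshold & sign(p) == sign(r[x, y]))
}

# Apply the rule to every pair of variables
pairs <- t(combn(colnames(r), 2))
links <- data.frame(
  from = pairs[, 1],
  to   = pairs[, 2],
  r    = round(r[pairs], 3),
  keep = mapply(function(x, y) keep_link(r, x, y),
                pairs[, 1], pairs[, 2], USE.NAMES = FALSE)
)
print(links)
```

In this sketch the total correlation between Poids and Note is high, but it nearly vanishes once Âge is taken into account, so that link is dropped.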
Now let's add the Élève column. As this method is based on correlations, only numeric values can be used. New indicator columns, named "instant", are therefore added, one per student:
instant <- matrix(rep(0, nrow(df)*nrow(df)), nrow=nrow(df))
diag(instant) <- 1
colnames(instant) <- dta[, "Élève"]
(df <- cbind(df, instant))
  Poids  Âge Assiduité Note e1 e2 e3 e4 e5 e6 e7 e8
1    52 12.0        12  5.0  1  0  0  0  0  0  0  0
2    59 12.5         9  5.0  0  1  0  0  0  0  0  0
3    55 13.0        15  9.0  0  0  1  0  0  0  0  0
4    58 14.5         5  5.0  0  0  0  1  0  0  0  0
5    66 15.5        11 13.5  0  0  0  0  1  0  0  0
6    62 16.0        15 18.0  0  0  0  0  0  1  0  0
7    63 17.0        12 18.0  0  0  0  0  0  0  1  0
8    69 18.0         9 18.0  0  0  0  0  0  0  0  1
And let's run the analysis again with these "instant" columns:
ic0 <- IC_threshold_matrix(data = df)
cor_threshold <- IC_threshold_matrix(data=ic0, threshold = 0.3)
par(mar=c(1,1,1,1))
set.seed(4)
library("igraph")
plot(IC_correlation_simplify(matrix=cor_threshold),
show.legend.strength = FALSE, show.legend.direction = FALSE)
The relationship between Assiduité and Note was based mainly on one student, e6, who was quite special. This explains why the relationship was weak.
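The role of e6 can be illustrated in base R. The sketch below assumes the usual first-order partial correlation formula and a hypothetical helper pcor1(). It shows that the partial correlation given the "instant" e6 is the same as the correlation computed without that student.

```r
# Assiduité and Note from the post, plus the indicator ("instant") of e6
assiduite <- c(12, 9, 15, 5, 11, 15, 12, 9)
note      <- c(5, 5, 9, 5, 13.5, 18, 18, 18)
e6        <- as.numeric(seq_along(note) == 6)

# First-order partial correlation of x and y given z (hypothetical helper)
pcor1 <- function(x, y, z) {
  rxy <- cor(x, y)
  rxz <- cor(x, z)
  ryz <- cor(y, z)
  (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
}

r_total      <- cor(assiduite, note)           # all 8 students
r_given_e6   <- pcor1(assiduite, note, e6)     # controlling for e6
r_without_e6 <- cor(assiduite[-6], note[-6])   # e6 left out

print(round(c(total = r_total, given_e6 = r_given_e6,
              without_e6 = r_without_e6), 3))
```

Once e6 is controlled for, the correlation between Assiduité and Note falls below the 0.3 threshold used above.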
Next
Another advantage of correlation iconography is that missing values are accepted.
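How HelpersMG handles missing values internally is not shown here. One common way to get a correlation matrix from incomplete data in base R is pairwise deletion, sketched below. The NA is introduced artificially for illustration and the ASCII column names are an assumption of mine.

```r
# Same data as in the post, with ASCII column names (assumption).
d <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Pretend that the weight of the second student was not recorded
d$Poids[2] <- NA

# Default behaviour: correlations involving Poids become NA
r_default <- cor(d)

# Pairwise deletion: each correlation uses the rows where both
# variables are available
r_pairwise <- cor(d, use = "pairwise.complete.obs")
print(round(r_pairwise, 3))
```

With pairwise deletion, the correlations that do not involve Poids are still computed on all 8 students.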
Other methods that should be explored are Bayesian networks and PERMANOVA... later!