Multivariate analysis and correlation iconography

Introduction

Correlation iconography is a little-known method for studying multivariate data.
It was developed a long time ago by Michel Lesty:
Lesty M (1999) Une nouvelle approche dans le choix des régresseurs de la régression multiple en présence d'interactions et de colinéarités. La revue de Modulad 22:41-77
It is also well described on a French Wikipedia page.
Having explored what this method can do, I think it deserves more attention.

Let us take the example from the Wikipedia page (Élève = student, Poids = weight, Âge = age, Assiduité = attendance, Note = grade):
dta <- read.table(text=gsub(",", ".", "Élève Poids Âge Assiduité Note
e1 52 12 12 5
e2 59 12,5 9 5
e3 55 13 15 9
e4 58 14,5 5 5
e5 66 15,5 11 13,5
e6 62 16 15 18
e7 63 17 12 18
e8 69 18 9 18"), header=TRUE)

> dta
  Élève Poids  Âge Assiduité Note
1    e1    52 12.0        12  5.0
2    e2    59 12.5         9  5.0
3    e3    55 13.0        15  9.0
4    e4    58 14.5         5  5.0
5    e5    66 15.5        11 13.5
6    e6    62 16.0        15 18.0
7    e7    63 17.0        12 18.0
8    e8    69 18.0         9 18.0

Principal Component Analysis

First, let us do a classical PCA:
library(FactoMineR)
par(mar=c(4, 4, 2, 2)+0.4)
res.pca <- PCA(dta, quali.sup = 1)

What can we say from this plot?
Âge, Poids and Note are strongly correlated and account for 68% of the variance, while Assiduité accounts for 29%.
Great... but what should we conclude? It is not clear.
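
The two percentages can also be checked numerically instead of being read off the plot. The sketch below is a cross-check with base R's prcomp rather than FactoMineR. It assumes a scaled PCA (which is what FactoMineR's PCA does by default), and the data frame name dta2 and the unaccented column names are my own substitutions, used only to keep the snippet portable.

```r
# Same data as above; dta2 and the unaccented names are assumptions of this sketch
dta2 <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Scaled PCA on the four numeric variables
pca <- prcomp(dta2, center = TRUE, scale. = TRUE)

# Share of the total variance carried by each axis, in percent
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * prop_var, 1))

# Loadings of the variables on the first two axes
print(round(pca$rotation[, 1:2], 2))
```

The loadings show which variables drive each axis, but they still say nothing about which variable is linked to which, and that is the gap the rest of this post tries to fill.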

General Linear Model

A glm analysis is also possible, but it only describes the relationship between one response variable and the others. That is too restrictive, in my view.
Still, it gives some interesting information:
> g <- glm(formula = Note ~ Poids + Âge + Assiduité, data=dta)
> summary(g)

Call:
glm(formula = Note ~ Poids + Âge + Assiduité, data = dta)

Deviance Residuals: 
      1        2        3        4        5        6        7        8  
 0.3322   0.5840  -0.8702  -0.3329  -0.3567   0.3574   0.4874  -0.2013  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40.36219    3.53877 -11.406 0.000337 ***
Poids         0.16377    0.10067   1.627 0.179116    
Âge           2.20868    0.25639   8.614 0.000998 ***
Assiduité     0.83417    0.07934  10.514 0.000463 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.4631679)

    Null deviance: 263.7188  on 7  degrees of freedom
Residual deviance:   1.8527  on 4  degrees of freedom
AIC: 21.001

Number of Fisher Scoring iterations: 2
> library(MASS)
> stepAIC(g)
Start:  AIC=21
Note ~ Poids + Âge + Assiduité

            Df Deviance    AIC
<none>            1.853 21.001
- Poids      1    3.078 23.063
- Âge        1   36.224 42.785
- Assiduité  1   53.049 45.837

The p-value suggests that Poids has no significant effect, yet when Poids is removed the model gets worse (the AIC rises from 21.001 to 23.063). It is not easy to conclude anything about the effect of Poids! This is a classic side effect of AIC when the number of observations is of the same order as the number of effects.
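
One way to see this small-sample effect is to compare AIC with its small-sample correction, AICc = AIC + 2k(k+1)/(n-k-1). The sketch below is only an illustration of that standard formula: it refits the two models with lm (equivalent to a gaussian glm here), and the helper aicc, the data frame dta2 and the unaccented column names are my own assumptions, not part of the analysis above.

```r
# Same data as above; dta2 and the unaccented names are assumptions of this sketch
dta2 <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

full    <- lm(Note ~ Poids + Age + Assiduite, data = dta2)
reduced <- lm(Note ~ Age + Assiduite, data = dta2)

# Hypothetical helper: small-sample corrected AIC
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # parameters counted by logLik, residual variance included
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

# AIC and AICc side by side for the two models
print(c(AIC_full = AIC(full), AIC_reduced = AIC(reduced)))
print(c(AICc_full = aicc(full), AICc_reduced = aicc(reduced)))
```

With only 8 observations the correction term is large, and the two criteria rank the models in opposite order: AIC prefers to keep Poids, AICc prefers to drop it.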

Note that you are very limited in terms of the interactions that can be set up in the model. There are only 8 Élèves, so you cannot fit more than 8 terms, interactions included.
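
To see how quickly this limit is reached, one can count the columns of the design matrix for a model with all interactions between the three predictors. This is a minimal sketch in base R; dta2 and the unaccented column names are my own substitutions.

```r
# Same data as above; dta2 and the unaccented names are assumptions of this sketch
dta2 <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Design matrix with all main effects and interactions of the three predictors
X <- model.matrix(~ Poids * Age * Assiduite, data = dta2)

# Intercept + 3 main effects + 3 two-way + 1 three-way interaction
print(colnames(X))
print(c(observations = nrow(X), columns = ncol(X)))
```

The design matrix already has as many columns as there are Élèves, so such a model would leave no degrees of freedom for the residuals.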

Correlation Iconography

The correlation iconography method is based on comparing the correlation matrix with the partial correlation matrices.
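
To make the idea of a partial correlation concrete, here is a minimal base R sketch using the standard first-order formula. It is not the HelpersMG code: the helper partial_cor, the data frame dta2 and the unaccented column names are my own assumptions.

```r
# Same data as above; dta2 and the unaccented names are assumptions of this sketch
dta2 <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Hypothetical helper: partial correlation of x and y once z is held constant,
# computed from a correlation matrix r
partial_cor <- function(r, x, y, z) {
  (r[x, y] - r[x, z] * r[y, z]) /
    sqrt((1 - r[x, z]^2) * (1 - r[y, z]^2))
}

r <- cor(dta2)

total   <- r["Poids", "Note"]                      # total correlation
partial <- partial_cor(r, "Poids", "Note", "Age")  # correlation with Age removed
print(round(c(total = total, partial = partial), 2))
```

Poids and Note have a total correlation of about 0.77, but once Âge is held constant the partial correlation drops to about -0.08. The apparent link between Poids and Note is therefore carried by Âge, and this is the kind of contrast the method exploits.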

library("HelpersMG")
df <- IC_clean_data(dta, debug = TRUE)
cor_matrix <- IC_threshold_matrix(data=df, threshold = NULL, progress=FALSE)
cor_threshold <- IC_threshold_matrix(data=cor_matrix, threshold = 0.3)
par(mar=c(1,1,1,1))
set.seed(4)
plot(cor_threshold)



Let us interpret the graph:
Note is strongly linked to Âge, and also to Assiduité, but less strongly.
Poids is linked to Âge but not to Note.

Rather easy to read!
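
The links read off the graph can be cross-checked by hand. The sketch below encodes my reading of the rule described for the method: a link between two variables is kept only if their total correlation reaches the threshold and every partial correlation given one other variable also reaches it with the same sign. It is not the HelpersMG implementation, and partial_cor, keep_link, dta2 and the unaccented column names are my own assumptions.

```r
# Same data as above; dta2 and the unaccented names are assumptions of this sketch
dta2 <- data.frame(
  Poids     = c(52, 59, 55, 58, 66, 62, 63, 69),
  Age       = c(12, 12.5, 13, 14.5, 15.5, 16, 17, 18),
  Assiduite = c(12, 9, 15, 5, 11, 15, 12, 9),
  Note      = c(5, 5, 9, 5, 13.5, 18, 18, 18)
)

# Hypothetical helper: first-order partial correlation of x and y given z
partial_cor <- function(r, x, y, z) {
  (r[x, y] - r[x, z] * r[y, z]) /
    sqrt((1 - r[x, z]^2) * (1 - r[y, z]^2))
}

# Hypothetical helper: TRUE if the link x-y survives the total and partial tests
keep_link <- function(r, x, y, threshold = 0.3) {
  total <- r[x, y]
  if (abs(total) < threshold) return(FALSE)
  others <- setdiff(colnames(r), c(x, y))
  partials <- vapply(others, function(z) partial_cor(r, x, y, z), numeric(1))
  all(abs(partials) >= threshold & sign(partials) == sign(total))
}

r <- cor(dta2)
pairs <- t(combn(colnames(r), 2))  # one row per pair of variables
links <- apply(pairs, 1, function(p) keep_link(r, p[1], p[2]))

# Pairs of variables whose link is kept at threshold 0.3
print(pairs[links, , drop = FALSE])
```

With the threshold of 0.3, this keeps three links (Poids-Age, Age-Note and Assiduite-Note), which matches the reading of the graph given above.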

Now let us add the Élève column. As this method is based on correlations, only numeric values can be used, so new indicator columns called "instants" are added, one per Élève:

instant <- matrix(rep(0, nrow(df)*nrow(df)), nrow=nrow(df))
diag(instant) <- 1
colnames(instant) <- dta[, "Élève"]

(df <- cbind(df, instant))

  Poids  Âge Assiduité Note e1 e2 e3 e4 e5 e6 e7 e8
1    52 12.0        12  5.0  1  0  0  0  0  0  0  0
2    59 12.5         9  5.0  0  1  0  0  0  0  0  0
3    55 13.0        15  9.0  0  0  1  0  0  0  0  0
4    58 14.5         5  5.0  0  0  0  1  0  0  0  0
5    66 15.5        11 13.5  0  0  0  0  1  0  0  0
6    62 16.0        15 18.0  0  0  0  0  0  1  0  0
7    63 17.0        12 18.0  0  0  0  0  0  0  1  0
8    69 18.0         9 18.0  0  0  0  0  0  0  0  1

And let us run the analysis again with these "instants":

ic0 <- IC_threshold_matrix(data = df)
cor_threshold <- IC_threshold_matrix(data=ic0, threshold = 0.3)
par(mar=c(1,1,1,1))
set.seed(4)
library("igraph")

plot(IC_correlation_simplify(matrix=cor_threshold), 
     show.legend.strength = FALSE, show.legend.direction = FALSE)


The relationship between Assiduité and Note rests mainly on one Élève, e6, who is quite special. This explains why the relationship is weak.

Next

Another advantage of correlation iconography is that it accepts missing values.
Other methods worth exploring are Bayesian networks and PERMANOVA... later!
