R2 for binomial data is sensitive to the grouping scheme

Try using these data to calculate the Nagelkerke R2, first with the counts grouped by covariate value:

library(fmsb)   # provides NagelkerkeR2()
# x = number of successes, y = number of failures (5 trials per covariate value)
data_grouped <- data.frame(x=c(3, 4, 5), y=c(2, 1, 0))
cof_grouped <- c(10, 7, 2)
res_grouped <- glm(cbind(x, y) ~ cof_grouped, data=data_grouped, family=binomial())
summary(res_grouped)
or with the same data split into individual binary observations:

# each group of 5 trials is expanded into individual 0/1 observations
data_splited <- data.frame(x=c(1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1), 
                           y=1-c(1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1))
cof_splited <- rep(c(10, 7, 2), each=5)
res_splited <- glm(cbind(x, y) ~ cof_splited, data=data_splited, family=binomial())
summary(res_splited)

These data are exactly the same, and indeed the fitted model is the same: both calls return identical coefficients and standard errors.
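
You can check this directly (a quick comparison, assuming both models above have been fitted in the same session):

# same data, two parameterizations: the estimates agree
coef(summary(res_grouped))
coef(summary(res_splited))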

But...

> NagelkerkeR2(res_grouped)$R2

[1] 0.9538362

> NagelkerkeR2(res_splited)$R2

[1] 0.2879462

From the first value you would conclude that the link is very strong, while from the second you would conclude it is much weaker, even though both models are identical.
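
The discrepancy comes from the formula itself. The Nagelkerke R2 is the Cox & Snell R2 rescaled by its maximum possible value, and both terms depend on n, the number of rows seen by glm() (3 in the grouped fit, 15 in the split one), and on the null and residual deviances, which also change with the grouping. Here is a minimal sketch of this deviance-based computation (my reading of what fmsb::NagelkerkeR2() does, not an authoritative reimplementation):

# Nagelkerke R2 computed by hand from the deviances of a fitted glm,
# assuming the usual deviance-based definition
nagelkerke_by_hand <- function(fit) {
  n      <- nrow(fit$model)                                    # 3 rows (grouped) vs 15 rows (split)
  r2_cs  <- 1 - exp((fit$deviance - fit$null.deviance) / n)    # Cox & Snell R2
  r2_max <- 1 - exp(-fit$null.deviance / n)                    # its maximum possible value
  r2_cs / r2_max
}
nagelkerke_by_hand(res_grouped)   # should be close to 0.95, as above
nagelkerke_by_hand(res_splited)   # should be close to 0.29, as above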

Note that you have the same problem with an ordinary R2, here computed as the squared correlation between the observed proportions and the fitted values:

> cor(x = data_grouped$x/(data_grouped$x+data_grouped$y), y=res_grouped$fitted.values)^2

[1] 0.964555

> cor(x = data_splited$x/(data_splited$x+data_splited$y), y=res_splited$fitted.values)^2

[1] 0.1607592


This problem is also discussed here:

https://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression/

The author's conclusion is:

The low R squared for the individual binary data model reflects the fact that the covariate x does not enable accurate prediction of the individual binary outcomes. In contrast, x can give a good prediction for the number of successes in a large group of individuals.

See also:

Mittlböck M, Heinzl H (2001) A note on R2 measures for Poisson and logistic regression models when both models are applicable. Journal of Clinical Epidemiology 54: 99-103. doi: 10.1016/S0895-4356(00)00292-4

