Comparison of models by AIC with or without log transformation on Y

You cannot directly compare the AIC or BIC of models fitted to two different data sets, i.e. 𝑌 and 𝑍 (where 𝑍 is a transformation of 𝑌). A comparison based on AIC or BIC is only valid when the models are fitted to the same data set. Have a look at Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (Burnham and Anderson, 2004). They make this point on page 81 (section 2.11.3, Transformations of the Response Variable):

Investigators should be sure that all hypotheses are modeled using the same response variable (e.g., if the whole set of models were based on log(y), no problem would be created; it is the mixing of response variables that is incorrect).

Akaike (1978, pg. 224) describes how the AIC can be adjusted in the presence of a transformed outcome variable to enable model comparison. He states: “the effect of transforming the variable is represented simply by the multiplication of the likelihood by the corresponding Jacobian to the AIC ... for the case of log{𝑩(𝑛)+1}, it is −2 ⋅∑log{𝑩(𝑛)+1}, where the summation extends over 𝑛=1,2,...,𝑁.”

Akaike, H. 1978. "On the likelihood of a time series model," Journal of the Royal Statistical Society, Series D (The Statistician), 27(3/4), pp. 217–235.

Let's do a toy example:

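# Toy data
seedrates <- data.frame(rate = c(50, 75, 100, 125, 150),
                        grain = c(21.2, 19.9, 19.2, 18.4, 17.9))
# Quadratic model for grain on its original scale
quad.lm <- lm(grain ~ poly(rate, 2), data = seedrates)
# Log-log model: the response is log(grain), not grain
loglin.lm <- lm(log(grain) ~ log(rate), data = seedrates)
# Raw AIC values (not yet comparable across the two responses)
AIC(quad.lm, loglin.lm)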

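The two values returned by this call are not directly comparable, because loglin.lm is fitted to log(grain) rather than grain. To put both on the grain scale, we need to add sum(2*log(seedrates$grain)) = 29.6 to the AIC for the loglinear model (or, equivalently for the comparison, subtract it from the AIC for the quadratic model).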
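Where this term comes from can be seen from a change of variables. As a sketch, write z = g(y) for the transformed response; the symbols g, f_Y, f_Z, L_Y and L_Z are notation introduced here for illustration, not taken from the sources quoted above.

```latex
% Density implied for y by a model fitted to z = g(y)
f_Y(y) = f_Z\bigl(g(y)\bigr)\,\lvert g'(y)\rvert

% Log-likelihood and AIC on the scale of y (same number of parameters k)
\log L_Y = \log L_Z + \sum_{i=1}^{N} \log\lvert g'(y_i)\rvert
\qquad\Longrightarrow\qquad
\mathrm{AIC}_Y = \mathrm{AIC}_Z - 2\sum_{i=1}^{N} \log\lvert g'(y_i)\rvert

% Special case g(y) = \log y, so g'(y) = 1/y
\mathrm{AIC}_Y = \mathrm{AIC}_Z + 2\sum_{i=1}^{N} \log y_i
```

With g = log, the last line is exactly the 2*sum(log(grain)) term added to the AIC of the model fitted to log(grain).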

# The matrix is filled column-wise, so only the AIC entry of the second row (loglin.lm) is changed
AIC(quad.lm, loglin.lm) + matrix(ncol=2, c(0,0,0, sum(2*log(seedrates$grain))))
          df  AIC
quad.lm    4 -4.1
loglin.lm  3 -7.6
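The adjusted comparison can also be reproduced without R. The following is a minimal sketch in Python using only the standard library. It assumes Gaussian errors and counts parameters as the regression coefficients plus one for the residual variance, which is how AIC() treats lm fits. The helpers ols_rss and gaussian_aic are hypothetical names written for this illustration, and the rescaled variable u is used only for numerical convenience (it spans the same quadratic model as poly(rate, 2)).

```python
import math

rate = [50, 75, 100, 125, 150]
grain = [21.2, 19.9, 19.2, 18.4, 17.9]
n = len(grain)


def ols_rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on the given columns.

    Solves the normal equations by Gauss-Jordan elimination with partial pivoting.
    """
    p = len(columns)
    m = len(y)
    # Augmented system [X'X | X'y]
    a = [
        [sum(columns[i][k] * columns[j][k] for k in range(m)) for j in range(p)]
        + [sum(columns[i][k] * y[k] for k in range(m))]
        for i in range(p)
    ]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [u - factor * v for u, v in zip(a[r], a[i])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(beta[j] * columns[j][k] for j in range(p)) for k in range(m)]
    return sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))


def gaussian_aic(rss, n_obs, n_coef):
    """AIC of a Gaussian linear model; the +1 counts the residual variance."""
    k = n_coef + 1
    return n_obs * (math.log(2 * math.pi * rss / n_obs) + 1) + 2 * k


ones = [1.0] * n

# Quadratic model for grain (u is rate centred and scaled; assumption for stability)
u = [(r - 100) / 25 for r in rate]
aic_quad = gaussian_aic(ols_rss([ones, u, [v * v for v in u]], grain), n, 3)

# Log-log model: response is log(grain), predictor is log(rate)
log_grain = [math.log(g) for g in grain]
log_rate = [math.log(r) for r in rate]
aic_loglin = gaussian_aic(ols_rss([ones, log_rate], log_grain), n, 2)

# Jacobian term that moves the log-scale AIC onto the grain scale
correction = 2 * sum(log_grain)
aic_loglin_adjusted = aic_loglin + correction

print(round(correction, 1), round(aic_quad, 1), round(aic_loglin_adjusted, 1))
```

If the sketch is right, the printed values should land close to the 29.6 correction and the two AIC entries in the table above, with the loglinear model still preferred after the adjustment.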

Take a look at https://stats.stackexchange.com/questions/48714/prerequisites-for-aic-model-comparison

From Ben Bolker's answer on the list, for a logit-transformed response:

We need -2 * sum(log( d( log(x/(1-x)), "x" )))

Being super-lazy and using sympy

from sympy import *
x = Symbol("x")
simplify(diff(log(x/(1-x)), x))
## -1/(x*(x-1)) = 1/(x*(1-x))

taking -2*log() of this we get

2*sum(log(x*(1-x)))


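# Treat grain/100 as a proportion and add its logit
seedrates <- data.frame(rate = c(50, 75, 100, 125, 150),
                        grain = c(21.2, 19.9, 19.2, 18.4, 17.9)) |>
  transform(pgrain = grain/100, logit_grain = qlogis(grain/100))

# Intercept-only models on the proportion scale and on the logit scale
m0 <- lm(pgrain ~ 1, data = seedrates)
m1 <- lm(logit_grain ~ 1, data = seedrates)

AIC(m0)
# AIC of the logit-scale model, moved onto the proportion scale
with(seedrates, AIC(m1) + 2*sum(log(pgrain*(1-pgrain))))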

These are actually slightly different because of Jensen's inequality (the predicted values mean(pgrain) and plogis(mean(logit_grain)) are not quite the same), but close enough that I think the computation is done correctly. 
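The logit comparison can be checked the same way. This is a small sketch in Python (standard library only), assuming Gaussian intercept-only models with two parameters each (the mean and the residual variance). The function intercept_only_aic and the variable names are hypothetical, introduced only for this illustration.

```python
import math

grain = [21.2, 19.9, 19.2, 18.4, 17.9]
p = [g / 100 for g in grain]                       # proportions, as pgrain
logit_p = [math.log(pi / (1 - pi)) for pi in p]    # same as qlogis(pgrain)


def intercept_only_aic(y):
    """Gaussian AIC of y ~ 1, counting the mean and the residual variance."""
    n = len(y)
    mean = sum(y) / n
    rss = sum((yi - mean) ** 2 for yi in y)
    return n * (math.log(2 * math.pi * rss / n) + 1) + 2 * 2


aic_m0 = intercept_only_aic(p)

# Jacobian term for the logit: -2 * sum(log(1 / (p * (1 - p))))
jacobian_term = 2 * sum(math.log(pi * (1 - pi)) for pi in p)
aic_m1_on_p_scale = intercept_only_aic(logit_p) + jacobian_term

# The two fitted means differ slightly (Jensen's inequality)
mean_p = sum(p) / len(p)
mean_logit = sum(logit_p) / len(logit_p)
back_transformed_mean = 1 / (1 + math.exp(-mean_logit))   # plogis(mean(logit))

print(aic_m0, aic_m1_on_p_scale)
print(mean_p, back_transformed_mean)
```

The two AIC values should be close but not identical, and the back-transformed logit mean should sit slightly below the mean of the proportions, which is the Jensen's inequality effect mentioned above.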
