Understanding glm

décembre 09, 2019

Let do a simple glm to explore exactly how it works:
> datax <- data.frame(y=rnorm(100), x1=rnorm(100),
+ x2=rnorm(100), x3=rnorm(100), x4=rnorm(100),
+ x5=sample(x=c("A", "B"), size=100, replace = TRUE))
> gnul <- glm(y ~ 1, data=datax)

First, let see the number of fitted parameters :
> length(gnul$coefficients)
[1] 1

But when you use logLik(), two parameters are indicated as df=:
> logLik(gnul)
'log Lik.' -153.8374 (df=2)

If you do the glm "by hand", the number of fitted parameters is 2:

> dnormx <- function(x, data) {-sum(dnorm(data, mean=x["mean"], sd=x["sd"], log = TRUE))}
> parg <- c(mean=0, sd=1)
> o0 <- optim(par = parg, fn=dnormx, data=datax[, "y"])
> o0$par
mean sd
-0.1338348 1.1270446
> o0$value
[1] 153.8374

Then first question: where the df=2 comes from?
Let take a look at the logLik() function:
> getFromNamespace("logLik.glm", ns="stats")
function (object, ...)
{
if (!missing(...))
warning("extra arguments discarded")
fam <- family(object)$family
p <- object$rank
if (fam %in% c("gaussian", "Gamma", "inverse.gaussian"))
p <- p + 1
val <- p - object$aic/2
attr(val, "nobs") <- sum(!is.na(object$residuals))
attr(val, "df") <- p
class(val) <- "logLik"
val
}

If a "gaussian", "Gamma", "inverse.gaussian" distribution is used, just it takes +1 to include the dispersion parameter !

Here, it is clear that only one parameter is used:
> gnul$df.residual
[1] 99
> gnul$df.null
[1] 99

The dispersion parameter is not counted as a fitted parameter in glm.

Where is located the sd (#dispersion parameter) in the gnul object ?
You can get the value using:
> summary(gnul)$dispersion
[1] 1.28264

Let check the summary.glm() function:
getFromNamespace("summary.glm", ns="stats")
function (object, dispersion = NULL, correlation = FALSE, symbolic.cor = FALSE,
...)
{
est.disp <- FALSE
df.r <- object$df.residual
if (is.null(dispersion))
dispersion <- if (object$family$family %in% c("poisson",
"binomial"))
1
else if (df.r > 0) {
est.disp <- TRUE
if (any(object$weights == 0))
warning("observations with zero weight not used for calculating dispersion")
sum((object$weights * object$residuals^2)[object$weights >
0])/df.r
}
else {
est.disp <- TRUE
NaN
}
... cut here

So we can see that:
> sum(gnul$residuals^2)/(nrow(datax)-length(gnul$coefficients))
[1] 1.28264
Is used to estimate the dispersion parameter.

The dispersion parameter is not counted as a parameter because it is estimated from the data.

But it is a parameter for AIC... not clear at all !

Rechercher dans ce blog

BiostatR Blog

Understanding glm

Commentaires

Enregistrer un commentaire

Posts les plus consultés de ce blog

Standard error from Hessian Matrix... what can be done when problem occurs

Install treemix in ubuntu 20.04

Multivariable analysis and correlation of iconography