Understanding glm

Let do a simple glm to explore exactly how it works:
> datax <- data.frame(y=rnorm(100), x1=rnorm(100),
+                     x2=rnorm(100), x3=rnorm(100), x4=rnorm(100), 
+                     x5=sample(x=c("A", "B"), size=100, replace = TRUE))
> gnul <- glm(y ~ 1, data=datax)

First, let see the number of fitted parameters :
> length(gnul$coefficients)
[1] 1

But when you use logLik(), two parameters are indicated as df=:
> logLik(gnul)
'log Lik.' -153.8374 (df=2)

If you do the glm "by hand", the number of fitted parameters is 2:

> dnormx <- function(x, data) {-sum(dnorm(data, mean=x["mean"], sd=x["sd"], log = TRUE))}
> parg <- c(mean=0, sd=1)
> o0 <- optim(par = parg, fn=dnormx, data=datax[, "y"])
> o0$par
      mean         sd 
-0.1338348  1.1270446 
> o0$value
[1] 153.8374

Then first question: where the df=2 comes from?
Let take a look at the logLik() function:
> getFromNamespace("logLik.glm", ns="stats")
function (object, ...) 
{
    if (!missing(...)) 
        warning("extra arguments discarded")
    fam <- family(object)$family
    p <- object$rank
    if (fam %in% c("gaussian", "Gamma", "inverse.gaussian")) 
        p <- p + 1
    val <- p - object$aic/2
    attr(val, "nobs") <- sum(!is.na(object$residuals))
    attr(val, "df") <- p
    class(val) <- "logLik"
    val
}

If a "gaussian", "Gamma", "inverse.gaussian" distribution is used, just it takes +1 to include the dispersion parameter !

Here, it is clear that only one parameter is used:
> gnul$df.residual
[1] 99
> gnul$df.null
[1] 99

The dispersion parameter is not counted as a fitted parameter in glm.

Where is located the sd (#dispersion parameter) in the gnul object ?
You can get the value using:
> summary(gnul)$dispersion
[1] 1.28264

Let check the summary.glm() function:
getFromNamespace("summary.glm", ns="stats")
function (object, dispersion = NULL, correlation = FALSE, symbolic.cor = FALSE, 
    ...) 
{
    est.disp <- FALSE
    df.r <- object$df.residual
    if (is.null(dispersion)) 
        dispersion <- if (object$family$family %in% c("poisson", 
            "binomial")) 
            1
        else if (df.r > 0) {
            est.disp <- TRUE
            if (any(object$weights == 0)) 
                warning("observations with zero weight not used for calculating dispersion")
            sum((object$weights * object$residuals^2)[object$weights > 
                0])/df.r
        }
        else {
            est.disp <- TRUE
            NaN
        }
... cut here

So we can see that:
>  sum(gnul$residuals^2)/(nrow(datax)-length(gnul$coefficients))
[1] 1.28264
Is used to estimate the dispersion parameter.

The dispersion parameter is not counted as a parameter because it is estimated from the data.

But it is a parameter for AIC... not clear at all !








Commentaires

Posts les plus consultés de ce blog

Standard error from Hessian Matrix... what can be done when problem occurs

stepAIC from package MASS with AICc

Install treemix in ubuntu 20.04