Understanding glm
Let do a simple glm to explore exactly how it works:
> datax <- data.frame(y=rnorm(100), x1=rnorm(100),
+ x2=rnorm(100), x3=rnorm(100), x4=rnorm(100),
+ x5=sample(x=c("A", "B"), size=100, replace = TRUE))
> gnul <- glm(y ~ 1, data=datax)
First, let see the number of fitted parameters :
> length(gnul$coefficients)
[1] 1
But when you use logLik(), two parameters are indicated as df=:
> logLik(gnul)
'log Lik.' -153.8374 (df=2)
If you do the glm "by hand", the number of fitted parameters is 2:
> dnormx <- function(x, data) {-sum(dnorm(data, mean=x["mean"], sd=x["sd"], log = TRUE))}
> parg <- c(mean=0, sd=1)
> o0 <- optim(par = parg, fn=dnormx, data=datax[, "y"])
> o0$par
mean sd
-0.1338348 1.1270446
> o0$value
[1] 153.8374
Then first question: where the df=2 comes from?
Let take a look at the logLik() function:
> getFromNamespace("logLik.glm", ns="stats")
function (object, ...)
{
if (!missing(...))
warning("extra arguments discarded")
fam <- family(object)$family
p <- object$rank
if (fam %in% c("gaussian", "Gamma", "inverse.gaussian"))
p <- p + 1
val <- p - object$aic/2
attr(val, "nobs") <- sum(!is.na(object$residuals))
attr(val, "df") <- p
class(val) <- "logLik"
val
}
If a "gaussian", "Gamma", "inverse.gaussian" distribution is used, just it takes +1 to include the dispersion parameter !
Here, it is clear that only one parameter is used:
> gnul$df.residual
[1] 99
> gnul$df.null
[1] 99
The dispersion parameter is not counted as a fitted parameter in glm.
Where is located the sd (#dispersion parameter) in the gnul object ?
You can get the value using:
> summary(gnul)$dispersion
[1] 1.28264
Let check the summary.glm() function:
getFromNamespace("summary.glm", ns="stats")
function (object, dispersion = NULL, correlation = FALSE, symbolic.cor = FALSE,
...)
{
est.disp <- FALSE
df.r <- object$df.residual
if (is.null(dispersion))
dispersion <- if (object$family$family %in% c("poisson",
"binomial"))
1
else if (df.r > 0) {
est.disp <- TRUE
if (any(object$weights == 0))
warning("observations with zero weight not used for calculating dispersion")
sum((object$weights * object$residuals^2)[object$weights >
0])/df.r
}
else {
est.disp <- TRUE
NaN
}
... cut here
So we can see that:
> sum(gnul$residuals^2)/(nrow(datax)-length(gnul$coefficients))
[1] 1.28264
Is used to estimate the dispersion parameter.
The dispersion parameter is not counted as a parameter because it is estimated from the data.
But it is a parameter for AIC... not clear at all !
> datax <- data.frame(y=rnorm(100), x1=rnorm(100),
+ x2=rnorm(100), x3=rnorm(100), x4=rnorm(100),
+ x5=sample(x=c("A", "B"), size=100, replace = TRUE))
> gnul <- glm(y ~ 1, data=datax)
First, let see the number of fitted parameters :
> length(gnul$coefficients)
[1] 1
But when you use logLik(), two parameters are indicated as df=:
> logLik(gnul)
'log Lik.' -153.8374 (df=2)
If you do the glm "by hand", the number of fitted parameters is 2:
> dnormx <- function(x, data) {-sum(dnorm(data, mean=x["mean"], sd=x["sd"], log = TRUE))}
> parg <- c(mean=0, sd=1)
> o0 <- optim(par = parg, fn=dnormx, data=datax[, "y"])
> o0$par
mean sd
-0.1338348 1.1270446
> o0$value
[1] 153.8374
Then first question: where the df=2 comes from?
Let take a look at the logLik() function:
> getFromNamespace("logLik.glm", ns="stats")
function (object, ...)
{
if (!missing(...))
warning("extra arguments discarded")
fam <- family(object)$family
p <- object$rank
if (fam %in% c("gaussian", "Gamma", "inverse.gaussian"))
p <- p + 1
val <- p - object$aic/2
attr(val, "nobs") <- sum(!is.na(object$residuals))
attr(val, "df") <- p
class(val) <- "logLik"
val
}
If a "gaussian", "Gamma", "inverse.gaussian" distribution is used, just it takes +1 to include the dispersion parameter !
Here, it is clear that only one parameter is used:
> gnul$df.residual
[1] 99
> gnul$df.null
[1] 99
The dispersion parameter is not counted as a fitted parameter in glm.
Where is located the sd (#dispersion parameter) in the gnul object ?
You can get the value using:
> summary(gnul)$dispersion
[1] 1.28264
Let check the summary.glm() function:
getFromNamespace("summary.glm", ns="stats")
function (object, dispersion = NULL, correlation = FALSE, symbolic.cor = FALSE,
...)
{
est.disp <- FALSE
df.r <- object$df.residual
if (is.null(dispersion))
dispersion <- if (object$family$family %in% c("poisson",
"binomial"))
1
else if (df.r > 0) {
est.disp <- TRUE
if (any(object$weights == 0))
warning("observations with zero weight not used for calculating dispersion")
sum((object$weights * object$residuals^2)[object$weights >
0])/df.r
}
else {
est.disp <- TRUE
NaN
}
... cut here
So we can see that:
> sum(gnul$residuals^2)/(nrow(datax)-length(gnul$coefficients))
[1] 1.28264
Is used to estimate the dispersion parameter.
The dispersion parameter is not counted as a parameter because it is estimated from the data.
But it is a parameter for AIC... not clear at all !
Commentaires
Enregistrer un commentaire