The mess of linear regressions
Everyone has done a linear regression... but not everyone really knows how to do one. There are several methods, and they produce different outputs. Here is a little synthesis:
# Let's generate data. Note that the scales for x and y are different.
# set.seed(1) makes sure that the data are always the same,
# so that the examples can be reproduced
set.seed(1)
x <- 1:10
y <- 10*x+rnorm(10, 5, 20)
plot(x, y, bty="n", las=1)
# First method: only error on the y axis is assumed.
# The sum of the squared residuals along the y axis is minimized.
# Why the sum of squares? Because it corresponds to the maximum likelihood
# estimate when the residuals follow a Gaussian distribution
lm(y~ x)
abline(lm(y~ x), col="red")
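# A small check of the likelihood argument above (a sketch, not part of the original fit):
# maximizing the Gaussian log-likelihood of the residuals with optim() should recover
# (nearly) the same intercept and slope as lm(). negloglik is an ad hoc helper.
negloglik <- function(p) {   # p = c(intercept, slope, log(sd))
  -sum(dnorm(y, mean = p[1] + p[2]*x, sd = exp(p[3]), log = TRUE))
}
opt <- optim(c(0, 0, log(sd(y))), negloglik, control = list(maxit = 2000))
opt$par[1:2]      # maximum-likelihood intercept and slope
coef(lm(y ~ x))   # least-squares intercept and slope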
# Second method, very similar to the previous one: only error on the x axis is assumed.
# The sum of the squared residuals along the x axis is minimized
ir <- coef(lm(x ~ y))              # regress x on y, then invert the fit
print(c(-ir[1]/ir[2], 1/ir[2]))    # intercept and slope expressed on the y ~ x scale
abline(-ir[1]/ir[2], 1/ir[2], col="blue")
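# Quick comparison (sketch): the two fits give different slopes because each one
# minimizes a different sum of squares.
coef(lm(y ~ x))[2]   # slope when vertical (y) residuals are minimized
1/ir[2]              # slope when horizontal (x) residuals are minimized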
# The choice between putting the error on y or on x is important.
# The error must be attributed to the dependent variable, i.e. the one that is not
# controlled by the data collector. Note however that there are two kinds of
# independent variable: those that are simply measured (for example sex) and those
# that are controlled (for example the quantity of light, set by the experimenter).
# However, in many cases (for example in allometry), both the x and y variables are
# measured rather than controlled; least-squares methods are then not adequate.
# Furthermore, the measurement of the independent variable is not always perfect,
# and that error should also be taken into account.
# There are several solutions.
# Here errors on both the x and y axes are allowed.
# The product of the residuals along the x and y axes is minimized;
# this is equivalent to minimizing the area of the rectangle
# between each point and the fitted line
frect <- function(x, y){ # least rectangles regression
  b <- sqrt(var(y)/var(x))*sign(cor(x, y))  # slope: ratio of the sds, sign from the correlation
  c(mean(y) - b*mean(x), b)                 # intercept and slope
}
frect(x, y)
abline(frect(x, y), col="green")
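# A property worth checking (sketch): for positively correlated data like these,
# the least-rectangles slope is the geometric mean of the two least-squares slopes
# computed above.
b_yx <- coef(lm(y ~ x))[2]    # slope with error on y
b_xy <- 1/coef(lm(x ~ y))[2]  # slope with error on x
sqrt(b_yx * b_xy)             # should equal frect(x, y)[2]
frect(x, y)[2]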
# Another solution is to minimize the orthogonal distances from the points to the fitted line.
# Note that least rectangles regression is the same as orthogonal regression when the
# variances of x and y are equal.
# Least rectangles regression can therefore be seen as an orthogonal regression
# applied to standardized variables
library("pracma")
odregress(x, y)$coeff                              # returned as c(slope, intercept)
abline(rev(odregress(x, y)$coeff), col="yellow")   # abline() expects (intercept, slope)
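# A quick check of the remark above (a sketch added here): orthogonal regression on
# standardized variables, with the slope back-transformed to the original scale,
# gives the least-rectangles slope.
xs <- (x - mean(x))/sd(x)
ys <- (y - mean(y))/sd(y)
odregress(xs, ys)$coeff[1] * sd(y)/sd(x)   # back-transformed slope
frect(x, y)[2]                             # least-rectangles slope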
# Deming regression: the same idea, but the ratio of the residual standard deviations is assumed to be known
library("MethComp")
# vr: the assumed known ratio of the (residual) variance of the ys relative to that of the xs. Defaults to 1.
# sdr: the same ratio, but for the standard deviations. Defaults to 1
Deming(x, y)
abline(Deming(x, y)[1:2], col="black")
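# Sketch: with the default ratio of 1, Deming regression should coincide with the
# orthogonal regression computed above.
Deming(x, y)[2]           # Deming slope, default ratio of 1
odregress(x, y)$coeff[1]  # orthogonal regression slope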
# Here the ratio of residual standard deviations is assumed equal to the ratio of the means
mean(x)
mean(y)
Deming(x, y, sdr=mean(y)/mean(x))
abline(Deming(x, y, sdr=mean(y)/mean(x))[1:2], col="purple")
# If the ratio is very large, only the error on the y axis is effectively taken into account,
# so the result is nearly the same as lm(y ~ x)
abline(Deming(x, y, sdr=1000000)[1:2], col="black")
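# The same point in numbers (sketch): with a very large sdr the Deming estimates
# are essentially the ordinary least-squares ones.
Deming(x, y, sdr=1000000)[1:2]
coef(lm(y ~ x))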
######## a robust version of lm
# from Wikipedia:
# Robust statistics are statistics with good performance for data
# drawn from a wide range of probability distributions,
# especially for distributions that are not normally distributed.
library("robustbase")
lr <- lmrob(y ~ x)
abline(lr, col="black", lty=2)
legend("bottomright", col=c("red", "blue", "green", "yellow",
"black", "purple", "black", "black"),
lty=c(1, 1, 1, 1, 1, 1, 1, 2),
legend=c("error on y", "error on x", "error x and y",
"orthogonal", "Known ratio sd y/sd x", "sd # mean",
"High sd y/sd x", "robust"))
########################################
### what about the standard errors of parameters ###
########################################
library(sandwich)
l <- lm(y ~ x)
# Normal Variance-Covariance Matrix for a Fitted Model Object
sqrt(diag(vcov(l)))
# Robust method to estimate the standard error of parameters
# Heteroskedasticity and Autocorrelation Consistent (HAC)
# Covariance Matrix Estimation
# it can also be used for a glm object
sqrt(diag(vcovHAC(l)))
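# A sketch with simulated heteroskedastic data (assumption: residual sd grows with x)
# to show that the classical and HAC standard errors can differ noticeably.
set.seed(2)
xh <- 1:50
yh <- 10*xh + rnorm(50, 0, 2*xh)   # residual sd proportional to x
lh <- lm(yh ~ xh)
sqrt(diag(vcov(lh)))               # classical standard errors
sqrt(diag(vcovHAC(lh)))            # HAC standard errors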