The mess of linear regressions


Everyone has done a linear regression... but not everyone knows really how to do a linear regression. There are several methods which produce different outputs. Here is a little synthesis:

# Let generate data. Note the the scale for x and y are different.
# the set.seed(1) makes sure that the data are always the same
# to make reproduced examples
set.seed(1)
x <- 1:10
y <- 10*x+rnorm(10, 5, 20)
plot(x, y, bty="n", las=1)

# First method. Only error on y axis is supposed
# The square of the residuals on y axis is minimized
# Why the sum of square? It is grounded on the likelihood of Gaussian distribution for residuals
lm(y~ x)
abline(lm(y~ x), col="red")

# second, very similar to previous one, only error on x axis is supposed
# The square of the residuals on x axis is minimized
ir <- coef(lm(x~y))
print(c(-ir[1]/ir[2],1/ir[2]))
abline(-ir[1]/ir[2],1/ir[2], col="blue")

# The choice between error inputed on y or x is important.
# The error must be inputed on the dependent variable, ie the one that is not controlled by data collector. Note however that there are two kinds of dependent variable: those that are simply measured (for example sex) and those which are controlled (for example the quantity of light changed by experimenter)
# However, in many cases (for example in allometry), both x and y variables are dependent. Then least-square methods are not adequate. Furthermore, the quality of measurement for independent variable is not always perfect and error should be taken into account.

# There are several solutions

# Here error on both x and y axis are allowed.
# The product of the residual for x and y axis are minimized
# it is equivalent to minimized the surface of the rectangle
# from the point to the slope
frect <- function(x,y){ # droite des moindres rectangles 
  b <- sqrt(var(y)/var(x))*sign(cor(x,y)) 
  c(mean(y)-b*mean(x),b) 
}
frect(x, y)
abline(frect(x, y), col="green")

# Another solution is when the orthogonal distance of the residuals to the slope are minimized
# Note that least rectangle regression is the same than orthogonal regression when the variance of x and y are the same.
# Then orthogonal regression can be seen as a least rectangle regression for standardized variables
library("pracma")
odregress(x, y)$coeff
abline(rev(odregress(x, y)$coeff), col="yellow")

# Same here but the ratio of standard error of residuals is known
library("MethComp")
# vr  The assumed known ratio of the (residual) variance of the ys relative to that of the xs. Defaults to 1.
# sdr do. for standard deviations. Defaults to 1
Deming(x, y)
abline(Deming(x, y)[1:2], col="black")
# here it is supposed proportional to the mean
mean(x)
mean(y)
Deming(x, y, sdr=mean(y)/mean(x))
abline(Deming(x, y, sdr=mean(y)/mean(x))[1:2], col="purple")
# If the ratio is huge, it means that only error in y axis is taken into account
# it is nearly the same as lm(y~ x)
abline(Deming(x, y, sdr=1000000)[1:2], col="black")

######## a robust version of lm
# from Wikipedia:
# Robust statistics are statistics with good performance for data 
#    drawn from a wide range of probability distributions, 
#    especially for distributions that are not normally distributed.

library("robustbase")
lr <- lmrob(y ~ x)
abline(lr, col= "black ", lty=2) 

legend("bottomright", col=c("red", "blue", "green", "yellow", 
        "black", "purple", "black", "black"), 
       lty=c(1, 1, 1, 1, 1, 1, 1, 2), 
       legend=c("error on y", "error on x", "error x and y", 
       "orthogonal", "Known ratio sd y/sd x", "sd # mean", 
        "High sd y/sd x", "robust"))

########################################
### what about the standard errors of parameters ###
########################################

library(sandwich)
l <- lm(y ~ x)

# Normal Variance-Covariance Matrix for a Fitted Model Object
sqrt(diag(vcov(l)))

# Robust method to estimate the standard error of parameters
# Heteroskedasticity and Autocorrelation Consistent (HAC) 
#              Covariance Matrix Estimation
# it can be used also for a glm object
sqrt(diag(vcovHAC(l)))

Commentaires

Posts les plus consultés de ce blog

Standard error from Hessian Matrix... what can be done when problem occurs

stepAIC from package MASS with AICc

Install treemix in ubuntu 20.04