The mess of linear regressions
Everyone has done a linear regression... but not everyone really knows how to do one. There are several methods, and they produce different outputs. Here is a little synthesis:
# Let's generate data. Note that the scales for x and y are different.
# set.seed(1) makes sure that the data are always the same,
# so that the examples can be reproduced
set.seed(1)
x <- 1:10
y <- 10*x+rnorm(10, 5, 20)
plot(x, y, bty="n", las=1)
# First method: only error on the y axis is assumed.
# The sum of the squared residuals along the y axis is minimized.
# Why the sum of squares? Because it corresponds to the maximum likelihood
# estimate when the residuals follow a Gaussian distribution
lm(y~ x)
abline(lm(y~ x), col="red")
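# A small check of the likelihood argument above (a sketch, not part of the original fit):
# maximizing the Gaussian log-likelihood of the residuals with optim() should recover
# (nearly) the same intercept and slope as lm(). negloglik is an ad hoc helper.
negloglik <- function(p) {   # p = c(intercept, slope, log(sd))
  -sum(dnorm(y, mean = p[1] + p[2]*x, sd = exp(p[3]), log = TRUE))
}
opt <- optim(c(0, 0, log(sd(y))), negloglik, control = list(maxit = 2000))
opt$par[1:2]      # maximum-likelihood intercept and slope
coef(lm(y ~ x))   # least-squares intercept and slope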
# Second method, very similar to the previous one: only error on the x axis is assumed.
# The sum of the squared residuals along the x axis is minimized
ir <- coef(lm(x ~ y))              # regress x on y, then invert the fit
print(c(-ir[1]/ir[2], 1/ir[2]))    # intercept and slope expressed on the y ~ x scale
abline(-ir[1]/ir[2], 1/ir[2], col="blue")
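# Quick comparison (sketch): the two fits give different slopes because each one
# minimizes a different sum of squares.
coef(lm(y ~ x))[2]   # slope when vertical (y) residuals are minimized
1/ir[2]              # slope when horizontal (x) residuals are minimized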
# The choice between putting the error on y or on x is important.
# The error must be attributed to the dependent variable, i.e. the one that is not
# controlled by the data collector. Note however that there are two kinds of
# independent variable: those that are simply measured (for example sex) and those
# that are controlled (for example the quantity of light, set by the experimenter).
# However, in many cases (for example in allometry), both the x and y variables are
# measured rather than controlled; least-squares methods are then not adequate.
# Furthermore, the measurement of the independent variable is not always perfect,
# and that error should also be taken into account.
# There are several solutions.
# Here errors on both the x and y axes are allowed.
# The product of the residuals along the x and y axes is minimized;
# this is equivalent to minimizing the area of the rectangle
# between each point and the fitted line
frect <- function(x, y){ # least rectangles regression
  b <- sqrt(var(y)/var(x))*sign(cor(x, y))  # slope: ratio of the sds, sign from the correlation
  c(mean(y) - b*mean(x), b)                 # intercept and slope
}
frect(x, y)
abline(frect(x, y), col="green")
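# A property worth checking (sketch): for positively correlated data like these,
# the least-rectangles slope is the geometric mean of the two least-squares slopes
# computed above.
b_yx <- coef(lm(y ~ x))[2]    # slope with error on y
b_xy <- 1/coef(lm(x ~ y))[2]  # slope with error on x
sqrt(b_yx * b_xy)             # should equal frect(x, y)[2]
frect(x, y)[2]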
# Another solution is to minimize the orthogonal distances from the points to the fitted line.
# Note that least rectangles regression is the same as orthogonal regression when the
# variances of x and y are equal.
# Least rectangles regression can therefore be seen as an orthogonal regression
# applied to standardized variables
library("pracma")
odregress(x, y)$coeff                              # returned as c(slope, intercept)
abline(rev(odregress(x, y)$coeff), col="yellow")   # abline() expects (intercept, slope)
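# A quick check of the remark above (a sketch added here): orthogonal regression on
# standardized variables, with the slope back-transformed to the original scale,
# gives the least-rectangles slope.
xs <- (x - mean(x))/sd(x)
ys <- (y - mean(y))/sd(y)
odregress(xs, ys)$coeff[1] * sd(y)/sd(x)   # back-transformed slope
frect(x, y)[2]                             # least-rectangles slope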
# Deming regression: the same idea, but the ratio of the residual standard deviations is assumed to be known
library("MethComp")
# vr: the assumed known ratio of the (residual) variance of the ys relative to that of the xs. Defaults to 1.
# sdr: the same ratio, but for the standard deviations. Defaults to 1
Deming(x, y)
abline(Deming(x, y)[1:2], col="black")
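# Sketch: with the default ratio of 1, Deming regression should coincide with the
# orthogonal regression computed above.
Deming(x, y)[2]           # Deming slope, default ratio of 1
odregress(x, y)$coeff[1]  # orthogonal regression slope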
# Here the ratio of residual standard deviations is assumed equal to the ratio of the means
mean(x)
mean(y)
Deming(x, y, sdr=mean(y)/mean(x))
abline(Deming(x, y, sdr=mean(y)/mean(x))[1:2], col="purple")
# If the ratio is very large, only the error on the y axis is effectively taken into account,
# so the result is nearly the same as lm(y ~ x)
abline(Deming(x, y, sdr=1000000)[1:2], col="black")
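# The same point in numbers (sketch): with a very large sdr the Deming estimates
# are essentially the ordinary least-squares ones.
Deming(x, y, sdr=1000000)[1:2]
coef(lm(y ~ x))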
######## a robust version of lm
# from Wikipedia:
# Robust statistics are statistics with good performance for data
# drawn from a wide range of probability distributions,
# especially for distributions that are not normally distributed.
library("robustbase")
lr <- lmrob(y ~ x)
abline(lr, col="black", lty=2)
legend("bottomright", col=c("red", "blue", "green", "yellow",
"black", "purple", "black", "black"),
lty=c(1, 1, 1, 1, 1, 1, 1, 2),
legend=c("error on y", "error on x", "error x and y",
"orthogonal", "Known ratio sd y/sd x", "sd # mean",
"High sd y/sd x", "robust"))
########################################
### what about the standard errors of parameters ###
########################################
library(sandwich)
l <- lm(y ~ x)
# Normal Variance-Covariance Matrix for a Fitted Model Object
sqrt(diag(vcov(l)))
# Robust method to estimate the standard error of parameters
# Heteroskedasticity and Autocorrelation Consistent (HAC)
# Covariance Matrix Estimation
# it can also be used for a glm object
sqrt(diag(vcovHAC(l)))
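# A sketch with simulated heteroskedastic data (assumption: residual sd grows with x)
# to show that the classical and HAC standard errors can differ noticeably.
set.seed(2)
xh <- 1:50
yh <- 10*xh + rnorm(50, 0, 2*xh)   # residual sd proportional to x
lh <- lm(yh ~ xh)
sqrt(diag(vcov(lh)))               # classical standard errors
sqrt(diag(vcovHAC(lh)))            # HAC standard errors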