p-hacking: Another way to be sure to have significant results

I discover recently the quantile regression: rather than doing a regression on your points, you are doing a regression on the quantile(probs) of your points. As you have an infinite possibility for probs, you can do an infinite number of p-values... be sure that you will find one significant !

Let try:

df <- data.frame(y=runif(n = 10*20), x=rep(1:10, 20))

g0 <- glm(formula = y ~ x, data=df)
k <- summary(g0)

pvalue <- NULL

for (probs in seq(from=0.01, to=0.99, by=0.01)) {
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)
pvalue <- c(pvalue, summary(g)$coefficients["x", "Pr(>|t|)"])
}

h <- hist(pvalue, breaks=seq(from=0, to=1, by=0.05), las=1)
polygon(x=c(h$breaks[1], h$breaks[2], h$breaks[2], h$breaks[1]),
        y=c(0, 0, h$counts[1], h$counts[1]), 
        col="grey")

probs <- seq(from=0.01, to=0.99, by=0.01)[which.min(pvalue)]
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)

plot(x = df$x, y=df$y, bty="n", pch=19, las=1, xaxt="n")
axis(1, at=1:10)
points(x = q$x, y=q$y, col="red", pch=19)
lines(x = q$x, y=g$fitted.values, col="red", lwd=2)
legend("bottomleft", legend = c(paste("p value=", k$coefficients["x", "Pr(>|t|)"]), 
                                paste("p value=", min(pvalue)))
       , col=c("black", "red"), pch=19)
     

The regression with all the points (in black) was not significant... pity after all this work !
Let try with all the quantiles from 0.01 to 0.99 (step 0.01).


Great: 3 quantiles produced significant p-value ! The most significant is shown in red in the previous graph.

In conclusion, take care when you see quantile regression in a paper !

Commentaires

Posts les plus consultés de ce blog

Standard error from Hessian Matrix... what can be done when problem occurs

stepAIC from package MASS with AICc

Install treemix in ubuntu 20.04