p-hacking: Another way to be sure to have significant results
I discover recently the quantile regression: rather than doing a regression on your points, you are doing a regression on the quantile(probs) of your points. As you have an infinite possibility for probs, you can do an infinite number of p-values... be sure that you will find one significant !
Let try:
df <- data.frame(y=runif(n = 10*20), x=rep(1:10, 20))
g0 <- glm(formula = y ~ x, data=df)
k <- summary(g0)
pvalue <- NULL
for (probs in seq(from=0.01, to=0.99, by=0.01)) {
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)
pvalue <- c(pvalue, summary(g)$coefficients["x", "Pr(>|t|)"])
}
h <- hist(pvalue, breaks=seq(from=0, to=1, by=0.05), las=1)
polygon(x=c(h$breaks[1], h$breaks[2], h$breaks[2], h$breaks[1]),
y=c(0, 0, h$counts[1], h$counts[1]),
col="grey")
probs <- seq(from=0.01, to=0.99, by=0.01)[which.min(pvalue)]
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)
plot(x = df$x, y=df$y, bty="n", pch=19, las=1, xaxt="n")
axis(1, at=1:10)
points(x = q$x, y=q$y, col="red", pch=19)
lines(x = q$x, y=g$fitted.values, col="red", lwd=2)
legend("bottomleft", legend = c(paste("p value=", k$coefficients["x", "Pr(>|t|)"]),
paste("p value=", min(pvalue)))
, col=c("black", "red"), pch=19)
The regression with all the points (in black) was not significant... pity after all this work !
Let try:
df <- data.frame(y=runif(n = 10*20), x=rep(1:10, 20))
g0 <- glm(formula = y ~ x, data=df)
k <- summary(g0)
pvalue <- NULL
for (probs in seq(from=0.01, to=0.99, by=0.01)) {
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)
pvalue <- c(pvalue, summary(g)$coefficients["x", "Pr(>|t|)"])
}
h <- hist(pvalue, breaks=seq(from=0, to=1, by=0.05), las=1)
polygon(x=c(h$breaks[1], h$breaks[2], h$breaks[2], h$breaks[1]),
y=c(0, 0, h$counts[1], h$counts[1]),
col="grey")
probs <- seq(from=0.01, to=0.99, by=0.01)[which.min(pvalue)]
q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs)))
g <- glm(formula = y ~ x, data=q)
plot(x = df$x, y=df$y, bty="n", pch=19, las=1, xaxt="n")
axis(1, at=1:10)
points(x = q$x, y=q$y, col="red", pch=19)
lines(x = q$x, y=g$fitted.values, col="red", lwd=2)
legend("bottomleft", legend = c(paste("p value=", k$coefficients["x", "Pr(>|t|)"]),
paste("p value=", min(pvalue)))
, col=c("black", "red"), pch=19)
The regression with all the points (in black) was not significant... pity after all this work !
Let try with all the quantiles from 0.01 to 0.99 (step 0.01).
Great: 3 quantiles produced significant p-value ! The most significant is shown in red in the previous graph.
In conclusion, take care when you see quantile regression in a paper !
Commentaires
Enregistrer un commentaire