Articles

Affichage des articles du août, 2019

Deviance test for binomial distribution

# Let A et B being observations from binomial distribution: A <- c(10, 10, 9, 7, 3, 2, 3) B <- c(2, 5, 6, 6, 10, 15, 20) # And the prediction of proportion being pred: pred <- c(0.9, 0.85, 0.75, 0.4, 0.3, 0.2, 0.1) # For example prod could be the result of a glm, or other model. # Number of parameters obtained from observations and used to estimate pred parameter <- 2 # Then the degrees of freedom are: df <- length(pred)-parameter LnL <- sum(dbinom(x=A, size = A+B,                    prob = pred, log = TRUE)) LnLSat <- sum(dbinom(x=A, size = A+B,                       prob = A/(A+B), log = TRUE)) deviance <- -2*(LnL - LnLSat) p <- 1-pchisq(deviance, df=df) # The larger is p, the more the observations could have been obtained from the model used to calculate pred # Do not use p<0.05 as a criteria; p is lik...

p-hacking: Another way to be sure to have significant results

Image
I discover recently the quantile regression: rather than doing a regression on your points, you are doing a regression on the quantile(probs) of your points. As you have an infinite possibility for probs, you can do an infinite number of p-values... be sure that you will find one significant ! Let try: df <- data.frame(y=runif(n = 10*20), x=rep(1:10, 20)) g0 <- glm(formula = y ~ x, data=df) k <- summary(g0) pvalue <- NULL for (probs in seq(from=0.01, to=0.99, by=0.01)) { q <- as.data.frame(aggregate(df, by=list(df$x), FUN = function(x) quantile(x, probs=probs))) g <- glm(formula = y ~ x, data=q) pvalue <- c(pvalue, summary(g)$coefficients["x", "Pr(>|t|)"]) } h <- hist(pvalue, breaks=seq(from=0, to=1, by=0.05), las=1) polygon(x=c(h$breaks[1], h$breaks[2], h$breaks[2], h$breaks[1]),         y=c(0, 0, h$counts[1], h$counts[1]),          col="grey") probs <- seq(from=0.01, to=0.99, by=0.01)[whic...

Another R inferno: if () {} else {} + x

if (TRUE) { 0 } else { 2 } + 3 will return 0 and not 3 because the priority order is: if (TRUE) then {0} else {2} + 3 This is clearly counter-intuitive. As a general rule, I propose to never use result of a if statement directly in a computing. Note that ifelse() gives more normal result: ifelse(TRUE, 0, 2) + 3 will return 3

sample(x, length) or x[runif(length, 1:(length(x)+1))] : what is the most rapid?

microbenchmark(   sample(x = x, size=1000L, replace = TRUE) , sample(x = x, size=1000, replace = TRUE) , x[floor(runif(1000L, 1L, length(x)+1L))] , x[floor(runif(1000, 1, length(x)+1))] , x[runif(1000L, 1L, length(x)+1L)] , x[runif(1000, 1, length(x)+1)] , times = 10000L ) Unit: microseconds                                         expr    min      lq     mean  median      uq  sample(x = x, size = 1000L, replace = TRUE) 42.202 46.0750 51.70049 50.4445 53.1295   sample(x = x, size = 1000, replace = TRUE) 41.433 46.0935 51.52335 50.5570 53.0650   x[floor(runif(1000L, 1L, length(x) + 1L))] 40.027 42.2450 49.22916 49.5545 52.0355      x[floor(runif(1000, 1, length(x) + 1))] 40.112 42.1865 59.03723 49.5050 51.9370          x[runif(1000L, 1L, length(x) + 1L)...

Integer, L, double, numeric

in R, integers can be setup using as.integer(x) or 1000L (L means L, long, because it used 32 bits representation of number which was longer than the old 8 or 16 bits !). f <- integer(length) creates a vector of size length with only 0. > f <- integer(1) > f [1] 0 > is.integer(f) [1] TRUE > identical(f, 0L) [1] TRUE The double() or numeric() are the same number representation: real() for R <3.0.0 was the same but is deprecated. microbenchmark(   1L:2000L   , 1:1000   , as.integer(1):as.integer(2000)   , times = 1000000L ) Unit: nanoseconds                            expr min  lq     mean median  uq      max                        1L:2000L 129 144 181.2866    147 153    64915                   ...

Bug in knitr / Rmarkdown for producing word document and graphics in pdf

Let try this minimal Rmarkdown file: --- title: "cex in Rmarkdown" output: word_document --- ```{r} knitr::opts_chunk$set(dev='pdf') ``` ```{r} plot((0:160)/4, 0:160, type="n") text(x=20, y=70, labels =expression(alpha), cex=1e-7) ``` When knitr-red from Rstudio (with r 3.6.1 on MacosX with knitr 1.24), it produced an error with this message ("Information of metric non available for this family") [translation from French because of my system configuration] However, text(x=20, y=70, labels =expression(alpha), cex=0) and text(x=20, y=70, labels =expression(alpha), cex=0.1) work. Also text(x=20, y=70, labels ="a", cex=1e-7) works but text(x=20, y=70, labels =expression("K"["0"]), cex=1e-7)   produced also an error. The error does not occur when dev='png' is used but it occurs also for dev='postscript'. The only solution that I found is something like that: cexindex <- The value that y...

Xcode updates

After each update of Xcode, do not forget to open it and install the command line tool. Then in terminal, do: sudo xcodebuild -license This command is not necessary if you open Xcode and install command line tools: sudo xcode-select --install For older version of Xcode: Install the sdk headers for version 10.14 with: open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg They are installed with 10.15 version. Check using: cd /Library/Developer/CommandLineTools/SDKs (base) marcgirondot@MacBook-Air-de-Marc SDKs % ls -al total 0 drwxr-xr-x 5 root wheel 160 11 déc 2019 . drwxr-xr-x 5 root wheel 160 15 nov 2019 .. lrwxr-xr-x 1 root wheel 15 11 déc 2019 MacOSX.sdk -> MacOSX10.15.sdk drwxr-xr-x 7 root wheel 224 11 déc 2019 MacOSX10.14.sdk drwxr-xr-x 8 root wheel 256 11 déc 2019 MacOSX10.15.sdk