Articles

Showing posts from August, 2019

Deviance test for binomial distribution

# Let A and B be observations from a binomial distribution:
A <- c(10, 10, 9, 7, 3, 2, 3)
B <- c(2, 5, 6, 6, 10, 15, 20)
# And let pred be the predicted proportions:
pred <- c(0.9, 0.85, 0.75, 0.4, 0.3, 0.2, 0.1)
# For example, pred could be the result of a glm or another model.
# Number of parameters estimated from the observations to obtain pred:
parameter <- 2
# Then the degrees of freedom are:
df <- length(pred) - parameter
LnL <- sum(dbinom(x=A, size=A+B, prob=pred, log=TRUE))
LnLSat <- sum(dbinom(x=A, size=A+B, prob=A/(A+B), log=TRUE))
deviance <- -2*(LnL - LnLSat)
p <- 1 - pchisq(deviance, df=df)
# The larger p is, the more plausible it is that the observations
# could have been generated by the model used to calculate pred.
# Do not use p < 0.05 as a strict criterion; p is only a proxy.
plot(x=pred, y=A/(A+B), bty="n", las=1, xlim=c(0, 1), ylim=c(0, 1))
text(x=0.8, y=0.
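As a cross-check (my own addition, not part of the original post): when pred comes from a binomial glm, the same deviance and goodness-of-fit p-value can be read directly from the fitted model. The covariate x below is an assumption, only there to have something to regress on.

# Hypothetical covariate, for illustration only:
x <- 1:7
A <- c(10, 10, 9, 7, 3, 2, 3)
B <- c(2, 5, 6, 6, 10, 15, 20)
m <- glm(cbind(A, B) ~ x, family = binomial)
# Residual deviance is -2*(LnL - LnLSat) for the fitted proportions:
deviance(m)
# Goodness-of-fit test against the saturated model (df = 7 - 2 = 5):
1 - pchisq(deviance(m), df = df.residual(m))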

p-hacking: Another way to be sure to get significant results

I recently discovered quantile regression: rather than doing a regression on your points, you do a regression on a quantile (probs) of your points. As probs can take infinitely many values, you can compute an infinite number of p-values... be sure that you will find a significant one! Let's try:

df <- data.frame(y=runif(n = 10*20), x=rep(1:10, 20))
g0 <- glm(formula = y ~ x, data=df)
k <- summary(g0)
pvalue <- NULL
for (probs in seq(from=0.01, to=0.99, by=0.01)) {
  q <- as.data.frame(aggregate(df, by=list(df$x),
                               FUN = function(x) quantile(x, probs=probs)))
  g <- glm(formula = y ~ x, data=q)
  pvalue <- c(pvalue, summary(g)$coefficients["x", "Pr(>|t|)"])
}
h <- hist(pvalue, breaks=seq(from=0, to=1, by=0.05), las=1)
polygon(x=c(h$breaks[1], h$breaks[2], h$breaks[2], h$breaks[1]),
        y=c(0, 0, h$counts[1], h$counts[1]),
        col="grey")
probs <- seq(from=0.01, to=0.99, by=0.01)[which.min(pvalue)]
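A small addition of my own: if you really do scan many quantiles like this, the raw minimum p-value means very little, and a multiplicity correction should at least be applied before claiming significance.

# The "best" raw p-value over the 99 scanned quantiles:
min(pvalue)
# The same after Holm correction for the 99 tests:
min(p.adjust(pvalue, method = "holm"))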

Another R inferno: if () {} else {} + x

if (TRUE) { 0 } else { 2 } + 3 will return 0 and not 3, because R parses it as if (TRUE) { 0 } else ({ 2 } + 3): the + 3 is attached to the else branch. This is clearly counter-intuitive. As a general rule, I propose never to use the result of an if statement directly in a computation. Note that ifelse() gives the more usual result: ifelse(TRUE, 0, 2) + 3 will return 3.
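A minimal sketch of my own, showing how to get the intuitive result if you do want to compute with an if expression:

# Without parentheses, + 3 belongs to the else branch:
if (TRUE)  { 0 } else { 2 } + 3   # 0
if (FALSE) { 0 } else { 2 } + 3   # 5, not 2

# Parenthesise the whole if expression, or assign it first:
(if (TRUE) { 0 } else { 2 }) + 3  # 3
r <- if (TRUE) { 0 } else { 2 }
r + 3                             # 3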

sample(x, length) or x[runif(length, 1, length(x)+1)]: which is the fastest?

library(microbenchmark)
# x is assumed to be a long vector defined earlier in the session
microbenchmark(
  sample(x = x, size = 1000L, replace = TRUE),
  sample(x = x, size = 1000, replace = TRUE),
  x[floor(runif(1000L, 1L, length(x) + 1L))],
  x[floor(runif(1000, 1, length(x) + 1))],
  x[runif(1000L, 1L, length(x) + 1L)],
  x[runif(1000, 1, length(x) + 1)],
  times = 10000L
)

Unit: microseconds
                                        expr    min      lq     mean  median      uq
 sample(x = x, size = 1000L, replace = TRUE) 42.202 46.0750 51.70049 50.4445 53.1295
  sample(x = x, size = 1000, replace = TRUE) 41.433 46.0935 51.52335 50.5570 53.0650
  x[floor(runif(1000L, 1L, length(x) + 1L))] 40.027 42.2450 49.22916 49.5545 52.0355
     x[floor(runif(1000, 1, length(x) + 1))] 40.112 42.1865 59.03723 49.5050 51.9370
         x[runif(1000L, 1L, length(x) + 1L)] 37.124 39.2570 46.15197 46.4555 48.9570
            x[runif(1000, 1, length(x) + 1)] 37.093 39.1935 54.89901 46.3630 48.7875

In conclusion, the fastest is x[runif(1000, 1, length(x) + 1)]. But the
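A side note of my own, not in the original post: the runif() versions work without floor() because R truncates non-integer subscripts toward zero.

x <- c(10, 20, 30, 40, 50)
x[2.7]                        # 20: the fractional subscript is truncated to 2
i <- runif(3, 1, length(x) + 1)
identical(x[i], x[floor(i)])  # TRUE: indexing already does the truncation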

Integer, L, double, numeric

In R, integers can be set up using as.integer(x) or the suffix L, as in 1000L (L stands for "long", because the 32-bit representation was longer than the old 8- or 16-bit integers!). f <- integer(length) creates a vector of size length filled with 0:

> f <- integer(1)
> f
[1] 0
> is.integer(f)
[1] TRUE
> identical(f, 0L)
[1] TRUE

double() and numeric() are the same number representation; real(), in R < 3.0.0, was also the same but is deprecated.

library(microbenchmark)
microbenchmark(
  1L:2000L,
  1:1000,
  as.integer(1):as.integer(2000),
  times = 1000000L
)

Unit: nanoseconds
                           expr min  lq     mean median  uq      max
                       1L:2000L 129 144 181.2866    147 153    64915
                         1:1000 126 147 203.7078    149 154 20499610
 as.integer(1):as.integer(2000) 299 329 453.9530    335 350 32912817
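A minimal illustration of my own of the practical difference between the two representations:

typeof(1)                  # "double": a plain 1 is stored as a double
typeof(1L)                 # "integer"
identical(1, 1L)           # FALSE: the types differ
1 == 1L                    # TRUE: the values compare equal after coercion
object.size(integer(1e6))  # about 4 MB: 4 bytes per integer
object.size(numeric(1e6))  # about 8 MB: 8 bytes per double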

Bug in knitr / Rmarkdown when producing a Word document with graphics in pdf

Let's try this minimal Rmarkdown file:

---
title: "cex in Rmarkdown"
output: word_document
---

```{r}
knitr::opts_chunk$set(dev='pdf')
```

```{r}
plot((0:160)/4, 0:160, type="n")
text(x=20, y=70, labels = expression(alpha), cex=1e-7)
```

When knitted from RStudio (with R 3.6.1 on macOS and knitr 1.24), it produces an error with the message "metric information not available for this family" [translated from French because of my system configuration]. However, text(x=20, y=70, labels = expression(alpha), cex=0) and text(x=20, y=70, labels = expression(alpha), cex=0.1) work. Also, text(x=20, y=70, labels = "a", cex=1e-7) works, but text(x=20, y=70, labels = expression("K"["0"]), cex=1e-7) also produces an error. The error does not occur when dev='png' is used, but it does occur for dev='postscript'. The only solution that I found is something like this: cexindex <- The value that y
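The post is cut off here. As a sketch of my own (not necessarily the author's fix), one way around the error is to never let cex fall below a value the pdf device can handle, for example with a small hypothetical helper:

# Hypothetical helper, not from the original post: clamp cex to a minimum
# so the pdf/postscript devices can still compute the string metrics.
safe_cex <- function(cex, minimum = 0.1) max(cex, minimum)

plot((0:160)/4, 0:160, type = "n")
text(x = 20, y = 70, labels = expression(alpha), cex = safe_cex(1e-7))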

Xcode updates

After each update of Xcode, do not forget to open it and install the command line tools. Then, in a terminal, do:

sudo xcodebuild -license

This command is not necessary if you open Xcode and install the command line tools:

sudo xcode-select --install

For older versions of Xcode, install the SDK headers for version 10.14 with:

open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

They are installed with the 10.15 version. Check using:

cd /Library/Developer/CommandLineTools/SDKs
(base) marcgirondot@MacBook-Air-de-Marc SDKs % ls -al
total 0
drwxr-xr-x 5 root wheel 160 11 déc  2019 .
drwxr-xr-x 5 root wheel 160 15 nov  2019 ..
lrwxr-xr-x 1 root wheel  15 11 déc  2019 MacOSX.sdk -> MacOSX10.15.sdk
drwxr-xr-x 7 root wheel 224 11 déc  2019 MacOSX10.14.sdk
drwxr-xr-x 8 root wheel 256 11 déc  2019 MacOSX10.15.sdk