BiostatR Blog

Articles

Affichage des articles du décembre, 2017

Optimisation: sort or order and indicating default parameter

décembre 28, 2017

First, it is better to use sort(x) or x[order(x)]: > x <- sample(1:100, 100) > system.time({ + for (i in 1:100000) y <- x[order(x, decreasing = FALSE)] + }) utilisateur système écoulé 1.908 0.043 1.965 > system.time({ + for (i in 1:100000) y <- sort(x, decreasing = FALSE) + }) utilisateur système écoulé 2.993 0.035 3.043 Clearly the syntax x[order(x)] is more rapid (1/3 more fast). Now a second test: decreasing = FALSE is the default; what is the difference if it is not indicated: > system.time({ + for (i in 1:100000) y <- x[order(x)] + }) utilisateur système écoulé 1.907 0.037 1.952 > > system.time({ + for (i in 1:100000) y <- sort(x) + }) utilisateur système écoulé 3.049 0.035 3.094 No big change...

Sort your data to get a accurate sum

décembre 27, 2017

Based on a message of Jan Motl in r-help discussion list. If the numbers are positive, to get an accurate sum, sort the data by increasing order. > x <- c(rep(1.1, 10000), 10^16) > dput(c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))) c(10000000000010996, 10000000000011000) If the numbers are negative, to get an accurate sum, sort the data by decreasing order. > x <- c(rep(-1.1, 10000), -10^16) > dput(c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))) c(-10000000000011000, -10000000000010996) If the numbers are both positive and negative, no change. > x <- c(rep(1.1, 5000), rep(-1.1, 5000), 10^16) > dput(c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))) c(1e+16, 1e+16) Higham NJ (2002) Accuracy and stability of numerical algorithms. Society for Industrial and Applied Mathematics, Philadelphia, USA Chapter 4 - Summation See also https://stackoverflow.com/questions/47847295/why

The problem of constant SD for likelihood of Gaussian distribution

décembre 18, 2017

Imagine that you have a Gaussian distribution N(mean, sd) and you search for a model f(xi) that explained the best your observed data ni. A common solution is to use maximum likelihood and then minimizing -Ln L using dnorm(ni, mean=f(xi), sd=SD, log=TRUE) with SD being fitted. However, doing this, you give huge weight to the large values ni. Imagine n1=1000 and n2=10. The model predicting n from the x gives a 10% error prediction. Then f(x1)=1100 and f(x2)=11. The -ln likelihood of 1 and 2 are: L1 <- -dnorm(x=n1, mean=f(x1), sd=SD, log=TRUE) and L2 <- -dnorm(x=n2, mean=f(x2), sd=SD, log=TRUE) The minimum for L1 is 6.024496 with SD=102 and for L2 is 1.418939 with SD=1. Then contribution of L1 in the total likelihood is much higher than contribution of L2... the values with higher figures drive the fit. This is normal because the absolute deviation is higher and the model suppose no heteroskedasticity. But heteroskedasticity is very very often observed. So it is bett

Distribution of Pearson coefficient of correlation

décembre 15, 2017

library(hypergeo) pearson.p <- function(r, n, rho=0) { # n # Number of observations # r # Valeur for which density needs to be estimated # rho # Observed correlation N <- (n-2)*gamma(n-1)*(1-rho^2)^((n-1)/2)*(1-r^2)^((n-4)/2) D <- sqrt(2*pi)*gamma(n-1/2)*(1-rho*r)^(n-3/2) P <- hypergeo(A=1/2, B=1/2, C=(2*n-1)/2, z=(rho*r+1)/2) return((N/D)*P) } # When the correlation is null, it simplified to : pearson.p.0 <- function(r, n) { pofr <- ((1-r^2)^((n-4)/2))/beta(a = 1/2, b = (n-2)/2) return(pofr) } n <- 50 # Generate 1000 series of n random variables m <- sapply(X = 1:1000, FUN = function(x) rnorm(n)) # Calculate the pearson correlation between all the series (1000x1000) cors <- cor(x = m, method = "pearson") # take only the off-diagonals, otherwise there'd be duplicates cor.data <- cors[upper.tri(cors, diag = FALSE)] par(mar=c(4, 4, 1, 1)+0.4) hist(cor.data, xlim=c(-1, 1), freq=FALSE, las=1,

Error "failed to lock directory"

décembre 14, 2017

If you get this message when installing a package: ERROR: failed to lock directory ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library’ for modifying Try using: install.packages(xxx, INSTALL_opts = c('--no-lock')) or in terminal: R CMD INSTALL --no-lock xxx