Articles

Showing articles from 2025

Why do p-values over-estimate the first-order risk?

The short answer is: 👉 Because a p-value is computed conditional on the null hypothesis being true, it does not represent the probability of making a Type I error in the situation you are actually in. When it is interpreted as such, it systematically over-estimates the "first-order risk". Below is the precise reasoning. 1. What "first-order risk" really is. The Type I error rate (first-order risk) is α = P(reject H0 | H0 is true). This is a long-run, pre-specified property of a decision rule (e.g. "reject if p < 0.05"). It is not a probability about the current experiment. 2. What a p-value actually is. A p-value is p = P(T ≥ t_obs | H0). Key points: it is conditional on H0 being true; it is not P(H0 | data); it is not P(Type I error). 3. Where the over-estimation comes from. The common (incorrect) interpretation: "If...
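To make the distinction concrete, here is a minimal R sketch (not from the post): under H0 the p-values are uniformly distributed, so the long-run Type I error of the rule "reject if p < 0.05" is indeed about 5%, even though no individual p-value is that probability.

```r
# Simulate many experiments in which H0 is true (mean = 0)
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(20, mean = 0), mu = 0)$p.value)
hist(pvals, breaks = 20, main = "p-values under H0", xlab = "p")  # roughly uniform
mean(pvals < 0.05)   # close to 0.05: the long-run, pre-specified alpha
```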

Assumptions and ANOVA

👉 For ANOVA, both the homogeneity of variances and the normality assumptions concern the errors of the model, so they should be assessed on the residuals. Below is the precise reasoning, with practical nuances. 1. What ANOVA actually assumes. The classical ANOVA model is Y_ij = μ + α_i + ε_ij, with the assumptions: Normality: ε_ij ~ N(0, σ²); Homoscedasticity: Var(ε_ij) = σ² for all groups; Independence of the ε_ij. So both assumptions apply to the errors, not to the raw response Y. 2. Consequences for diagnostics. ✅ Normality should be assessed on residuals, not on the original data: raw data can be non-normal simply because the group means differ. Correct tools: Q–Q plot of residuals, histogram of residuals, Shapiro–Wilk test on residuals (with caution). ✅ Homogeneity of variances also concerns residual variance ...
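As an illustration of point 2, here is a minimal sketch with simulated data (the data frame and variable names are only for the example): both diagnostics are run on the residuals of the fitted model.

```r
# Simulated one-way ANOVA data, for illustration only
set.seed(1)
dat <- data.frame(group = gl(3, 20), y = rnorm(60))
fit <- aov(y ~ group, data = dat)
res <- residuals(fit)
qqnorm(res); qqline(res)             # normality assessed on the residuals
shapiro.test(res)                    # with caution, as noted above
bartlett.test(res ~ group, data = dat)  # equal residual variance across groups
```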

Install gsl package in Ubuntu 24.04

You must first install the GSL development library: sudo apt install libgsl-dev and then you can install the gsl package in R: install.packages("gsl")

Fitting an exponential model with log(y) = a + b t or y = exp(a + b t)

Data: x = 2010:2020 (11 points), y = (10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400). To simplify interpretation, the year is often centered: t = x − 2010 = 0, 1, …, 10. 1️⃣ Linear regression on log(y). Model: log(y) = α + β t + ε. Key assumption: the error is additive on the log scale, therefore multiplicative on the original scale. Fit (order of magnitude): one typically obtains something like log(y) ≈ 2.2 + 0.36 t, i.e. back on the original scale ŷ = exp(2.2 + 0.36 t). 👉 regular exponential growth 👉 relative errors are roughly constant 👉 small values have as much weight as large ones. 2️⃣ Direct nonlinear regression on y. Model: y = a e^(b t) + ε. Key assumption: the error is additive on y; the variance is assumed constant on the original scale. Typical fit: ŷ ≈ 9.5 e^(0.39 t). Consequences: the large values (300, 400) strongly dominate the fit; the early years ...
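A minimal sketch reproducing the two fits described above (the coefficients quoted in the post are orders of magnitude, so the exact output may differ slightly):

```r
x <- 2010:2020
y <- c(10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)
t <- x - 2010                                    # centered year

# 1) Linear regression on the log scale (multiplicative error)
fit_log <- lm(log(y) ~ t)
coef(fit_log)                                    # roughly (2.2, 0.36)

# 2) Direct nonlinear regression on y (additive error)
fit_nls <- nls(y ~ a * exp(b * t), start = list(a = 10, b = 0.3))
coef(fit_nls)                                    # roughly (9.5, 0.39)
```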

Confidence interval vs credible interval

1. Confidence interval (frequentist). Definition: a 95% confidence interval is a procedure that, if repeated many times on new data generated under the same conditions, would contain the true parameter 95% of the time. Key point: the parameter is fixed but unknown; the interval is random. Correct interpretation: "If we were to repeat this study infinitely many times and compute a 95% confidence interval each time, 95% of those intervals would contain the true parameter." Incorrect (but common) interpretation: "There is a 95% probability that the true parameter lies within this interval." ❌ That statement is not valid in frequentist statistics. Example: you estimate a mean nest temperature and obtain a 95% CI of [28.1, 29.3] °C. You cannot assign a probability to the true mean being inside this specific interval: either it is or it isn't. 2. Credible interval (Bayesian). Definition: a 95% credible interval is an interval within which the parameter...
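A minimal simulation sketch (the numbers are only illustrative, loosely inspired by the nest-temperature example): the 95% is a property of the procedure over repeated sampling, not of any single interval.

```r
set.seed(1)
true_mean <- 28.7
covered <- replicate(10000, {
  ci <- t.test(rnorm(25, mean = true_mean, sd = 1.5))$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covered)   # close to 0.95: long-run coverage of the procedure
```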

The normality rule for lm: what should be normal

Let us answer this question: "In an lm, is it important that the variables are Gaussian, or is it the residuals?" In a linear model (lm), it is the residuals that matter, not the distribution of the original variables. Here is the key distinction. ✅ What must be (approximately) Gaussian? The residuals (errors), conditional on the predictors, should be roughly normally distributed if you want valid confidence intervals, standard errors, and p-values. ❌ What does not need to be Gaussian? The raw variables (predictors or response) do NOT need to follow a normal distribution; linear regression works fine with skewed variables, non-Gaussian predictors, etc. Why residual normality matters: normality of residuals ensures that estimates of standard errors are valid and that hypothesis tests (t-tests, F...
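A minimal sketch with simulated data: the predictor is strongly skewed, yet the model is perfectly fine because the residuals, conditional on the predictor, are Gaussian.

```r
set.seed(1)
x <- rexp(200)                        # strongly non-Gaussian predictor
y <- 2 + 3 * x + rnorm(200, sd = 0.5) # Gaussian errors around the linear mean
fit <- lm(y ~ x)
qqnorm(residuals(fit)); qqline(residuals(fit))  # this is what should look normal
hist(x)                               # the raw predictor need not be normal
```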

rgl in macOS Tahoe

 To all rgl users on MacOS: I've had several reports that rgl's X11 windows don't work on MacOS Tahoe.  From a limited amount of research, this looks like something that will not be easily fixed.  It will require changes to Tahoe or XQuartz, and neither of those look likely at this time. You can still use the WebGL display, where your scenes are displayed by running rglwidget().  This can be set to happen automatically by setting the options   options(rgl.useNULL = TRUE, rgl.printRglwidget = TRUE) The rglwidget() display is not quite as friendly as the X11 display, but it's the best you will get for now at least.  If you have a choice, I'd suggest avoiding the update to MacOS Tahoe. Right now I'm trying to see if I can reinstall Sequoia... Duncan Murdoch

Show the student_t(20, 0, 2) brms prior in R

Here is how to show the prior "student_t(20, 0, 2)" from brms in R:

# Parameters
df <- 20     # degrees of freedom
mu <- 0      # mean (location)
sigma <- 2   # scale
# Define x range (±5 scale units around the location)
x <- seq(mu - 5*sigma, mu + 5*sigma, length.out = 1000)
# Student-t density with location mu and scale sigma
y <- dt((x - mu) / sigma, df = df) / sigma
# Plot
plot(x, y, type = "l", lwd = 2, col = "blue",
     main = "Student-t(20, 0, 2)",
     xlab = "x", ylab = "Density")

For a prior "lognormal(-0.2, 0.8)":

meanlog <- -0.2
sdlog <- 0.8
x <- seq(from = 0, to = 10, length = 100)
y <- dlnorm(x, meanlog, sdlog)
plot(x, y, type = "l")

What was stored in a .Rdata file

If you load a .Rdata file into an environment that already contains many objects, it can be difficult to know what is new after the loading. The solution is to print the names returned by load(): print(load("yourobject.Rdata"))

Read ECMWF ERA5 data

ERA5 is the latest development in the ERA series, and improves significantly on its predecessors by: - offering a higher horizontal resolution of 31 km and 137 vertical levels from the surface up to 0.01 hPa (around 80 km); - using a more recent and advanced version of the ECMWF IFS model; - providing hourly estimates of atmospheric variables; - providing a consistent representation of uncertainties for these variables; - using more satellite observations in the data assimilation. Here is some information on accessing this database using a Python batch script. Please visit http://climate.copernicus.eu/climate-reanalysis for more information, including documentation and guidelines on how to download the data. https://codes.ecmwf.int/grib/param-db/139 Soil layers: Layer 1: 0-7 cm; Layer 2: 7-28 cm; Layer 3: 28-100 cm; Layer 4: 100-289 cm. Soil temperature is set at the middle of each layer, and heat transfer is calculated at the interfaces between them. It is assumed that there is no heat transfer out of th...
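Once a file has been downloaded from the CDS, it can be read in R; here is a minimal sketch, assuming the data were retrieved as NetCDF and that the usual short name "stl1" is used for soil temperature level 1 (the file name is only an example).

```r
library(ncdf4)
nc <- nc_open("era5_soil.nc")       # hypothetical file downloaded from the CDS
print(nc)                           # list the variables and dimensions actually present
stl1 <- ncvar_get(nc, "stl1")       # soil temperature, layer 1 (0-7 cm), in kelvin
lon  <- ncvar_get(nc, "longitude")
lat  <- ncvar_get(nc, "latitude")
nc_close(nc)
stl1_c <- stl1 - 273.15             # convert to degrees Celsius
```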

Compile gam on R 4.6.0

Error because the header libintl.h is not found; to solve it:

brew install gettext
cd /opt/R/arm64/include/
sudo ln -s /opt/homebrew/include/libintl.h libintl.h

Force all labels to be shown on an axis

atT <- structure(c(1672520400, 1675198800, 1677618000, 1680296400, 1682888400,
                   1685566800, 1688158800, 1690837200, 1693515600, 1696107600, 1698786000,
                   1701378000, 1704056400, 1706734800, 1709240400, 1711918800, 1714510800,
                   1717189200, 1719781200, 1722459600, 1725138000, 1727730000, 1730408400,
                   1733000400, 1735678800), class = c("POSIXct", "POSIXt"), tzone = "Asia/Riyadh")

# Works as expected
plot(x = atT, y = rep(1, length(atT)), xaxt = "n")
axis(side = 1, at = atT, labels = as.character(seq_along(atT)))

# Only the first label is shown
plot(x = atT, y = rep(1, length(atT)), xaxt = "n")
axis(1, at = atT, labels = c("Jan-2023", "", "", "", "", "", "Jul-2023", "", "", "", "", "",
                             ...
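The excerpt is truncated here; one common workaround (not necessarily the one used in the post) is to draw each label with its own axis() call, so that the automatic suppression of overlapping labels cannot drop any of them.

```r
labs <- format(atT, "%b-%Y")
plot(x = atT, y = rep(1, length(atT)), xaxt = "n")
for (i in seq_along(atT)) axis(1, at = atT[i], labels = labs[i])
```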

Likelihood of bivariate Gaussian distribution

Let X be one observation composed of two values (x1, x2) drawn from Gaussian distributions with parameters (µ1, s1) and (µ2, s2), where rho is the correlation coefficient between the two Gaussians. The likelihood is the bivariate normal density, whose covariance matrix is built from the standard deviations and the correlation: library(mvtnorm) dmvnorm(x = c(x1, x2), mean = c(µ1, µ2), sigma = matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), ncol = 2))
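A quick check with illustrative numbers: when rho = 0, the bivariate likelihood reduces to the product of the two univariate Gaussian densities.

```r
library(mvtnorm)
x1 <- 1.2; x2 <- -0.5; mu1 <- 1; mu2 <- 0; s1 <- 2; s2 <- 0.8; rho <- 0
Sigma <- matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), ncol = 2)
dmvnorm(c(x1, x2), mean = c(mu1, mu2), sigma = Sigma)   # bivariate density
dnorm(x1, mu1, s1) * dnorm(x2, mu2, s2)                 # same value when rho = 0
```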

Options for svg graphics in RMarkdown

These options seem to be ok:

```{r setup, include=TRUE}
knitr::opts_chunk$set(fig.path = 'figs/xxx_',
                      dev = 'svg',
                      concordance = TRUE,
                      echo = TRUE,
                      fig.width = 12, fig.height = 8,
                      dev.args = list(pointsize = 16))
```

Error when gfortran is used for package installation from source

macOS 15.3, gcc installed using Homebrew. I get this error:

ld: warning: search path '/opt/gfortran/lib/gcc/aarch64-apple-darwin20.0/12.2.0' not found
ld: warning: search path '/opt/gfortran/lib' not found
ld: library 'gfortran' not found

I solved it by creating a ~/.R/Makevars file like this:

FC      = /opt/homebrew/opt/gfortran/bin/gfortran
F77     = /opt/homebrew/opt/gfortran/bin/gfortran
FLIBS   = -L/opt/homebrew/opt/gfortran/lib

Now I can compile packages with gfortran. If it does not find the libintl.h header, for example when installing the robustbase package, use:

brew install gettext
cd /opt/R/arm64/include/
sudo ln -s /opt/homebrew/include/libintl.h libintl.h

optimx error in macOS

Since this morning (5/2/2025), I get an error when I try to load the optimx package: > library("optimx") Error: package or namespace load failed for 'optimx' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/nloptr/libs/nloptr.so':   dlopen(/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/nloptr/libs/nloptr.so, 0x0006): Library not loaded: /opt/homebrew/opt/nlopt/lib/libnlopt.0.dylib   Referenced from: <2A7DAFE2-8123-3555-9569-F28214328E47> /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/nloptr/libs/nloptr.so   Reason: tried: '/opt/homebrew/opt/nlopt/lib/libnlopt.0.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/nlopt/lib/libnlopt.0.dylib' (no such file), '/opt/homebrew/opt/nlopt/lib/libnlopt.0.dylib' (no such file), '/Libra...
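The excerpt is truncated; a plausible remedy (not necessarily the author's) when libnlopt.0.dylib has disappeared after a Homebrew update is to restore the library and rebuild nloptr from source so it links against the version now installed.

```r
# In a terminal, outside R (assumption: nlopt was removed or bumped by Homebrew):
#   brew reinstall nlopt
install.packages("nloptr", type = "source")  # rebuild against the current libnlopt
library(optimx)                              # should now load
```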