Articles

Why do p-values over-estimate first-order risk?

The short answer is: 👉 Because a p-value is computed conditional on the null hypothesis being true, it does not represent the probability of making a Type I error in the situation you are actually in. When it is interpreted as such, it systematically overstates (over-estimates) the "first-order risk". Below is the precise reasoning.

1. What "first-order risk" really is
The Type I error rate (first-order risk) is
α = P(reject H0 | H0 is true).
This is a long-run, pre-specified property of a decision rule (e.g. "reject if p < 0.05"). It is not a probability about the current experiment.

2. What a p-value actually is
A p-value is
p = P(T ≥ t_obs | H0).
Key points:
• it is conditional on H0 being true
• it is not P(H0 | data)
• it is not P(Type I error)

3. Where the over-estimation comes from
The common (incorrect) interpretation: "If...
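As an illustration (a minimal R sketch, not from the post): data simulated under H0 show that individual p-values vary from sample to sample, while the 5% Type I error rate is a long-run property of the rule "reject if p < 0.05".

  # Simulate many experiments in which H0 (true mean = 0) holds
  set.seed(123)
  pvals <- replicate(10000, t.test(rnorm(20))$p.value)
  hist(pvals)          # roughly uniform on [0, 1] under H0
  mean(pvals < 0.05)   # close to 0.05: the long-run property of the decision rule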

Hypotheses and ANOVA

👉 For ANOVA, both the homogeneity of variances and the normality assumptions concern the errors of the model, so they should be assessed on the residuals. Below is the precise reasoning, with practical nuances.

1. What ANOVA actually assumes
The classical ANOVA model is
Y_ij = μ + α_i + ε_ij
with the assumptions:
• Normality: ε_ij ~ N(0, σ²)
• Homoscedasticity: Var(ε_ij) = σ² for all groups
• Independence of the ε_ij
So both assumptions apply to the errors, not to the raw response Y.

2. Consequences for diagnostics
✅ Normality
Should be assessed on the residuals, not on the original data: raw data can be non-normal simply because the group means differ. Correct tools: Q–Q plot of the residuals, histogram of the residuals, Shapiro–Wilk test on the residuals (with caution).
✅ Homogeneity of variances
Also concerns residual variance ...
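A minimal R sketch of these residual checks, assuming a hypothetical data frame dat with a response y and a grouping factor g (names not from the post):

  # One-way ANOVA fit
  fit <- aov(y ~ g, data = dat)
  res <- residuals(fit)

  # Normality: assessed on the residuals, not on the raw response
  qqnorm(res); qqline(res)
  hist(res)
  shapiro.test(res)        # with caution, especially for large samples

  # Homogeneity of variances: residual spread against fitted group means
  plot(fitted(fit), res)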

Install gsl package in Ubuntu 24.04

You must first install the GSL development headers at the system level with sudo apt install libgsl-dev, and then you can install the gsl package in R with install.packages("gsl").
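The full sequence, as a minimal sketch (the apt command is run in a terminal, not in R; the bessel_J0() call is only an assumed quick check that the GSL bindings load, not part of the original instructions):

  # In a terminal first (not in R):
  #   sudo apt install libgsl-dev

  # Then, in R:
  install.packages("gsl")
  library(gsl)
  bessel_J0(1)   # should return about 0.765 if the bindings work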

Fitting an exponential model with log(y) = a + b t or y = exp(a + b t)

Data
x = 2010:2020 (11 points)
y = (10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)
To simplify interpretation, the year is often centered: t = x − 2010 = 0, 1, …, 10.

1️⃣ Linear regression on log(y)
Model: log(y) = α + β t + ε
Key assumption: the error is additive on the log scale, therefore multiplicative on the original scale.
Fit (order of magnitude): one typically obtains something like log(y) ≈ 2.2 + 0.36 t.
Back on the original scale: ŷ = exp(2.2 + 0.36 t).
👉 regular exponential growth
👉 relative errors are roughly constant
👉 small values have as much weight as large ones

2️⃣ Direct nonlinear regression on y
Model: y = a exp(b t) + ε
Key assumption: the error is additive on y; the variance is assumed constant on the original scale.
Typical fit: ŷ ≈ 9.5 exp(0.39 t).
Consequences: large values (300, 400) strongly dominate the fit; early years ...
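A minimal R sketch of both fits on the data above (the coefficients quoted in the post are orders of magnitude, so the exact output will differ somewhat):

  x <- 2010:2020
  y <- c(10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)
  t <- x - 2010                       # centered year: 0, 1, ..., 10

  # 1) Linear regression on log(y): multiplicative error on the original scale
  fit_log <- lm(log(y) ~ t)
  coef(fit_log)                       # intercept and slope on the log scale
  # back-transform: y_hat = exp(coef(fit_log)[1] + coef(fit_log)[2] * t)

  # 2) Direct nonlinear regression on y: additive error, constant variance
  fit_nls <- nls(y ~ a * exp(b * t), start = list(a = 10, b = 0.4))
  coef(fit_nls)                       # the large values dominate this fit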

Confidence interval vs credible interval

1. Confidence interval (frequentist)
Definition: a 95% confidence interval is a procedure that, if repeated many times on new data generated under the same conditions, would contain the true parameter 95% of the time.
Key point: the parameter is fixed but unknown; the interval is random.
Correct interpretation: "If we were to repeat this study infinitely many times and compute a 95% confidence interval each time, 95% of those intervals would contain the true parameter."
Incorrect (but common) interpretation: "There is a 95% probability that the true parameter lies within this interval." ❌ That statement is not valid in frequentist statistics.
Example: you estimate a mean nest temperature and obtain a 95% CI of [28.1, 29.3] °C. You cannot assign a probability to the true mean being inside this specific interval; either it is or it isn't.

2. Credible interval (Bayesian)
Definition: a 95% credible interval is an interval within which the parameter...
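A small R sketch of the contrast (the temperature data are simulated purely for illustration and are not the post's values): under a noninformative Jeffreys prior on the mean and variance of a normal model, the 95% credible interval happens to coincide numerically with the t-based confidence interval, so the two differ in interpretation rather than in their limits.

  # Hypothetical nest-temperature sample (simulated, for illustration only)
  set.seed(1)
  temp <- rnorm(25, mean = 28.7, sd = 1.5)

  # Frequentist 95% confidence interval (t-based)
  t.test(temp)$conf.int

  # Bayesian 95% credible interval under the Jeffreys prior: the posterior of
  # the mean is a shifted, scaled t distribution, so the limits match the
  # t interval; only the probability statement attached to them differs
  n <- length(temp); m <- mean(temp); s <- sd(temp)
  m + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)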

The normality rule for lm: what should be normal

Let us answer this question: "In a lm, is it important that the variables are distributed Gaussianly, or is it the residuals?"

In a linear model (lm), it is the residuals that matter, not the distribution of the original variables. Here is the key distinction:

✅ What must be (approximately) Gaussian?
• The residuals (errors), conditional on the predictors, should be roughly normally distributed if you want valid confidence intervals, standard errors, and p-values.

❌ What does not need to be Gaussian?
• The raw variables (predictor or response) do NOT need to follow a normal distribution.
• Linear regression works fine with skewed variables, non-Gaussian predictors, etc.

Why residual normality matters
Normality of residuals ensures:
• estimates of standard errors are valid
• hypothesis tests (t-tests, F...
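A minimal R sketch (simulated data, purely illustrative): a strongly skewed predictor combined with Gaussian errors is perfectly fine, and the normality check is made on the residuals.

  set.seed(42)
  x <- rexp(100)                      # clearly non-Gaussian predictor
  y <- 2 + 3 * x + rnorm(100)         # Gaussian errors
  fit <- lm(y ~ x)

  # Assess normality on the residuals, not on x or y
  qqnorm(resid(fit)); qqline(resid(fit))
  shapiro.test(resid(fit))            # on the residuals, with the usual caveats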

rgl on macOS Tahoe

To all rgl users on MacOS: I've had several reports that rgl's X11 windows don't work on MacOS Tahoe. From a limited amount of research, this looks like something that will not be easily fixed. It will require changes to Tahoe or XQuartz, and neither of those looks likely at this time.

You can still use the WebGL display, where your scenes are displayed by running rglwidget(). This can be set to happen automatically by setting the options

  options(rgl.useNULL = TRUE, rgl.printRglwidget = TRUE)

The rglwidget() display is not quite as friendly as the X11 display, but it's the best you will get for now, at least. If you have a choice, I'd suggest avoiding the update to MacOS Tahoe. Right now I'm trying to see if I can reinstall Sequoia...

Duncan Murdoch
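For readers who want the workaround in one place, a minimal sketch based on the options quoted above (the plot3d() call is just an arbitrary example scene, not part of the original message):

  library(rgl)
  options(rgl.useNULL = TRUE, rgl.printRglwidget = TRUE)

  # Scenes are now rendered with WebGL instead of X11
  plot3d(rnorm(100), rnorm(100), rnorm(100))
  rglwidget()   # with rgl.printRglwidget = TRUE this also happens automatically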