Articles

Showing posts from December, 2025

Why do p-values over-estimate the first-order risk?

The short answer is: 👉 Because a p-value is computed conditional on the null hypothesis being true, it does not represent the probability of making a Type I error in the situation you are actually in. When it is interpreted as such, it systematically overstates (over-estimates) the “first-order risk”. Below is the precise reasoning.

1. What “first-order risk” really is
The Type I error rate (first-order risk) is:

α = P(reject H0 ∣ H0 is true)

This is a long-run, pre-specified property of a decision rule (e.g. “reject if p < 0.05”). It is not a probability about the current experiment.

2. What a p-value actually is
A p-value is:

p = P(T ≥ t_obs ∣ H0)

Key points:
• It is conditional on H0 being true
• It is not P(H0 ∣ data)
• It is not P(Type I error)

3. Where the over-estimation comes from
The common (incorrect) interpretation: “If...
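To make the long-run nature of α concrete, here is a minimal R simulation sketch (hypothetical example, not from the post): under a true H0, the rule “reject if p < 0.05” fires about 5% of the time, and that long-run rate, not any individual p-value, is the first-order risk.

# Simulate many experiments where H0 is true (mean really is 0)
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(30))$p.value)

mean(pvals < 0.05)  # close to 0.05: alpha is a property of the decision rule
head(pvals)         # a single run's p-value is not P(Type I error) for that run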

Hypotheses and ANOVA

👉 For ANOVA, both the homogeneity-of-variances and the normality assumptions concern the errors of the model, so they should be assessed on the residuals. Below is the precise reasoning, with practical nuances.

1. What ANOVA actually assumes
The classical ANOVA model is:

Y_ij = μ + α_i + ε_ij

with the assumptions:
• Normality: ε_ij ~ N(0, σ²)
• Homoscedasticity: Var(ε_ij) = σ² for all groups
• Independence of the ε_ij

So both assumptions apply to the errors, not to the raw response Y.

2. Consequences for diagnostics
✅ Normality
Should be assessed on the residuals, not on the original data. Raw data can be non-normal simply because the group means differ. Correct tools:
• Q–Q plot of the residuals
• Histogram of the residuals
• Shapiro–Wilk test on the residuals (with caution)

✅ Homogeneity of variances
Also concerns the residual variance...
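As a quick illustration (a sketch with simulated data; the group labels are made up), the diagnostics are run on the residuals of the fitted model, not on y itself:

# Three hypothetical groups with different means but a common error sd
set.seed(1)
g <- gl(3, 20, labels = c("A", "B", "C"))
y <- c(10, 14, 20)[g] + rnorm(60, sd = 2)
fit <- aov(y ~ g)

# Normality: assess the residuals (the raw y is trimodal here, by design)
qqnorm(residuals(fit)); qqline(residuals(fit))
shapiro.test(residuals(fit))        # with caution, as noted above

# Homogeneity of variances: also a statement about the residual variance
bartlett.test(residuals(fit) ~ g)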

Install the gsl package on Ubuntu 24.04

You must first install the GSL development library:

sudo apt install libgsl-dev

and then you can install the gsl package in R:

install.packages("gsl")
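A quick way to check that the installation worked (lambert_W0 is one of the functions exported by the gsl package):

library(gsl)
lambert_W0(1)   # should return the omega constant, about 0.5671433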

Fitting an exponential model with log(y) = a + b t or y = exp(a + b t)

Data
x = 2010:2020 (11 points)
y = (10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)

To simplify interpretation, the year is often centered: t = x − 2010 = 0, 1, …, 10

1️⃣ Linear regression on log(y)
Model: log(y) = α + β t + ε
Key assumption: the error is additive on the log scale, therefore multiplicative on the original scale.
Fit (order of magnitude): one typically obtains something like log(y) ≈ 2.2 + 0.36 t
Back on the original scale: ŷ = exp(2.2 + 0.36 t)
👉 regular exponential growth
👉 relative errors are roughly constant
👉 small values have as much weight as large ones

2️⃣ Direct nonlinear regression on y
Model: y = a e^(b t) + ε
Key assumption: the error is additive on y; the variance is assumed constant on the original scale.
Typical fit: ŷ ≈ 9.5 e^(0.39 t)
Consequences:
• large values (300, 400) strongly dominate the fit
• early years ...
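A minimal R sketch of both fits on the data above (the coefficients you obtain will be close to, but not exactly, the rounded figures quoted in the excerpt):

t <- 0:10
y <- c(10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)

# 1) Linear regression on log(y): multiplicative error on the original scale
fit_log <- lm(log(y) ~ t)
coef(fit_log)           # compare with log(y) ≈ a + b t above
exp(fitted(fit_log))    # back-transformed fitted values

# 2) Direct nonlinear least squares on y: additive error, large values dominate
fit_nls <- nls(y ~ a * exp(b * t), start = list(a = 10, b = 0.4))
coef(fit_nls)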

Confidence interval vs credible interval

1. Confidence interval (frequentist)
Definition: a 95% confidence interval is a procedure that, if repeated many times on new data generated under the same conditions, would contain the true parameter 95% of the time.
Key point: the parameter is fixed but unknown; the interval is random.
Correct interpretation: “If we were to repeat this study infinitely many times and compute a 95% confidence interval each time, 95% of those intervals would contain the true parameter.”
Incorrect (but common) interpretation: “There is a 95% probability that the true parameter lies within this interval.” ❌ That statement is not valid in frequentist statistics.
Example: you estimate a mean nest temperature and obtain a 95% CI of [28.1, 29.3] °C. You cannot assign a probability to the true mean being inside this specific interval: either it is or it isn’t.

2. Credible interval (Bayesian)
Definition: a 95% credible interval is an interval within which the parameter...
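As a numeric illustration (simulated nest temperatures, values hypothetical): under the standard noninformative prior p(μ, σ²) ∝ 1/σ², the 95% credible interval for the mean happens to coincide numerically with the 95% confidence interval, even though the two interpretations differ.

set.seed(1)
temp <- rnorm(20, mean = 28.7, sd = 1.2)   # hypothetical nest temperatures
n <- length(temp)

# Frequentist 95% confidence interval
t.test(temp)$conf.int

# Bayesian 95% credible interval: with the noninformative prior above, the
# posterior of the mean is a scaled t-distribution, so the bounds match
mean(temp) + qt(c(0.025, 0.975), df = n - 1) * sd(temp) / sqrt(n)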

The normality rule for lm: what should be normal

Let’s answer this question: “In an lm, is it important that the variables are distributed Gaussianly, or is it the residuals?”

In a linear model (lm), it is the residuals that matter, not the distribution of the original variables. Here is the key distinction:

✅ What must be (approximately) Gaussian?
• The residuals (errors), conditional on the predictors, should be roughly normally distributed if you want valid confidence intervals, standard errors, and p-values.

❌ What does not need to be Gaussian?
• The raw variables (predictor or response) do NOT need to follow a normal distribution.
• Linear regression works fine with skewed variables, non-Gaussian predictors, etc.

Why residual normality matters
Normality of the residuals ensures:
• estimates of standard errors are valid
• hypothesis tests (t-tests, F...
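A small R sketch (simulated data, names hypothetical): the predictor is strongly skewed, yet the residual diagnostics are clean because the errors themselves are Gaussian.

set.seed(42)
x <- rexp(100)               # deliberately non-Gaussian (skewed) predictor
y <- 2 + 3 * x + rnorm(100)  # Gaussian errors around the linear trend
fit <- lm(y ~ x)

# The diagnostics apply to the residuals, which look normal here
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))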