Articles

Showing posts from December, 2025

Why do p-values over-estimate the first-order risk?

The short answer is: 👉 Because a p-value is computed conditional on the null hypothesis being true, it does not represent the probability of making a Type I error in the situation you are actually in. When it is interpreted as such, it systematically overstates (over-estimates) the “first-order risk”. Below is the precise reasoning.

1. What “first-order risk” really is
The Type I error rate (first-order risk) is:

α = P(reject H0 ∣ H0 is true)

This is a long-run, pre-specified property of a decision rule (e.g. “reject if p < 0.05”). It is not a probability about the current experiment.

2. What a p-value actually is
A p-value is:

p = P(T ≥ t_obs ∣ H0)

Key points:
• It is conditional on H0 being true
• It is not P(H0 ∣ data)
• It is not P(Type I error)

3. Where the over-estimation comes from
The common (incorrect) interpretation: “If...
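To make the long-run nature of α concrete, here is a minimal R simulation sketch (hypothetical example, not from the post): under a true H0, the rule “reject if p < 0.05” fires about 5% of the time, and that long-run rate, not any individual p-value, is the first-order risk.

# Simulate many experiments where H0 is true (mean really is 0)
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(30))$p.value)

mean(pvals < 0.05)  # close to 0.05: alpha is a property of the decision rule
head(pvals)         # a single run's p-value is not P(Type I error) for that run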

Hypotheses and ANOVA

👉 For ANOVA, both the homogeneity-of-variances and the normality assumptions concern the errors of the model, so they should be assessed on the residuals. Below is the precise reasoning, with practical nuances.

1. What ANOVA actually assumes
The classical ANOVA model is:

Y_ij = μ + α_i + ε_ij

with the assumptions:
• Normality: ε_ij ~ N(0, σ²)
• Homoscedasticity: Var(ε_ij) = σ² for all groups
• Independence of the ε_ij

So both assumptions apply to the errors, not to the raw response Y.

2. Consequences for diagnostics
✅ Normality
Should be assessed on the residuals, not on the original data. Raw data can be non-normal simply because the group means differ. Correct tools:
• Q–Q plot of the residuals
• Histogram of the residuals
• Shapiro–Wilk test on the residuals (with caution)

✅ Homogeneity of variances
Also concerns the residual variance...
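As a quick illustration (a sketch with simulated data; the group labels are made up), the diagnostics are run on the residuals of the fitted model, not on y itself:

# Three hypothetical groups with different means but a common error sd
set.seed(1)
g <- gl(3, 20, labels = c("A", "B", "C"))
y <- c(10, 14, 20)[g] + rnorm(60, sd = 2)
fit <- aov(y ~ g)

# Normality: assess the residuals (the raw y is trimodal here, by design)
qqnorm(residuals(fit)); qqline(residuals(fit))
shapiro.test(residuals(fit))        # with caution, as noted above

# Homogeneity of variances: also a statement about the residual variance
bartlett.test(residuals(fit) ~ g)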

Install the gsl package on Ubuntu 24.04

You must first install the GSL development library:

sudo apt install libgsl-dev

and then you can install the gsl package in R:

install.packages("gsl")
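A quick way to check that the installation worked (lambert_W0 is one of the functions exported by the gsl package):

library(gsl)
lambert_W0(1)   # should return the omega constant, about 0.5671433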

Fitting an exponential model with log(y) = a + b t or y = exp(a + b t)

Data
x = 2010:2020 (11 points)
y = (10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)

To simplify interpretation, the year is often centered: t = x − 2010 = 0, 1, …, 10

1️⃣ Linear regression on log(y)
Model: log(y) = α + β t + ε
Key assumption: the error is additive on the log scale, therefore multiplicative on the original scale.
Fit (order of magnitude): one typically obtains something like log(y) ≈ 2.2 + 0.36 t
Back on the original scale: ŷ = exp(2.2 + 0.36 t)
👉 regular exponential growth
👉 relative errors are roughly constant
👉 small values have as much weight as large ones

2️⃣ Direct nonlinear regression on y
Model: y = a e^(b t) + ε
Key assumption: the error is additive on y; the variance is assumed constant on the original scale.
Typical fit: ŷ ≈ 9.5 e^(0.39 t)
Consequences:
• large values (300, 400) strongly dominate the fit
• early years ...
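A minimal R sketch of both fits on the data above (the coefficients you obtain will be close to, but not exactly, the rounded figures quoted in the excerpt):

t <- 0:10
y <- c(10, 10, 15, 20, 30, 60, 100, 120, 200, 300, 400)

# 1) Linear regression on log(y): multiplicative error on the original scale
fit_log <- lm(log(y) ~ t)
coef(fit_log)           # compare with log(y) ≈ a + b t above
exp(fitted(fit_log))    # back-transformed fitted values

# 2) Direct nonlinear least squares on y: additive error, large values dominate
fit_nls <- nls(y ~ a * exp(b * t), start = list(a = 10, b = 0.4))
coef(fit_nls)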

Confidence interval vs credible interval

1. Confidence interval (frequentist)
Definition: a 95% confidence interval is a procedure that, if repeated many times on new data generated under the same conditions, would contain the true parameter 95% of the time.
Key point: the parameter is fixed but unknown; the interval is random.
Correct interpretation: “If we were to repeat this study infinitely many times and compute a 95% confidence interval each time, 95% of those intervals would contain the true parameter.”
Incorrect (but common) interpretation: “There is a 95% probability that the true parameter lies within this interval.” ❌ That statement is not valid in frequentist statistics.
Example: you estimate a mean nest temperature and obtain a 95% CI of [28.1, 29.3] °C. You cannot assign a probability to the true mean being inside this specific interval: either it is or it isn’t.

2. Credible interval (Bayesian)
Definition: a 95% credible interval is an interval within which the parameter...
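As a numeric illustration (simulated nest temperatures, values hypothetical): under the standard noninformative prior p(μ, σ²) ∝ 1/σ², the 95% credible interval for the mean happens to coincide numerically with the 95% confidence interval, even though the two interpretations differ.

set.seed(1)
temp <- rnorm(20, mean = 28.7, sd = 1.2)   # hypothetical nest temperatures
n <- length(temp)

# Frequentist 95% confidence interval
t.test(temp)$conf.int

# Bayesian 95% credible interval: with the noninformative prior above, the
# posterior of the mean is a scaled t-distribution, so the bounds match
mean(temp) + qt(c(0.025, 0.975), df = n - 1) * sd(temp) / sqrt(n)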

The normality rule for lm: what should be normal

Let’s answer this question: “In an lm, is it important that the variables are distributed Gaussianly, or is it the residuals?”

In a linear model (lm), it is the residuals that matter, not the distribution of the original variables. Here is the key distinction:

✅ What must be (approximately) Gaussian?
• The residuals (errors), conditional on the predictors, should be roughly normally distributed if you want valid confidence intervals, standard errors, and p-values.

❌ What does not need to be Gaussian?
• The raw variables (predictor or response) do NOT need to follow a normal distribution.
• Linear regression works fine with skewed variables, non-Gaussian predictors, etc.

Why residual normality matters
Normality of the residuals ensures:
• estimates of standard errors are valid
• hypothesis tests (t-tests, F...
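A small R sketch (simulated data, names hypothetical): the predictor is strongly skewed, yet the residual diagnostics are clean because the errors themselves are Gaussian.

set.seed(42)
x <- rexp(100)               # deliberately non-Gaussian (skewed) predictor
y <- 2 + 3 * x + rnorm(100)  # Gaussian errors around the linear trend
fit <- lm(y ~ x)

# The diagnostics apply to the residuals, which look normal here
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))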