What a P-Value Actually Means (and Five Things It Doesn't)

The Sentence That Starts This Conversation

Imagine you've collected data from an experiment comparing two nitrogen treatments on paddy rice — a control (no nitrogen) and urea at 80 kg N/ha — across eight replicate plots each. You harvest, weigh the grain, and run a t-test in R. The output tells you p = 0.03. You write, in your Results section:

"The urea treatment had a statistically significant effect on grain yield (p = 0.03)."

That sentence is probably wrong — or at least, it claims considerably more than the number in front of you can support.

Not because the analysis was done badly. Because p-values are among the most routinely misread outputs in all of quantitative research, and the misreading is built into the language most researchers use without noticing: "significant," "proved," "due to chance." A 2016 study published in Psychonomic Bulletin & Review found that the majority of researchers — including psychology professors who teach statistics — described p-values incorrectly when given a list of definitions and asked to select which ones were accurate. The same survey found that even experienced statisticians made errors at rates higher than you'd expect from a group who should know better.

This is not a beginner mistake. It runs through the literature of every empirical field.

This post fixes it. One definition. Five specific misreadings, each with the wrong sentence, the correction, and an explanation of why the mistake is so tempting. Then what to write instead.

What a P-Value Actually Is

The definition has been stated precisely many times, and it is still widely misread because it looks like it's saying something different from what it actually says. Here it is:

The definition

A p-value is the probability of observing data at least as extreme as what you observed, assuming the null hypothesis is true.

Three parts, all of which matter. Let's unpack them with the fertilizer data.

The null hypothesis is the assumption you're testing against — in this case, that nitrogen treatment has no effect on yield. The population means of the two groups are equal. Your experiment produced a gap of 0.52 t/ha between the control group (mean 3.12 t/ha) and the urea group (mean 3.64 t/ha). The t-test asks: if the null hypothesis were true and the true population means were identical, how often would random sampling produce a gap this large or larger?

The answer is 3% of the time. That is what p = 0.03 tells you.

Notice what it is not telling you. It is not telling you the probability that your result is "real." It is not telling you the probability that nitrogen treatment works. It is not partitioning your result into a real part and a chance part. It is a conditional probability — the probability of the data, given an assumed world in which H₀ is true.

That single distinction, between P(data | H₀) and P(H₀ | data), underlies most of the misconceptions in this post.
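The definition can be checked by brute force. The sketch below is an illustration, not the analysis behind the numbers above: it assumes normally distributed plot yields, borrows the pooled SD of about 0.42 t/ha quoted later in the post, and the names `n_per_group`, `sd_plot` and `n_sims` are made up for the example. It simulates many experiments in a world where H₀ is true by construction and counts how often the t statistic is at least as extreme as the observed one.

```r
# Simulate the null world: both groups come from the same population.
set.seed(42)                      # arbitrary seed, for reproducibility
n_per_group <- 8                  # replicate plots per treatment (from the post)
sd_plot     <- 0.42               # assumed plot-to-plot SD, t/ha
obs_diff    <- 0.52               # observed gap in mean yield, t/ha
n_sims      <- 10000              # number of simulated null experiments

# Observed t statistic implied by these summary numbers
t_obs <- obs_diff / (sd_plot * sqrt(2 / n_per_group))

# t statistics from experiments in which H0 is true by construction
t_null <- replicate(n_sims, {
  control <- rnorm(n_per_group, mean = 3.12, sd = sd_plot)
  urea    <- rnorm(n_per_group, mean = 3.12, sd = sd_plot)  # same mean: no effect
  t.test(urea, control, var.equal = TRUE)$statistic
})

# Share of null experiments at least as extreme as the observed one
p_sim <- mean(abs(t_null) >= t_obs)
p_sim
```

Under these assumptions the simulated share should land close to the analytic two-sided value, `2 * pt(-t_obs, df = 14)`, which is roughly 0.03. Nothing in the calculation refers to the probability that H₀ itself is true; H₀ is simply switched on for every simulated experiment.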

1. It's not the probability the null hypothesis is true

What researchers write

"p = 0.03 means there's only a 3% chance the null hypothesis is true."

The null hypothesis is either true or it isn't. It's a statement about the world (about whether nitrogen really does change yield in this crop under these conditions), and in the frequentist framework behind the t-test, the world doesn't come with a probability distribution attached. Saying H₀ has a 3% probability of being true would require a prior probability for the hypothesis, which the test never uses and the p-value does not contain.

What the p-value does is condition on H₀ being true. It asks: if this is the world, how likely is your data? It cannot simultaneously be the probability that this assumed world is wrong.

Why the mistake is tempting

P(data | H₀) and P(H₀ | data) look like they should be interchangeable by symmetry. They're not — flipping that conditional requires Bayes' theorem and prior probabilities that you don't have. Our intuition for conditional probability is poor, and the temptation to flip the conditioning is extremely natural.
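A small calculation shows why the flip needs extra inputs. The sketch below is a simplification: it looks at results declared significant at a threshold `alpha` rather than at one exact p-value, and the prior share of true nulls (`prior_h0`) and the power of the test (`power`) are invented inputs, not quantities from the fertilizer experiment.

```r
# P(H0 | significant result) via Bayes' theorem, under assumed inputs.
prob_h0_given_sig <- function(prior_h0, power, alpha = 0.05) {
  false_pos <- alpha * prior_h0          # P(significant and H0 true)
  true_pos  <- power * (1 - prior_h0)    # P(significant and H0 false)
  false_pos / (false_pos + true_pos)
}

# Same threshold, different assumed priors and power
optimistic  <- prob_h0_given_sig(prior_h0 = 0.5, power = 0.8)
pessimistic <- prob_h0_given_sig(prior_h0 = 0.9, power = 0.5)

round(c(optimistic = optimistic, pessimistic = pessimistic), 3)
```

With these made-up inputs the answer moves from about 6% to about 47% while the significance threshold stays the same, which is why the p-value alone cannot be read as P(H₀ | data).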

2. It's not the probability your result happened by chance

What researchers write

"There's only a 3% probability that the yield difference occurred by chance."

"Occurred by chance" implies that some fraction of your result is "real" and the rest is random noise, and that the p-value quantifies the noise fraction. That's not what it does. The p-value doesn't decompose your result at all. It says: if chance were the only mechanism (i.e., if H₀ were true), the probability of seeing a gap this large is 3%. Chance is already embedded in the null model — the p-value is conditional on it, not a proportion of it.

The phrase "due to chance" also implies you already know the alternative is real. If you knew that, you wouldn't need the test.

Why the mistake is tempting

"Due to chance" is an intuitive shorthand for "consistent with random sampling variation under H₀," but it's loose enough to imply things the number doesn't support. The translation from the technical definition to plain language almost always loses something.

3. A smaller p doesn't mean a bigger or more important effect

What researchers write

"Treatment A (p = 0.001) had a stronger effect than Treatment B (p = 0.04)."

This is the most practically consequential misconception, because it leads directly to ignoring effect size.

A p-value conflates two things: how big the effect is, and how precisely you estimated it. With a large enough sample, a trivially small yield difference (0.02 t/ha, agronomically irrelevant) can produce p < 0.0001. With a small sample, a practically important difference of 0.8 t/ha might produce p = 0.09. For any true effect that is not exactly zero, the p-value tends to shrink as n grows, whether the effect itself is large or small.
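The dependence on sample size can be made concrete with summary numbers alone. This is a sketch under stated assumptions: equal group sizes, a common SD of 0.42 t/ha in both scenarios, and hypothetical trial sizes (20,000 and 3 plots per treatment) chosen only to make the contrast visible. The helper `p_from_summary()` is written for this example.

```r
# Two-sided p-value for a two-sample t-test, from summary numbers only.
# Assumes equal group sizes and a common SD (pooled-variance test).
p_from_summary <- function(mean_diff, sd_pooled, n_per_group) {
  se <- sd_pooled * sqrt(2 / n_per_group)   # standard error of the difference
  t  <- mean_diff / se
  2 * pt(-abs(t), df = 2 * n_per_group - 2)
}

sd_assumed <- 0.42   # t/ha, assumed for both scenarios

# Trivial 0.02 t/ha difference, very large hypothetical trial
p_tiny_effect <- p_from_summary(0.02, sd_assumed, n_per_group = 20000)

# Important 0.8 t/ha difference, only 3 plots per treatment
p_big_effect  <- p_from_summary(0.8, sd_assumed, n_per_group = 3)

c(tiny_effect_large_n = p_tiny_effect, big_effect_small_n = p_big_effect)
```

Under these assumptions the negligible difference comes out far below 0.0001 while the agronomically important one stays above 0.05, so ranking treatments by p-value would rank them backwards.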

What you actually want to know is how large the effect is in real, interpretable units. For the fertilizer comparison: the mean yield difference is 0.52 t/ha. That's the number. It tells you something about whether the treatment is worth applying. The p-value tells you nothing about that — only about whether you have sufficient evidence to reject the null.

Cohen's d is the standard effect size measure for comparing two means. It expresses the difference in pooled standard deviation units, so it's comparable across studies:

Effect size thresholds (Cohen, 1988)

Small: d ≈ 0.2. The groups differ by one-fifth of a standard deviation; the overlap is large and the effect is usually not practically meaningful.

Medium: d ≈ 0.5.

Large: d ≈ 0.8 or above. The groups are meaningfully separated and the effect is likely to matter in practice.

In the paddy yield example, the pooled SD is about 0.42 t/ha. Cohen's d = 0.52 / 0.42 = 1.24, a large effect. A p-value of 0.03 tells you that the result is statistically detectable. A d of 1.24 tells you that nitrogen at 80 kg N/ha shifted mean yield by about 1.24 pooled standard deviations, which is more than the typical plot-to-plot variation within a treatment. These are different pieces of information and you need both.
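The same arithmetic in R, with a rough label attached. The cut points are Cohen's conventional thresholds listed above; the label text and variable names are choices made for this sketch, and the thresholds are conventions rather than agronomic criteria.

```r
# Cohen's d from summary statistics, with Cohen's (1988) rough labels.
mean_diff <- 0.52   # t/ha, from the post
sd_pooled <- 0.42   # t/ha, from the post

d <- mean_diff / sd_pooled

# findInterval() returns 0..3 depending on how many cut points |d| has passed
labels  <- c("below small", "small", "medium", "large")
d_label <- labels[findInterval(abs(d), c(0.2, 0.5, 0.8)) + 1]

cat("Cohen's d =", round(d, 2), "->", d_label, "\n")
```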

Why the mistake is tempting

A more extreme result intuitively feels like a bigger result. And in a single, fixed design — same n, same measurement error — it often is. But the moment you compare across studies with different sample sizes, or across outputs from a single study with different sample sizes per group, p-values are no longer comparable as measures of effect magnitude.

4. p > 0.05 doesn't prove there's no effect

What researchers write

"The treatment had no effect on soil pH (p = 0.12)."

"No effect" means the true difference is zero. But p = 0.12 means only that the data are not inconsistent with H₀ — which is a very different statement. A non-significant result is consistent with at least three situations:

(a) The true effect is zero and you have correctly failed to reject H₀.

(b) There is a real effect, but your sample was too small to detect it reliably: a power problem.

(c) There is a real effect, but it was obscured by measurement error, poor experimental control, or an unhappy draw of random variation.

Without knowing which of these applies, you cannot write "no effect." You can write: "We did not detect a significant effect of treatment on soil pH (p = 0.12). The study may have been underpowered to detect effects smaller than X units." That is honest. It is also much more useful to the reader than the false confidence of "no effect."
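The "X units" in that sentence can be estimated with `power.t.test()` from base R's stats package. This is a rough sketch under invented inputs: the 8 plots per treatment and the SD of 0.3 pH units are hypothetical, since the post gives no numbers for the soil pH example.

```r
# Hypothetical design for the soil pH comparison (not from real data)
n_plots <- 8      # plots per treatment
sd_ph   <- 0.3    # assumed plot-to-plot SD, pH units

# Smallest true difference detectable with 80% power at alpha = 0.05
mde <- power.t.test(n = n_plots, sd = sd_ph, power = 0.80,
                    sig.level = 0.05, type = "two.sample")
mde$delta

# Power to detect a smaller true difference of 0.2 pH units
pw <- power.t.test(n = n_plots, delta = 0.2, sd = sd_ph,
                   sig.level = 0.05, type = "two.sample")
pw$power
```

Under these assumed inputs the design would only reliably detect differences on the order of 0.4 to 0.5 pH units, and would miss a true 0.2 unit difference most of the time. A non-significant p-value from such a design says little about whether a smaller effect exists.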

Why the mistake is tempting

"Non-significant" sounds like "not different," which sounds like "the same." It isn't. Absence of evidence is not evidence of absence — a principle that's easier to state than to internalise when you've been hoping your treatment would work.

5. 0.05 is not a law of nature

What researchers write

"p = 0.049 is significant; p = 0.051 is not significant."

The 0.05 threshold was introduced by R.A. Fisher in the 1920s as a rough guideline for when a result would be worth investigating further in agricultural experiments. It was never intended as a binary gate on scientific truth. In Fisher's framing, a single experiment with p < 0.05 is an indication that something deserves further investigation, not a verdict.

A result at p = 0.049 and a result at p = 0.051 are statistically indistinguishable. The difference in sampling variation required to shift a result from one side of that line to the other is smaller than almost any source of real measurement uncertainty in your experiment. Treating them differently is not science; it's threshold worship.
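To see how thin the line is, the two p-values can be converted back into test statistics and then into yield units. This is a sketch that reuses the fertilizer design as an assumption: 14 degrees of freedom, a pooled SD of 0.42 t/ha and 8 plots per treatment. The variable names are illustrative.

```r
# How different are the test statistics behind p = 0.049 and p = 0.051?
df <- 14                                  # 8 + 8 - 2, as in the fertilizer example

t_049 <- qt(1 - 0.049 / 2, df)            # |t| that gives two-sided p = 0.049
t_051 <- qt(1 - 0.051 / 2, df)            # |t| that gives two-sided p = 0.051

# Translate the gap in t into yield units with an assumed standard error
se_assumed <- 0.42 * sqrt(2 / 8)          # pooled SD 0.42 t/ha, 8 plots per group
gap_t_ha   <- (t_049 - t_051) * se_assumed

c(t_049 = t_049, t_051 = t_051, gap_t_ha = gap_t_ha)
```

Under these assumptions the two results are separated by a few kilograms of grain per hectare in the mean difference, far less than the plot-to-plot variation in the experiment.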

The American Statistical Association's 2016 statement on p-values made this explicit: "The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process." Their guidance was to report p-values as continuous values and to consider them alongside the context of the study, the prior evidence, and — above all — the effect size.

Why the mistake is tempting

Journals and supervisors have reinforced the 0.05 threshold for so long that it feels like a rule rather than a convention. When "significant" determines publication and "non-significant" means rejection, researchers naturally optimise for the line rather than interrogate it.

✦ ✦ ✦

What to Report Instead

The corrective is not to abandon p-values. They carry real information about the probability of your data under H₀, and when combined with the right context, they're useful. The problem is reporting them alone as though they answer the question "does this work?"

Every result sentence in a thesis or manuscript should carry three things together:

1. The effect size in interpretable units. For a t-test: the mean difference (with units), Cohen's d. For ANOVA: η² or partial η². For regression: the coefficient with units. The effect size is what tells you whether the finding is biologically, agronomically, or practically meaningful — something the p-value cannot tell you at all.

2. The 95% confidence interval. A CI gives you a plausible range for the true population effect, based on your data. If the 95% CI for the yield difference is [0.09, 0.95] t/ha, you can see that the effect is positive, that the interval excludes zero (so the data are hard to reconcile with no effect), and that the range of plausible effect sizes is relatively wide. That is useful information even before you look at the p-value. Crucially, the CI is in the same units as the effect, so it's directly interpretable.

3. The p-value, as context, not as a verdict. It tells you how surprising the data are under H₀. Below 0.05 means that, if H₀ were true, you'd see results this extreme less than 5% of the time. Report the actual value (p = 0.03), not "p < 0.05"; the precision matters.

Here is how to extract all three from a t-test in R:

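# t.test() gives you the CI automatically.
# var.equal = FALSE runs Welch's test; its df equals n1 + n2 - 2 only when
# the groups have equal sizes and variances, and is otherwise smaller
# and usually fractional.
result <- t.test(yield ~ treatment, data = soil,
                  var.equal = FALSE)
result

# Effect size: Cohen's d (manual, no extra package needed)
means <- tapply(soil$yield, soil$treatment, mean)
sds   <- tapply(soil$yield, soil$treatment, sd)
ns    <- tapply(soil$yield, soil$treatment, length)

pooled_sd <- sqrt(((ns[1]-1)*sds[1]^2 + (ns[2]-1)*sds[2]^2) /
                   (ns[1] + ns[2] - 2))
cohens_d  <- abs(diff(means)) / pooled_sd

# t.test() reports the CI for (first level - second level), while
# diff(means) is (second level - first level); flip the CI so the signs agree.
ci <- rev(-result$conf.int)

cat("Mean difference: ", round(diff(means), 3), "t/ha\n")
cat("95% CI:         ", round(ci, 3), "t/ha\n")
cat("Cohen's d:       ", round(cohens_d, 3), "\n")
cat("p-value:         ", round(result$p.value, 3), "\n")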

Output

Mean difference:  0.52 t/ha
95% CI:          0.09  0.95 t/ha
Cohen's d:        1.24
p-value:          0.028
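To try the snippet without your own data, a stand-in data frame can be simulated. Everything here is hypothetical: the treatment labels, the seed, and the use of normal draws around the group means quoted in the post. Because the values are random draws, the results will not reproduce the output shown above.

```r
# Hypothetical stand-in for the `soil` data frame used in the snippet.
set.seed(1)   # arbitrary seed so the draws are repeatable
soil <- data.frame(
  treatment = factor(rep(c("control", "urea"), each = 8)),
  yield     = c(rnorm(8, mean = 3.12, sd = 0.42),   # control plots, t/ha
                rnorm(8, mean = 3.64, sd = 0.42))   # urea plots, t/ha
)

str(soil)
table(soil$treatment)
```

Run this first, then the t-test snippet; the column names `yield` and `treatment` match the ones the snippet expects.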

Now you can write a complete result sentence — one that actually communicates what you found:

The correct result sentence

"Urea application at 80 kg N/ha increased mean paddy grain yield by 0.52 t/ha compared to the unfertilized control (95% CI: 0.09–0.95 t/ha; Cohen's d = 1.24; t(14) = 2.57, p = 0.028)."

That sentence tells the reader what happened (a 0.52 t/ha increase), how certain you are (the CI doesn't contain zero), how large the effect is relative to variability (d = 1.24, large), and what the p-value was. The reader can judge significance by their own threshold — and can see that the effect is likely real and agronomically meaningful, regardless of where 0.05 lands.

The habit change is simple. Before you write any result sentence, ask two questions: what is the effect, and how certain am I? Then write that. The p-value goes at the end, in parentheses, as one piece of evidence — not as the answer.

Every test in the rest of this blog's data analysis series — ANOVA, Tukey HSD, linear regression, PCA — produces effect sizes and confidence intervals alongside p-values. This post is the reason why.

Sajjadur Rahman

MSc Researcher · Data Analyst · University of Dhaka

NST Fellow running experiments in soil and environmental science, using R and SPSS for analysis across both my own research and consulting work. I help graduate students understand what their output actually means — and write it correctly.
