If you have run a one-way ANOVA and it returned a significant p-value, you have answered one question: at least one group mean differs from the rest. You have not answered the question most researchers actually want answered: which groups differ, and by how much? Tukey HSD is the test that answers that.
This post explains what Tukey HSD is, the specific problem it was designed to solve, when it is the right choice and when it is not, and how to read and report its output correctly. It uses a different dataset from the ANOVA tutorial on this site so the two posts work as independent references.
1. The Multiple Comparisons Problem
Suppose you have five treatment groups in a soil amendment experiment. After a significant ANOVA, you want to compare every possible pair of group means. With five groups, that is ten pairs: T1 vs T2, T1 vs T3, T1 vs T4, T1 vs T5, T2 vs T3, T2 vs T4, T2 vs T5, T3 vs T4, T3 vs T5, and T4 vs T5.
If you ran a separate t-test for each of those ten pairs at α = 0.05, each test on its own has a 5 percent chance of a false positive when no real difference exists. The chance of at least one false positive somewhere across the ten tests is much higher. Treating the tests as independent, it is 1 − 0.95^10 ≈ 0.40. Run ten tests on groups that do not actually differ, and there is roughly a 40 percent chance that at least one of them comes back "significant."
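| Pairwise comparisons | Chance of at least one false positive |
|---|---|
| 3 pairs | 14% |
| 6 pairs | 26% |
| 10 pairs | 40% |
| 21 pairs | 66% |
| 45 pairs | 90% |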
Probability of at least one false positive when running all pairwise t-tests at α = 0.05. Tukey HSD holds this rate at 5% across all comparisons simultaneously.
This is called family-wise error rate (FWER) inflation. The more comparisons you make, the more likely you are to flag a difference as significant when none exists. Tukey HSD controls this. It computes a single critical value, the Honestly Significant Difference, that holds the FWER at α = 0.05 across all pairwise comparisons simultaneously, regardless of how many groups are in the design.
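The percentages above can be reproduced in a few lines of R. This is a minimal sketch that assumes the tests are independent, which pairwise t-tests sharing the same group means are not exactly, so the figures are best read as approximations. The names `k`, `n_pairs`, and `fwer` are illustrative and do not come from any package.

```r
# Illustrative group counts and the number of pairwise comparisons each implies
k <- c(3, 4, 5, 7, 10)
n_pairs <- choose(k, 2)

# Chance of at least one false positive across m independent tests at level alpha
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

data.frame(groups = k,
           pairs = n_pairs,
           fwer_percent = round(100 * fwer(n_pairs)))
```

The number of pairs grows as k(k − 1)/2, which is why the inflation becomes severe so quickly as treatments are added.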
2. What Tukey HSD Calculates
The HSD is the minimum difference between two group means required for the comparison to be statistically significant at your chosen alpha level, accounting for the total number of comparisons being made simultaneously.
Once HSD is computed, every pairwise difference is compared against it. If the absolute difference between two group means exceeds HSD, those groups are significantly different. The single HSD value acts as a universal threshold for the entire family of comparisons, which is how it controls the family-wise error rate.
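For a balanced design, the threshold is usually written as shown here. The notation is an assumption chosen to match the worked example in this post: q is the critical value of the studentized range distribution for k groups and the error degrees of freedom, MSE is the mean square error from the ANOVA, and n is the number of replicates per group.

```latex
\mathrm{HSD} = q_{\alpha,\,k,\,df_{\mathrm{error}}} \sqrt{\frac{MSE}{n}},
\qquad
|\bar{y}_i - \bar{y}_j| > \mathrm{HSD}
\;\Rightarrow\; \text{groups } i \text{ and } j \text{ differ at level } \alpha
```

Because q grows with k, adding treatments raises the threshold, while adding replicates lowers it through both n and the error degrees of freedom.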
3. When to Use It, and When Not To
Tukey HSD is the right choice for most agricultural experiments. But several alternatives exist for specific situations, and choosing the wrong one will draw questions from reviewers.
| Test | Best for | Conservatism | Use? |
|---|---|---|---|
| Tukey HSD | All pairwise comparisons, equal or near-equal sample sizes. Most agricultural CRD and RCBD designs. | Moderate | Recommended |
| Tukey-Kramer | All pairwise comparisons with unequal sample sizes. Adjustment applied automatically by R and SPSS. | Moderate | Use when n differs |
| Bonferroni | Small number of pre-specified comparisons (3 to 5). More powerful than Tukey when comparisons are few and planned. | High | For few, pre-planned pairs |
| Dunnett's | Comparing all treatments to a single control only. Do not use if you also want treatment-to-treatment comparisons. | Moderate | Control-only comparisons |
| Duncan's MRT | Appears in older agricultural literature. Less conservative than Tukey; harder to defend methodologically to modern reviewers. | Low | Avoid in new manuscripts |
| Fisher's LSD | Only valid as a protected test after significant ANOVA, and only with exactly three groups. With more groups it inflates Type I error. | Very low | 3 groups only |
In practice: if your experiment has three or more treatment groups, all pairwise comparisons are of interest, and sample sizes are equal or close to equal, use Tukey HSD. If your only question is "which treatments beat the control?", use Dunnett's. In R, Tukey HSD is available as TukeyHSD() in base R and as HSD.test() in the agricolae package; Dunnett's test is available through the multcomp package.
4. A Worked Example
The dataset below comes from a CRD pot experiment testing the effect of five organic amendment treatments on soil organic carbon (%) in Bangladeshi paddy soil, with three replications per treatment.
| Treatment | Rep 1 | Rep 2 | Rep 3 | Mean (%) |
|---|---|---|---|---|
| T1 — Control (no amendment) | 0.91 | 0.88 | 0.94 | 0.91 |
| T2 — Compost 2 t/ha | 1.24 | 1.18 | 1.21 | 1.21 |
| T3 — Biochar 2 t/ha | 1.38 | 1.42 | 1.35 | 1.38 |
| T4 — Compost + Biochar | 1.67 | 1.71 | 1.64 | 1.67 |
| T5 — Cow dung 5 t/ha | 1.15 | 1.12 | 1.19 | 1.15 |
The one-way ANOVA returns F(4,10) = 219.7, p < 0.001, so at least one group mean differs significantly. The Tukey HSD calculation follows:
MSE from the ANOVA = 0.0011 · n per group = 3 · k = 5 groups · df_error = 10
Critical q at α = 0.05: q(0.05, 5, 10) = 4.65
HSD = 4.65 × √(0.0011 / 3) = 4.65 × 0.0191 = 0.089
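The hand calculation can be checked in base R. This is a sketch rather than the only way to do it: `qtukey()` supplies the critical value of the studentized range distribution, and the short treatment labels `T1` to `T5` are a simplification introduced here.

```r
# Soil organic carbon (%), three replicates per treatment
soc <- c(0.91, 0.88, 0.94,   # T1 Control
         1.24, 1.18, 1.21,   # T2 Compost
         1.38, 1.42, 1.35,   # T3 Biochar
         1.67, 1.71, 1.64,   # T4 Compost + Biochar
         1.15, 1.12, 1.19)   # T5 Cow dung
trt <- factor(rep(c("T1", "T2", "T3", "T4", "T5"), each = 3))
model <- aov(soc ~ trt)

k <- nlevels(trt)                              # number of groups
n <- 3                                         # replicates per group
df_error <- df.residual(model)                 # error degrees of freedom
mse <- sum(residuals(model)^2) / df_error      # mean square error

q_crit <- qtukey(0.95, nmeans = k, df = df_error)  # critical q at alpha = 0.05
hsd <- q_crit * sqrt(mse / n)

round(c(MSE = mse, q = q_crit, HSD = hsd), 4)
```

The values should agree with the hand calculation above to the precision shown there.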
Any two group means differing by more than 0.089 are significantly different at p < 0.05. The ten pairwise comparisons:
| Pair | Difference | vs HSD (0.089) | Significant? |
|---|---|---|---|
| T4 vs T3 | 0.290 | > 0.089 | YES |
| T4 vs T2 | 0.463 | > 0.089 | YES |
| T4 vs T5 | 0.520 | > 0.089 | YES |
| T4 vs T1 | 0.763 | > 0.089 | YES |
| T3 vs T2 | 0.173 | > 0.089 | YES |
| T3 vs T5 | 0.230 | > 0.089 | YES |
| T3 vs T1 | 0.473 | > 0.089 | YES |
| T2 vs T5 | 0.057 | < 0.089 | NO |
| T2 vs T1 | 0.300 | > 0.089 | YES |
| T5 vs T1 | 0.243 | > 0.089 | YES |
Nine of ten pairs are significantly different. T2 (Compost alone) and T5 (Cow dung) are not — their difference of 0.057 falls below the HSD of 0.089.
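The same ten comparisons can be obtained with adjusted p-values from base R's `TukeyHSD()`. This is a minimal sketch; the short labels `T1` to `T5` and the summary data frame `pair_table` are illustrative choices, not part of the original analysis.

```r
soc <- c(0.91, 0.88, 0.94,   # T1 Control
         1.24, 1.18, 1.21,   # T2 Compost
         1.38, 1.42, 1.35,   # T3 Biochar
         1.67, 1.71, 1.64,   # T4 Compost + Biochar
         1.15, 1.12, 1.19)   # T5 Cow dung
trt <- factor(rep(c("T1", "T2", "T3", "T4", "T5"), each = 3))
model <- aov(soc ~ trt)

# All pairwise comparisons with Tukey-adjusted p-values and 95% intervals
tuk <- TukeyHSD(model, "trt", conf.level = 0.95)

p_adj <- tuk$trt[, "p adj"]
pair_table <- data.frame(
  pair        = rownames(tuk$trt),
  diff        = round(tuk$trt[, "diff"], 3),
  p_adj       = signif(p_adj, 3),
  significant = p_adj < 0.05,
  row.names   = NULL
)
pair_table
```

Differences are reported as the later factor level minus the earlier one, so some carry a negative sign; the comparison against HSD uses their absolute value.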
5. Reading the Compact Letter Display
Agricultural journals report Tukey results as a compact letter display (CLD). Groups sharing a letter are not significantly different. Groups not sharing a letter are. Based on the ten comparisons above:
| Treatment | Mean SOC (%) | CLD |
|---|---|---|
| T4 — Compost + Biochar | 1.67 | a |
| T3 — Biochar 2 t/ha | 1.38 | b |
| T2 — Compost 2 t/ha | 1.21 | c |
| T5 — Cow dung 5 t/ha | 1.15 | c |
| T1 — Control | 0.91 | d |
T4 carries only "a": it is significantly higher than every other treatment. T3 carries "b": it differs from T4, T2, T5, and T1. T2 and T5 both carry "c": they are not significantly different from each other, but both differ from T4, T3, and T1. T1 carries "d": it is significantly lower than all amended treatments.
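The reading rule can be checked mechanically: two treatments should share a letter exactly when their means differ by no more than the HSD. The sketch below is a base R illustration, with means and letters typed in from the tables above; `shares_letter` is a hypothetical helper, not a package function.

```r
# Means (unrounded to 4 places) and letters from the worked example
means <- c(T4 = 1.6733, T3 = 1.3833, T2 = 1.2100, T5 = 1.1533, T1 = 0.9100)
cld   <- c(T4 = "a", T3 = "b", T2 = "c", T5 = "c", T1 = "d")
hsd   <- 0.089

# TRUE when two letter strings (e.g. "ab" and "bc") have a letter in common
shares_letter <- function(a, b) {
  length(intersect(strsplit(a, "")[[1]], strsplit(b, "")[[1]])) > 0
}

pairs <- t(combn(names(means), 2))   # all 10 treatment pairs
check <- data.frame(
  pair  = paste(pairs[, 1], "vs", pairs[, 2]),
  diff  = unname(abs(means[pairs[, 1]] - means[pairs[, 2]])),
  share = unname(mapply(function(i, j) shares_letter(cld[[i]], cld[[j]]),
                        pairs[, 1], pairs[, 2]))
)

# A letter assignment is consistent when sharing matches "difference <= HSD"
check$consistent <- check$share == (check$diff <= hsd)
check
```

Splitting the letter strings into single characters means the same helper also handles multi-letter labels such as "ab".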
Two things about the CLD that frequently confuse readers. Sharing a letter means "not significantly different." It does not mean equal. And a treatment can carry multiple letters (like "ab") in larger designs, meaning it falls between two distinct groups.
On shared letters and statistical power: T2 and T5 share "c" because their difference (0.057) is smaller than HSD (0.089). This has two possible interpretations. The treatments may genuinely be similar in their effect on SOC. Alternatively, the experiment may not have had enough power to detect a real difference at this sample size and variance. Three replications is common in pot studies but rarely achieves high power. The CLD cannot tell you which interpretation is correct — only larger experiments can.
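To see how replication affects the threshold, the HSD can be recomputed for larger numbers of replicates. This is a rough sketch under a strong assumption: that the MSE stays at 0.0011, the value from this experiment, which a real follow-up study would not guarantee. The variable names are illustrative.

```r
mse <- 0.0011            # ASSUMED constant; taken from the worked example
k   <- 5                 # number of treatments
n   <- 3:10              # candidate replicates per treatment
df_error <- k * (n - 1)  # error df for a balanced CRD

# HSD for each replication level
hsd <- sapply(seq_along(n), function(i) {
  qtukey(0.95, nmeans = k, df = df_error[i]) * sqrt(mse / n[i])
})

data.frame(reps = n,
           df_error = df_error,
           HSD = round(hsd, 3),
           detects_diff = hsd < 0.057)  # would the T2 vs T5 gap exceed HSD?
```

The threshold falls as replicates are added because both the critical q and the standard error term shrink. Whether the T2 and T5 difference would actually persist at 0.057 in a larger experiment is a separate question this calculation cannot answer.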
Multiple letters in larger designs: In a seven or eight treatment design, a treatment may carry letters "bc," meaning it does not differ significantly from groups holding "b" or "c," even if "b" and "c" groups differ from each other. This happens when a treatment falls between two statistically distinct clusters. The HSD.test() function in R's agricolae package assigns these multi-letter groups automatically.
6. Running It in R
The complete R code for the dataset above, from data entry through Tukey HSD with CLD:
```r
# Soil organic carbon: 5 treatments, 3 reps, CRD
library(car)        # leveneTest()
library(agricolae)  # HSD.test()

soc <- c(0.91, 0.88, 0.94,   # T1 Control
         1.24, 1.18, 1.21,   # T2 Compost
         1.38, 1.42, 1.35,   # T3 Biochar
         1.67, 1.71, 1.64,   # T4 Compost+Biochar
         1.15, 1.12, 1.19)   # T5 Cow dung
trt <- factor(rep(c("T1_Control", "T2_Compost", "T3_Biochar",
                    "T4_Compost_Biochar", "T5_Cowdung"), each = 3))
df <- data.frame(trt, soc)

# ANOVA
model <- aov(soc ~ trt, data = df)
summary(model)

# Assumption checks
shapiro.test(residuals(model))
leveneTest(soc ~ trt, data = df)

# Tukey HSD + compact letter display
result <- HSD.test(model, "trt", group = TRUE, console = TRUE)

# The HSD threshold value
print(result$statistics$MSD)  # MSD = Minimum Significant Difference = HSD
```
7. How to Report in a Manuscript
The reporting convention in agricultural journals is to state the F-statistic with degrees of freedom, the p-value, the post-hoc test used, the alpha level, and then describe the pattern using CLD logic. The CLD column goes in the results table alongside means and standard errors.
"Soil organic carbon differed significantly across amendment treatments (F(4,10) = 219.7, p < 0.001). Tukey HSD post-hoc comparison (α = 0.05) showed that the combined Compost + Biochar treatment (T4) had the highest SOC at 1.67%, significantly exceeding all other treatments. Biochar alone (T3, 1.38%) ranked second and differed significantly from all others. Compost alone (T2, 1.21%) and Cow dung (T5, 1.15%) did not differ significantly from each other, but both were significantly higher than the unamended control (T1, 0.91%). Mean values followed by different letters in Table 2 are significantly different at p < 0.05 (Tukey HSD)."
The sentence "Mean values followed by different letters are significantly different" must appear somewhere in the table note or main text the first time a CLD table is presented. Many manuscripts omit this explanation, and reviewers flag the omission.
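A results table with means, standard errors, and the CLD column can be assembled in base R. This is one possible layout, not a journal requirement. Two assumptions are made here: the standard error is computed per treatment as sd/√n (some journals prefer a pooled SE from the MSE), and the letters are typed in from the CLD table earlier in the post rather than computed.

```r
soc <- c(0.91, 0.88, 0.94,   # T1 Control
         1.24, 1.18, 1.21,   # T2 Compost
         1.38, 1.42, 1.35,   # T3 Biochar
         1.67, 1.71, 1.64,   # T4 Compost + Biochar
         1.15, 1.12, 1.19)   # T5 Cow dung
trt <- factor(rep(c("T1", "T2", "T3", "T4", "T5"), each = 3))

# Per-treatment mean and standard error (ASSUMPTION: SE = sd / sqrt(n))
means <- tapply(soc, trt, mean)
se    <- tapply(soc, trt, function(x) sd(x) / sqrt(length(x)))

# Letters copied from the CLD table, not computed here
cld <- c(T1 = "d", T2 = "c", T3 = "b", T4 = "a", T5 = "c")

tab <- data.frame(
  treatment = names(means),
  mean      = round(as.numeric(means), 2),
  se        = round(as.numeric(se), 3),
  cld       = unname(cld[names(means)]),
  row.names = NULL
)

# Sort from highest to lowest mean, the usual order for CLD tables
tab <- tab[order(-tab$mean), ]
tab
```

Whichever SE convention is used, the table note should state it alongside the explanation of the letters.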
Three Mistakes to Avoid
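Running Tukey HSD without a significant ANOVA first. Agricultural journals conventionally treat Tukey HSD as a follow-up to a significant omnibus ANOVA, and reviewers will flag pairwise results reported without one. Strictly speaking, Tukey HSD controls the family-wise error rate on its own and does not depend on the F-test for protection; the "protected" label belongs to Fisher's LSD. Even so, when the ANOVA is not significant, pairwise comparisons rarely turn up anything and are hard to defend in a manuscript, so run and report the ANOVA first.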
Treating shared letters as proof of equality. T2 and T5 sharing the letter "c" does not prove their SOC values are equal. It means the difference between them (0.057%) was smaller than the HSD threshold (0.089%) at the achieved sample size and variance. With more replications, the HSD would decrease and the same difference might become significant. Low statistical power is a common reason for shared letters between treatments that a researcher believes should separate.
Using Tukey when Dunnett's would be more powerful. If your research question is specifically "which treatments outperform the control?" and you have no interest in treatment-to-treatment comparisons, Dunnett's test makes fewer comparisons and can detect smaller differences against the control than Tukey can. Using Tukey in this situation wastes statistical power. Dunnett's is available in R via glht(model, mcp(trt = "Dunnett")) from the multcomp package.
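A fuller version of the Dunnett call mentioned above might look like this. It is a sketch that assumes the multcomp package is installed; `glht()` and `mcp()` come from that package, and the control must be the first factor level, which holds here because `T1_Control` sorts first.

```r
library(multcomp)  # glht(), mcp()

soc <- c(0.91, 0.88, 0.94,   # T1 Control
         1.24, 1.18, 1.21,   # T2 Compost
         1.38, 1.42, 1.35,   # T3 Biochar
         1.67, 1.71, 1.64,   # T4 Compost+Biochar
         1.15, 1.12, 1.19)   # T5 Cow dung
trt <- factor(rep(c("T1_Control", "T2_Compost", "T3_Biochar",
                    "T4_Compost_Biochar", "T5_Cowdung"), each = 3))
df <- data.frame(trt, soc)
model <- aov(soc ~ trt, data = df)

# Dunnett contrasts: each treatment against the first level (the control)
dun <- glht(model, linfct = mcp(trt = "Dunnett"))
dun_summary <- summary(dun)   # adjusted p-values for the 4 comparisons
dun_summary
```

Only four comparisons are made instead of ten, which is where the gain in power over Tukey HSD comes from. If the control is not the first level, `relevel()` can be used to move it there before fitting the model.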
Sajjadur Rahman
MSc Researcher · Data Analyst · Developer of SPADE · University of Dhaka
NST Fellow and researcher who runs ANOVA and Tukey HSD across multiple active experiments. Developer of SPADE, an open-source Python/Streamlit platform that automates NUE computation, factorial ANOVA with Tukey HSD, compact letter display generation, and publication-ready figure export. Available for data analysis, manuscript editing, and thesis consultation.