
Tukey HSD Explained: When to Use It and How to Interpret the Results

ANOVA tells you at least one group mean differs. Tukey HSD tells you which ones. A complete guide to the test, the formula, all the alternatives, a fully worked five-treatment example, and how to read the compact letter display correctly.

If you have run a one-way ANOVA and it returned a significant p-value, you have answered one question: at least one group mean differs from the rest. You have not answered the question most researchers actually want answered: which groups differ, and by how much? Tukey HSD is the test that answers that.

This post explains what Tukey HSD is, the specific problem it was designed to solve, when it is the right choice and when it is not, and how to read and report its output correctly. It uses a different dataset from the ANOVA tutorial on this site so the two posts work as independent references.

1. The Multiple Comparisons Problem

Suppose you have five treatment groups in a soil amendment experiment. After a significant ANOVA, you want to compare every possible pair of group means. With five groups, that is ten pairs: T1 vs T2, T1 vs T3, T1 vs T4, T1 vs T5, T2 vs T3, T2 vs T4, T2 vs T5, T3 vs T4, T3 vs T5, and T4 vs T5.

If you ran a separate t-test for each of those ten pairs at α = 0.05, the chance of a false positive in any individual test is 5 percent. But the chance of getting at least one false positive somewhere across ten tests is much higher. Treating the tests as independent, it is 1 minus 0.95 to the power of 10, which equals roughly 40 percent. In other words, even if no treatments truly differ, ten uncorrected tests carry about a 40 percent chance that at least one "significant" result is wrong.

Family-Wise Error Rate: How It Inflates with More Groups

| Groups | Pairwise comparisons | Chance of at least one false positive |
|---|---|---|
| 3 | 3 | 14% |
| 4 | 6 | 26% |
| 5 | 10 | 40% |
| 7 | 21 | 66% |
| 10 | 45 | 90% |

Probability of at least one false positive when running all pairwise t-tests at α = 0.05. Tukey HSD holds this rate at 5% across all comparisons simultaneously.

This is called family-wise error rate (FWER) inflation. The more comparisons you make, the more likely you are to flag a difference as significant when none exists. Tukey HSD controls this. It computes a single critical value, the Honestly Significant Difference, that holds the FWER at α = 0.05 across all pairwise comparisons simultaneously, regardless of how many groups are in the design.
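The inflation figures can be reproduced with a few lines of base R. This is a minimal sketch that assumes the pairwise tests are independent, which is only an approximation because tests that share a group are correlated. The function name `fwer_table` is made up for illustration.

```r
# Family-wise error rate for all pairwise t-tests at a given alpha,
# assuming independent tests (an approximation).
fwer_table <- function(k, alpha = 0.05) {
  pairs <- choose(k, 2)                 # number of pairwise comparisons
  fwer  <- 1 - (1 - alpha)^pairs        # P(at least one false positive)
  data.frame(groups = k, pairs = pairs, fwer_pct = round(100 * fwer))
}

fwer_by_groups <- fwer_table(c(3, 4, 5, 7, 10))
print(fwer_by_groups)
```

The `fwer_pct` column should match the percentages in the table above.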

2. What Tukey HSD Calculates

The HSD is the minimum difference between two group means required for the comparison to be statistically significant at your chosen alpha level, accounting for the total number of comparisons being made simultaneously.

The Tukey HSD Formula (Honestly Significant Difference)

$$\mathrm{HSD} = q(\alpha,\, k,\, df_{\text{error}}) \times \sqrt{\frac{\mathrm{MSE}}{n}}$$

Compare every pairwise difference |x̄ᵢ − x̄ⱼ| against HSD. If it exceeds HSD, the pair is significantly different at α.

q: Critical value from the studentized range distribution, looked up using α, k, and df_error. This is not the same as the t-distribution; it accounts for the range of all group means at once.
k: Number of treatment groups in the design.
df_error: Error degrees of freedom from the ANOVA table. Equal to N − k, where N is total observations.
MSE: Mean Square Error from the ANOVA table. This is the within-group variance, measuring how much individual observations vary around their group means.
n: Number of observations per group. Assumes equal sample sizes. For unequal sizes, R uses the Tukey-Kramer adjustment automatically.

Once HSD is computed, every pairwise difference is compared against it. If the absolute difference between two group means exceeds HSD, those groups are significantly different. The single HSD value acts as a universal threshold for the entire family of comparisons, which is how it controls the family-wise error rate.
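As a rough sketch, the threshold can be computed in base R with `qtukey()`, the quantile function of the studentized range distribution. The helper name `tukey_threshold` is hypothetical. With equal group sizes it reduces to the HSD formula above. With unequal sizes it uses the Tukey-Kramer form, q × √(MSE/2 × (1/nᵢ + 1/nⱼ)), which gives a separate threshold for each pair.

```r
# Hypothetical helper: critical difference for one pair of group means.
# Equal n_i and n_j give the HSD; unequal sizes give the Tukey-Kramer threshold.
tukey_threshold <- function(mse, n_i, n_j = n_i, k, df_error, alpha = 0.05) {
  q_crit <- qtukey(1 - alpha, nmeans = k, df = df_error)  # studentized range quantile
  q_crit * sqrt(mse / 2 * (1 / n_i + 1 / n_j))
}

# Illustrative inputs: the values used in the worked example in this post
hsd <- tukey_threshold(mse = 0.0011, n_i = 3, k = 5, df_error = 10)
print(round(hsd, 3))
```

Using `qtukey()` avoids looking up q in a printed table and works for any combination of k and df_error.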

3. When to Use It, and When Not To

Tukey HSD is the right choice for most agricultural experiments. But several alternatives exist for specific situations, and choosing the wrong one will draw questions from reviewers.

| Test | Best for | Conservatism | Use? |
|---|---|---|---|
| Tukey HSD | All pairwise comparisons, equal or near-equal sample sizes. Most agricultural CRD and RCBD designs. | Moderate | Recommended |
| Tukey-Kramer | All pairwise comparisons with unequal sample sizes. Adjustment applied automatically by R and SPSS. | Moderate | Use when n differs |
| Bonferroni | Small number of pre-specified comparisons (3 to 5). More powerful than Tukey when comparisons are few and planned. | High | For few, pre-planned pairs |
| Dunnett's | Comparing all treatments to a single control only. Do not use if you also want treatment-to-treatment comparisons. | Moderate | Control-only comparisons |
| Duncan's MRT | Appears in older agricultural literature. Less conservative than Tukey; harder to defend methodologically to modern reviewers. | Low | Avoid in new manuscripts |
| Fisher's LSD | Only valid as a protected test after significant ANOVA, and only with exactly three groups. With more groups it inflates Type I error. | Very low | 3 groups only |

In practice: if your experiment has three or more treatment groups, all pairwise comparisons are of interest, and sample sizes are equal or close to equal, use Tukey HSD. If your only question is "which treatments beat the control?", use Dunnett's. In R, Tukey HSD is available through TukeyHSD() in base R and HSD.test() in the agricolae package; Dunnett's test is available through the multcomp package.
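For the Bonferroni row of the table, a small pre-planned set of comparisons can be handled in base R roughly as follows. The sketch uses the soil organic carbon data from the worked example in this post. The three "planned" pairs are a made-up selection for illustration, not comparisons the experiment actually pre-registered.

```r
# Soil organic carbon (%), 5 treatments x 3 reps (data from the worked example)
soc <- c(0.91, 0.88, 0.94,  1.24, 1.18, 1.21,  1.38, 1.42, 1.35,
         1.67, 1.71, 1.64,  1.15, 1.12, 1.19)
trt <- factor(rep(c("T1", "T2", "T3", "T4", "T5"), each = 3))

# Unadjusted pairwise t-tests using the pooled (ANOVA) error variance.
# The result is a lower-triangular matrix: rows T2..T5, columns T1..T4.
raw <- pairwise.t.test(soc, trt, p.adjust.method = "none")$p.value

# Hypothetical set of three comparisons chosen before seeing the data
planned <- c("T4 vs T3" = raw["T4", "T3"],
             "T4 vs T2" = raw["T4", "T2"],
             "T5 vs T2" = raw["T5", "T2"])

# Bonferroni: multiply each p-value by the number of planned tests (capped at 1)
adjusted <- p.adjust(planned, method = "bonferroni")
print(adjusted)
```

Because only three tests are corrected for instead of ten, the penalty on each p-value is smaller than it would be for the full family of pairs.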

— ❧ —

4. A Worked Example

The dataset below comes from a CRD pot experiment testing the effect of five organic amendment treatments on soil organic carbon (%) in Bangladeshi paddy soil, with three replications per treatment.

| Treatment | Rep 1 | Rep 2 | Rep 3 | Mean (%) |
|---|---|---|---|---|
| T1: Control (no amendment) | 0.91 | 0.88 | 0.94 | 0.91 |
| T2: Compost 2 t/ha | 1.24 | 1.18 | 1.21 | 1.21 |
| T3: Biochar 2 t/ha | 1.38 | 1.42 | 1.35 | 1.38 |
| T4: Compost + Biochar | 1.67 | 1.71 | 1.64 | 1.67 |
| T5: Cow dung 5 t/ha | 1.15 | 1.12 | 1.19 | 1.15 |

The one-way ANOVA returns F(4,10) = 219.5, p < 0.001. At least one group mean differs significantly. Now computing Tukey HSD:

MSE from the ANOVA = 0.0011 · n per group = 3 · k = 5 groups · df_error = 10
Critical q at α = 0.05: q(0.05, 5, 10) = 4.65
HSD = 4.65 × √(0.0011 / 3) = 4.65 × 0.0191 = 0.089

Any two group means differing by more than 0.089 are significantly different at p < 0.05. The ten pairwise comparisons:

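| Pair | Difference | vs HSD (0.089) | Significant? |
|---|---|---|---|
| T4 vs T3 | 0.290 | > 0.089 | YES |
| T4 vs T2 | 0.463 | > 0.089 | YES |
| T4 vs T5 | 0.520 | > 0.089 | YES |
| T4 vs T1 | 0.763 | > 0.089 | YES |
| T3 vs T2 | 0.173 | > 0.089 | YES |
| T3 vs T5 | 0.230 | > 0.089 | YES |
| T3 vs T1 | 0.473 | > 0.089 | YES |
| T2 vs T5 | 0.057 | < 0.089 | NO |
| T2 vs T1 | 0.300 | > 0.089 | YES |
| T5 vs T1 | 0.243 | > 0.089 | YES |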

Nine of ten pairs are significantly different. T2 (Compost alone) and T5 (Cow dung) are not — their difference of 0.057 falls below the HSD of 0.089.
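The hand calculation can be checked in base R. The sketch below recomputes the group means, MSE, and HSD from the raw data and compares every pair against the threshold. Treatment labels are shortened to T1 through T5, and the object names are illustrative.

```r
# Soil organic carbon (%), 5 treatments x 3 reps
soc <- c(0.91, 0.88, 0.94,  1.24, 1.18, 1.21,  1.38, 1.42, 1.35,
         1.67, 1.71, 1.64,  1.15, 1.12, 1.19)
trt <- factor(rep(paste0("T", 1:5), each = 3))

means    <- sapply(split(soc, trt), mean)        # named vector of group means
k        <- nlevels(trt)
n        <- 3                                    # replications per treatment
df_error <- length(soc) - k
mse      <- sum((soc - ave(soc, trt))^2) / df_error   # within-group variance
hsd      <- qtukey(0.95, nmeans = k, df = df_error) * sqrt(mse / n)

# All 10 pairs, their absolute differences, and the comparison against HSD
pairs <- t(combn(names(means), 2))
diffs <- unname(abs(means[pairs[, 1]] - means[pairs[, 2]]))
comparisons <- data.frame(pair        = paste(pairs[, 1], "vs", pairs[, 2]),
                          difference  = round(diffs, 3),
                          significant = diffs > hsd)
print(comparisons)
```

The rows appear in `combn()` order rather than sorted by mean, but the differences and verdicts should agree with the table above.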

5. Reading the Compact Letter Display

Agricultural journals report Tukey results as a compact letter display (CLD). Groups sharing a letter are not significantly different. Groups not sharing a letter are. Based on the ten comparisons above:

| Treatment | Mean SOC (%) | CLD |
|---|---|---|
| T4: Compost + Biochar | 1.67 | a |
| T3: Biochar 2 t/ha | 1.38 | b |
| T2: Compost 2 t/ha | 1.21 | c |
| T5: Cow dung 5 t/ha | 1.15 | c |
| T1: Control | 0.91 | d |

T4 carries only "a": it is significantly higher than every other treatment. T3 carries "b": it differs from T4, T2, T5, and T1. T2 and T5 both carry "c": they are not significantly different from each other, but both differ from T4, T3, and T1. T1 carries "d": it is significantly lower than all fertilized treatments.

Two things about the CLD frequently confuse readers. First, sharing a letter means "not significantly different." It does not mean equal. Second, a treatment can carry multiple letters (like "ab") in larger designs, meaning it falls between two distinct groups.

On shared letters and statistical power: T2 and T5 share "c" because their difference (0.057) is smaller than HSD (0.089). This has two possible interpretations. The treatments may genuinely be similar in their effect on SOC. Alternatively, the experiment may not have had enough power to detect a real difference at this sample size and variance. Three replications is common in pot studies but rarely achieves high power. The CLD cannot tell you which interpretation is correct — only larger experiments can.

Multiple letters in larger designs: In a seven or eight treatment design, a treatment may carry letters "bc," meaning it does not differ significantly from groups holding "b" or "c," even if "b" and "c" groups differ from each other. This happens when a treatment falls between two statistically distinct clusters. The HSD.test() function in R's agricolae package assigns these multi-letter groups automatically.
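To make the letter logic concrete, here is a simplified base R sketch of how letters can be assigned when a single HSD threshold applies to every pair (equal replication). It is not the algorithm agricolae uses internally, and the function name `cld_equal_n` is invented for illustration. The idea: sort the means, find each maximal run of means that all lie within HSD of one another, and give each run its own letter.

```r
# Simplified letter assignment for the equal-n case (one common threshold).
cld_equal_n <- function(means, hsd) {
  m <- sort(means, decreasing = TRUE)
  k <- length(m)
  # For each mean, the index of the lowest-ranked mean still within hsd of it
  run_end <- sapply(seq_len(k), function(i) max(which(m[i] - m <= hsd)))
  # A run ending where an earlier run ended is contained in it, so drop it
  keep   <- !duplicated(run_end)
  starts <- which(keep)
  ends   <- run_end[keep]
  out <- character(k)
  for (g in seq_along(starts)) {
    idx <- starts[g]:ends[g]
    out[idx] <- paste0(out[idx], letters[g])    # append this run's letter
  }
  setNames(out, names(m))
}

# Means and HSD from the worked example
soc_letters <- cld_equal_n(c(T1 = 0.91, T2 = 1.21, T3 = 1.3833,
                             T4 = 1.6733, T5 = 1.1533), hsd = 0.089)
print(soc_letters)

# Made-up means showing a multi-letter case: B sits between A and C
demo_letters <- cld_equal_n(c(A = 10, B = 9.5, C = 9.0), hsd = 0.6)
print(demo_letters)
```

In the made-up case, A and C differ by more than the threshold while B is within the threshold of both, so B receives both letters.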

6. Running It in R

The complete R code for the dataset above, from data entry through Tukey HSD with CLD:

R
# Soil organic carbon — 5 treatments, 3 reps, CRD

soc <- c(0.91, 0.88, 0.94,  # T1 Control
        1.24, 1.18, 1.21,  # T2 Compost
        1.38, 1.42, 1.35,  # T3 Biochar
        1.67, 1.71, 1.64,  # T4 Compost+Biochar
        1.15, 1.12, 1.19) # T5 Cow dung

trt <- factor(rep(c("T1_Control", "T2_Compost",
                  "T3_Biochar", "T4_Compost_Biochar",
                  "T5_Cowdung"), each = 3))

df <- data.frame(trt, soc)

# ANOVA
model <- aov(soc ~ trt, data = df)
summary(model)

# Assumption checks
shapiro.test(residuals(model))
library(car); leveneTest(soc ~ trt, data = df)

# Tukey HSD + compact letter display
library(agricolae)
result <- HSD.test(model, "trt", group = TRUE, console = TRUE)

# The HSD threshold value
print(result$statistics$MSD)  # MSD = Minimum Significant Difference = HSD
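If agricolae is not installed, base R's `TukeyHSD()` gives the same pairwise verdicts, along with an adjusted p-value and a simultaneous confidence interval for each difference. That answers the "by how much" part of the question. A minimal sketch, with shortened treatment labels:

```r
# Soil organic carbon (%), 5 treatments x 3 reps
soc <- c(0.91, 0.88, 0.94,  1.24, 1.18, 1.21,  1.38, 1.42, 1.35,
         1.67, 1.71, 1.64,  1.15, 1.12, 1.19)
trt <- factor(rep(paste0("T", 1:5), each = 3))

model <- aov(soc ~ trt)
tukey <- TukeyHSD(model, "trt", conf.level = 0.95)
print(tukey)                      # columns: diff, lwr, upr, p adj

pair_table <- tukey$trt           # matrix with one row per pair, e.g. "T5-T2"
n_significant <- sum(pair_table[, "p adj"] < 0.05)
print(n_significant)
```

`TukeyHSD()` does not produce a compact letter display itself; it reports one row per pair, so the letters still have to come from HSD.test() or a similar tool.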

7. How to Report in a Manuscript

The reporting convention in agricultural journals is to state the F-statistic with degrees of freedom, the p-value, the post-hoc test used, the alpha level, and then describe the pattern using CLD logic. The CLD column goes in the results table alongside means and standard errors.

Example manuscript text

"Soil organic carbon differed significantly across amendment treatments (F(4,10) = 219.5, p < 0.001). Tukey HSD post-hoc comparison (α = 0.05) showed that the combined Compost + Biochar treatment (T4) had the highest SOC at 1.67%, significantly exceeding all other treatments. Biochar alone (T3, 1.38%) ranked second and differed significantly from all others. Compost alone (T2, 1.21%) and Cow dung (T5, 1.15%) did not differ significantly from each other, but both were significantly higher than the unfertilized control (T1, 0.91%). Mean values followed by different letters in Table 2 are significantly different at p < 0.05 (Tukey HSD)."

The sentence "Mean values followed by different letters are significantly different" must appear somewhere in the table note or main text the first time a CLD table is presented. Many manuscripts omit this explanation, and reviewers flag the omission.
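A results table with means, standard errors, and CLD letters can be assembled in base R along these lines. This is only a sketch: the letters are typed in by hand from the CLD table in this post rather than computed, and the standard errors are per-treatment (sd / √n), which is one common convention among several.

```r
# Soil organic carbon (%), 5 treatments x 3 reps
soc <- c(0.91, 0.88, 0.94,  1.24, 1.18, 1.21,  1.38, 1.42, 1.35,
         1.67, 1.71, 1.64,  1.15, 1.12, 1.19)
trt <- factor(rep(paste0("T", 1:5), each = 3))

groups <- split(soc, trt)
report <- data.frame(
  treatment = names(groups),
  mean      = round(sapply(groups, mean), 2),
  se        = round(sapply(groups, function(x) sd(x) / sqrt(length(x))), 3),
  letter    = c("d", "c", "b", "a", "c"),   # T1..T5, copied from the CLD table
  row.names = NULL
)

# Order from highest to lowest mean, as in the CLD table
report <- report[order(-report$mean), ]
print(report)
```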

Three Mistakes to Avoid

Mistake 1

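Running Tukey HSD without a significant ANOVA first. Strictly speaking, Tukey HSD controls the family-wise error rate on its own and does not depend on a significant omnibus F-test; the "protected" requirement belongs to Fisher's LSD. In agricultural journals, however, the convention is to report the ANOVA first and follow up with Tukey only when it is significant. When the ANOVA is not significant, Tukey HSD will rarely separate any pair, and presenting letter groupings anyway without explanation invites reviewer questions.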

Mistake 2

Treating shared letters as proof of equality. T2 and T5 sharing the letter "c" does not prove their SOC values are equal. It means the difference between them (0.057%) was smaller than the HSD threshold (0.089%) at the achieved sample size and variance. With more replications, the HSD would decrease and the same difference might become significant. Low statistical power is a common reason for shared letters between treatments that a researcher believes should separate.
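The effect of replication on the threshold can be sketched in base R. The code assumes a CRD with five treatments and, as a simplification, that the MSE would stay at 0.0011 with more replications. In a real experiment the MSE would change. The function name `hsd_for_reps` is illustrative.

```r
# HSD as a function of replications per treatment, holding MSE fixed (assumption)
hsd_for_reps <- function(n, k = 5, mse = 0.0011, alpha = 0.05) {
  df_error <- k * (n - 1)                       # CRD error degrees of freedom
  qtukey(1 - alpha, nmeans = k, df = df_error) * sqrt(mse / n)
}

reps       <- 3:8
hsd_values <- sapply(reps, hsd_for_reps)
print(data.frame(reps = reps, hsd = round(hsd_values, 3)))
```

The threshold shrinks for two reasons at once: the standard error term √(MSE / n) gets smaller, and the critical q falls as the error degrees of freedom increase.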

Mistake 3

Using Tukey when Dunnett's would be more powerful. If your research question is specifically "which treatments outperform the control?" and you have no interest in treatment-to-treatment comparisons, Dunnett's test makes fewer comparisons and can detect smaller differences against the control than Tukey can. Using Tukey in this situation wastes statistical power. Dunnett's is available in R via glht(model, mcp(trt = "Dunnett")) from the multcomp package.


Sajjadur Rahman

MSc Researcher · Data Analyst · Developer of SPADE · University of Dhaka

NST Fellow and researcher who runs ANOVA and Tukey HSD across multiple active experiments. Developer of SPADE, an open-source Python/Streamlit platform that automates NUE computation, factorial ANOVA with Tukey HSD, compact letter display generation, and publication-ready figure export. Available for data analysis, manuscript editing, and thesis consultation.
