How Many Replicates Do You Actually Need for a Pot Experiment?

The Real Stakes of Getting This Wrong

Before my 24-pot CRD experiment on BARI Gom 33 wheat at DUNTC, the most useful piece of experimental design advice I received was: "use three or four." Three or four replicates. That was it. The reasoning, such as it was, involved tradition, space constraints, and the fact that everyone else in the department also used three or four. A power analysis was not mentioned.

Then a hailstorm hit mid-season and destroyed four pots. You don't get a do-over in a greenhouse experiment. You work with what survives, you run your ANOVA on reduced data, and you spend a long time wondering whether the non-significant result you got means the treatment didn't work or means you simply didn't have enough pots to see that it did.

That uncertainty — between "the effect is not there" and "the effect is there but I couldn't detect it" — is exactly what a power analysis is designed to prevent. It's the difference between planning your experiment and hoping your experiment. The cost of getting it wrong is measured in pots, soil, fertilizer, greenhouse time, months of your degree, and the specific kind of frustration that comes from results you can't trust in either direction.

So. How many replicates do you actually need?

The answer depends on four numbers, and you have to supply all four before the experiment starts — not after.

The Four Levers

Understanding these conceptually before touching any formula will save you from running the calculation without knowing what you're asking.

Lever 1: Effect size (f). The minimum difference between treatment means that you actually care about detecting, expressed as the ratio of the standard deviation of the treatment means to the within-group standard deviation. You choose this. The data can't tell you. ↓ smaller effect → more pots

Lever 2: Variability (σ). How noisy your measurements are within a treatment group, measured as the within-group standard deviation. Estimated from a pilot run, a similar published experiment, or a literature CV for the crop and variable. ↑ noisier data → more pots

Lever 3: Significance level (α). The false positive rate: the probability of concluding there's an effect when there isn't one. The 0.05 convention is arbitrary but standard. Lowering it to 0.01 requires more pots. ↓ stricter α → more pots

Lever 4: Power (1−β). Your probability of detecting a real effect when it exists. The standard floor is 0.80, meaning you accept a 1-in-5 chance of missing a true effect. Raising power to 0.90 requires more pots. ↑ higher power → more pots

The effect size lever (f) deserves the most attention because it's the one researchers most often set badly. Cohen's f — the conventional measure for ANOVA — is the standard deviation of the treatment group means divided by the within-group standard deviation. You calculate it by deciding what means you realistically expect to see and what the background variability is. Both of those come from your brain and your literature search, not from the data you haven't yet collected.
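Written out as a formula, that definition is short. This is a sketch of the standard balanced-design form, assuming k treatments, n pots per treatment, expected treatment means μ_i, and within-group SD σ. The noncentrality parameter λ is a symbol introduced here for illustration; it is the quantity the power of the F-test depends on.

```latex
\bar{\mu} = \frac{1}{k}\sum_{i=1}^{k}\mu_i, \qquad
\sigma_m = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(\mu_i - \bar{\mu}\right)^2}, \qquad
f = \frac{\sigma_m}{\sigma}, \qquad
\lambda = k\,n\,f^{2}
```

Note that σ_m divides by k rather than k − 1, because the expected means are treated as fixed quantities and not as a sample. Since λ grows with n·f², halving f means roughly four times as many pots for the same power.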

Cohen's conventional thresholds are f = 0.10 (small), f = 0.25 (medium), and f = 0.40 (large). In controlled greenhouse and pot conditions, agricultural treatments tend to produce large f values because pot-to-pot variability within a treatment is lower than plot-to-plot variability in the field. This is one reason why pot experiments work at all with only 4–6 replicates. But "tends to produce large effects" is not a power analysis.

The honest calibration question

Before calculating n, ask yourself: if the treatment worked perfectly as expected, what would the group means look like, and what within-group SD would I expect? Write those numbers down. Your f is derived from them. If you can't answer those questions from the literature, the power analysis will tell you nothing useful — and that is information worth having before you plant.

Running the Power Analysis in R

The pwr package handles power calculations for common statistical tests, including one-way ANOVA. Install it once, then use pwr.anova.test() to find any one of the four parameters given the other three.

The running example: a 4-treatment CRD on BARI Gom 33 wheat, testing 0, 40, 80, and 120 kg N/ha as urea. From published studies on similar wheat varieties in Bangladesh, you estimate that grain yield will follow a diminishing-returns pattern with N: means approximately 2.8, 3.5, 3.9, 4.1 t/ha. You also estimate that within any single nitrogen treatment, pot-to-pot variation — due to minor soil differences, pot placement effects, and measurement error — will produce a standard deviation of about 0.55 t/ha. That SD is your estimate of σ.

# Install once, then load
# install.packages("pwr")
library(pwr)

# Step 1: calculate Cohen's f from expected means and within-group SD
k        <- 4                           # number of treatments
mu       <- c(2.8, 3.5, 3.9, 4.1)      # expected group means (t/ha)
sigma    <- 0.55                         # within-group SD (t/ha) from literature

mu_grand <- mean(mu)
sigma_m  <- sqrt(mean((mu - mu_grand)^2))  # SD of the group means
f        <- sigma_m / sigma

cat("Grand mean:  ", round(mu_grand, 2), "t/ha\n")
cat("sigma_m:     ", round(sigma_m, 3), "\n")
cat("Cohen's f:   ", round(f, 3), "\n\n")

# Step 2: find required n per group
pwr.anova.test(k = k, f = f, sig.level = 0.05, power = 0.80)

Output

Grand mean:   3.57 t/ha
sigma_m:      0.497
Cohen's f:    0.903

     Balanced one-way analysis of variance power calculation

              k = 4
              n = 4.449
              f = 0.903
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

Round up to 5 pots per group, giving 20 pots total. With the expected effect size in this scenario (f = 0.90, which is large — a consequence of the strong N response and controlled pot conditions), you need fewer replicates than you might expect. Your design is efficient precisely because you're working in a controlled environment.
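Because pots come in whole numbers, it can help to check what power the rounded design actually delivers. The sketch below uses only base R and assumes the standard noncentral F result for a balanced one-way ANOVA, with noncentrality k·n·f². The helper name anova_power is introduced here for illustration and is not part of pwr.

```r
# Power of a balanced one-way ANOVA, from the noncentral F distribution.
# Assumes equal n per group and a common within-group SD.
anova_power <- function(n, k, f, alpha = 0.05) {
  df1    <- k - 1              # numerator degrees of freedom
  df2    <- k * (n - 1)        # denominator degrees of freedom
  lambda <- k * n * f^2        # noncentrality parameter
  f_crit <- qf(alpha, df1, df2, lower.tail = FALSE)
  pf(f_crit, df1, df2, ncp = lambda, lower.tail = FALSE)
}

# Running example: expected means and within-group SD (t/ha)
mu    <- c(2.8, 3.5, 3.9, 4.1)
sigma <- 0.55
f     <- sqrt(mean((mu - mean(mu))^2)) / sigma

# Achieved power at whole-pot replicate counts
power_by_n <- sapply(4:6, anova_power, k = 4, f = f)
names(power_by_n) <- paste0("n=", 4:6)
print(round(power_by_n, 3))
```

If the assumptions hold, n = 4 should fall short of 0.80 and n = 5 should clear it, which is what the required n of 4.449 implies.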

But notice how sensitive this is to σ. If the within-group variability is higher than you estimated — say σ = 0.75 t/ha instead of 0.55, because pot placement near the greenhouse wall or edge effects turn out to matter more than you thought:

# Same means, higher within-group SD
sigma_high <- 0.75
f_high     <- sigma_m / sigma_high

cat("Cohen's f (higher variability):", round(f_high, 3), "\n")
pwr.anova.test(k = k, f = f_high, sig.level = 0.05, power = 0.80)

Output

Cohen's f (higher variability): 0.662

              n = 7.265

NOTE: n is number in each group

Now you need 8 pots per group — 32 total. The same expected treatment effect, but noisier conditions, and the required n jumps by 60%. This is why your estimate of σ matters as much as your estimate of the treatment means, and why a small pilot run of 3–4 pots is worth the investment if you have no good literature estimate.
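The same comparison can be laid out for each lever in turn. The sketch below is a base R cross-check, not the pwr workflow used above: it assumes power.anova.test() from the stats package, which takes the variance of the group means (with a k − 1 denominator, as var() computes it) and the within-group variance, and arrives at the same noncentrality k·n·f². The scenario labels and the choice of which values to vary are illustrative.

```r
mu <- c(2.8, 3.5, 3.9, 4.1)   # expected means from the running example (t/ha)

# Each row changes one lever relative to the baseline
scenarios <- data.frame(
  label = c("baseline", "noisier pots", "stricter alpha", "higher power"),
  sigma = c(0.55, 0.75, 0.55, 0.55),
  alpha = c(0.05, 0.05, 0.01, 0.05),
  power = c(0.80, 0.80, 0.80, 0.90)
)

# Solve for n in each scenario (n is left unspecified, so it is the unknown)
scenarios$n_exact <- mapply(function(s, a, p) {
  power.anova.test(groups      = length(mu),
                   between.var = var(mu),   # variance of the group means
                   within.var  = s^2,       # within-group variance
                   sig.level   = a,
                   power       = p)$n
}, scenarios$sigma, scenarios$alpha, scenarios$power)

scenarios$n_per_group <- ceiling(scenarios$n_exact)
scenarios$total_pots  <- scenarios$n_per_group * length(mu)
print(scenarios)
```

Every non-baseline row should need more pots than the baseline, which is the direction each lever is described as moving.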

How Required n Responds to Effect Size

The most instructive thing you can do with a power analysis is run it across a range of f values to see how quickly the required sample size grows as you try to detect smaller effects. The table below spans from a medium effect (f = 0.25) to a very large one (f = 1.00), at k = 4 treatments with 80% power and α = 0.05:

# Sensitivity: required n as Cohen's f changes
f_values <- c(0.25, 0.35, 0.50, 0.65, 0.80, 1.00)

for (f_val in f_values) {
  n_req <- ceiling(pwr.anova.test(k = 4, f = f_val,
                                     sig.level = 0.05, power = 0.80)$n)
  cat(sprintf("f = %.2f | n = %2d per group | %3d pots total\n",
              f_val, n_req, n_req * 4))
}

Output

f = 0.25 | n = 45 per group | 180 pots total
f = 0.35 | n = 24 per group |  96 pots total
f = 0.50 | n = 12 per group |  48 pots total
f = 0.65 | n =  8 per group |  32 pots total
f = 0.80 | n =  6 per group |  24 pots total
f = 1.00 | n =  4 per group |  16 pots total

Cohen's f   Effect category   n per group   Total pots (k = 4)   Feasibility
0.25        Medium            45            180                  Impractical for most setups
0.35        Medium–Large      24            96                   Very challenging
0.50        Large             12            48                   Challenging but possible
0.65        Large             8             32                   Feasible for most labs
0.80        Very Large        6             24                   Standard pot experiment range
1.00        Very Large        4             16                   Minimum for large expected effects

The table makes the underlying logic visible. The 4–6 replicates that tradition prescribes are only appropriate when you expect a very large effect size (f ≥ 0.80). In highly controlled pot experiments with strong N response curves, that expectation is often justified — which is why the tradition holds. But if you're testing a subtle micronutrient interaction or a soil amendment that produces modest yield responses, 4–6 replicates may leave you fundamentally unable to see the effect you're looking for.

The Reverse Calculation — Working Back from Fixed Pots

Most graduate students don't choose their pot count from scratch. They inherit a greenhouse bay with space for 24 pots, or a supervisor who says "you can have 30 pots maximum." The more useful question then is not "how many do I need?" but "given what I have, what can I actually detect?"

You run the same function, supplying n instead of leaving it empty, and solve for f:

# Reverse: minimum detectable f at different n values
n_values <- c(3, 4, 5, 6, 8, 10)

for (n_val in n_values) {
  f_min <- pwr.anova.test(k = 4, n = n_val,
                            sig.level = 0.05, power = 0.80)$f
  cat(sprintf("n = %2d per group (%2d pots): min detectable f = %.2f\n",
              n_val, n_val * 4, f_min))
}

Output

n =  3 per group (12 pots): min detectable f = 1.22
n =  4 per group (16 pots): min detectable f = 0.97
n =  5 per group (20 pots): min detectable f = 0.84
n =  6 per group (24 pots): min detectable f = 0.74
n =  8 per group (32 pots): min detectable f = 0.63
n = 10 per group (40 pots): min detectable f = 0.55

Read this table as a contract between you and your design. With 6 pots per treatment (24 pots total), you have at least an 80% chance of detecting effects where f ≥ 0.74. To know whether that threshold is achievable in your experiment, you translate it back into yield units. If σ = 0.55 t/ha, then an f of 0.74 means the standard deviation of the group means needs to be at least 0.74 × 0.55 = 0.41 t/ha. In the BARI Gom 33 context, with a full N response curve from 0 to 120 kg/ha, that's a reasonable expectation: the expected means in the running example have a standard deviation of about 0.50 t/ha. If you were testing two slow-release fertilizer formulations that differ only slightly in their N release profile, f = 0.74 would be far too optimistic and 24 pots would be a recipe for undetectable effects.
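The translation from f back into yield units is simple enough to script. The sketch below takes the minimum detectable f for n = 6 from the table above and compares it with two sets of means: the running example, and a hypothetical "subtle" response (mu_subtle) that is invented here purely for contrast and does not come from any dataset.

```r
sigma <- 0.55   # within-group SD (t/ha), as in the running example
f_min <- 0.74   # minimum detectable f at n = 6, k = 4 (from the reverse table)

# Smallest SD of the group means the design can detect with 80% power
sigma_m_min <- f_min * sigma

# SD of a set of expected group means (population form, divides by k)
sd_of_means <- function(mu) sqrt(mean((mu - mean(mu))^2))

mu_expected <- c(2.8, 3.5, 3.9, 4.1)   # running example
mu_subtle   <- c(3.5, 3.6, 3.7, 3.8)   # hypothetical small response, for contrast

detectable_expected <- sd_of_means(mu_expected) >= sigma_m_min
detectable_subtle   <- sd_of_means(mu_subtle)   >= sigma_m_min

cat(sprintf("Required SD of means: %.2f t/ha\n", sigma_m_min))
cat(sprintf("Running example:      %.2f t/ha, detectable: %s\n",
            sd_of_means(mu_expected), detectable_expected))
cat(sprintf("Subtle response:      %.2f t/ha, detectable: %s\n",
            sd_of_means(mu_subtle), detectable_subtle))
```

Writing the expected means down in this form before planting makes it obvious when a design is being asked to detect something it cannot.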

Knowing this ahead of planting is more useful than discovering it after harvest. If your reverse calculation shows your available pots can only detect very large effects, you have three options: reduce the number of treatments (fewer groups → more pots per group), constrain the scope of the comparison (test your two most contrasting treatments rather than five), or be honest in your thesis that the study is exploratory and powered to detect only large effects.
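The first option, fewer treatments, can be quantified for a fixed bay. The sketch below assumes 24 pots split evenly across k treatments and solves for the minimum detectable f with uniroot() on the noncentral F power function. The helper min_detectable_f is a name introduced here, not a pwr function. One caveat: dropping treatments also changes f itself, because f depends on which means remain, so the thresholds are only comparable once you recompute the expected f for the reduced design.

```r
# Smallest Cohen's f detectable with the requested power
# in a balanced one-way ANOVA with k groups of n pots each.
min_detectable_f <- function(k, n, alpha = 0.05, power = 0.80) {
  df1    <- k - 1
  df2    <- k * (n - 1)
  f_crit <- qf(alpha, df1, df2, lower.tail = FALSE)
  power_gap <- function(f) {
    pf(f_crit, df1, df2, ncp = k * n * f^2, lower.tail = FALSE) - power
  }
  uniroot(power_gap, interval = c(0.01, 3), tol = 1e-8)$root
}

total_pots <- 24
k_options  <- c(4, 3, 2)

f_by_k <- sapply(k_options, function(k) min_detectable_f(k, total_pots / k))
names(f_by_k) <- paste0("k=", k_options)

for (i in seq_along(k_options)) {
  cat(sprintf("k = %d treatments, n = %2d per group: min detectable f = %.2f\n",
              k_options[i], total_pots / k_options[i], f_by_k[i]))
}
```

With the same 24 pots, the detectable threshold should drop as the number of treatments drops.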

Buffer Pots and the Hailstorm Lesson

The power analysis gives you the minimum n for the experiment you're planning. It does not account for the experiment you're actually running.

Pots fail. A fungal infection wipes out two replicates in the highest-N treatment. A pipe burst in mid-January floods one end of the greenhouse bay. A hailstorm — and yes, this happens to covered and semi-covered structures at DUNTC — dislodges several pots and compromises data quality in a way you only discover during harvest. Any experiment that runs from planting to harvest over 90–120 days is exposed to a hundred small disasters that your power calculation didn't model.

Plan for 15–20% more pots than your power analysis requires. You're not padding unnecessarily — you're buying a minimum viable experiment even if things go wrong.

In practice: if your power analysis says 6 pots per group, plan for 7 or 8. If it says 20 total (5 per group), prepare soil and procure materials for 23–24; choosing 24 keeps the groups balanced at 6 pots each. The incremental cost of an extra pot at the planning stage is trivial compared to the cost of running a 90-day experiment and arriving at a result you can't defend.
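The buffer arithmetic is easiest to keep honest when it is done per group and rounded up to whole pots. A minimal sketch, assuming k = 4 treatments and a helper name (add_buffer) chosen here for illustration:

```r
# Inflate a per-group replicate count by a percentage buffer,
# rounding up so each group gets a whole number of pots.
add_buffer <- function(n_per_group, buffer_pct) {
  ceiling(n_per_group * (100 + buffer_pct) / 100)
}

k <- 4   # number of treatments

plans <- expand.grid(n_required = c(5, 6), buffer_pct = c(15, 20))
plans$n_planned  <- add_buffer(plans$n_required, plans$buffer_pct)
plans$total_pots <- plans$n_planned * k
print(plans)
```

Rounding per group rather than on the total keeps the planned design balanced, so every treatment starts with the same number of spare pots.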

For CRD experiments specifically, losing even one or two pots from one treatment group creates an unbalanced design. That's not catastrophic: a one-way ANOVA handles unequal group sizes directly in R with aov() or lm(), and with a single treatment factor the Type I, II, and III sums of squares are identical. But every lost pot reduces your power, and unequal group sizes make the F-test more sensitive to unequal variances, which is exactly the opposite direction from where you want to be going.
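How much power a particular pattern of losses costs can be estimated before it happens. The sketch below assumes the standard noncentral F result for a one-way ANOVA with unequal group sizes, where the noncentrality is the n-weighted sum of squared deviations of the group means from the weighted grand mean, divided by σ². The function name oneway_power and the three loss patterns are illustrative, and the means and SD are those of the running example.

```r
# Power of a one-way ANOVA with possibly unequal group sizes.
# n_i: pots surviving in each group; mu: expected group means; sigma: within-group SD.
oneway_power <- function(n_i, mu, sigma, alpha = 0.05) {
  stopifnot(length(n_i) == length(mu))
  N      <- sum(n_i)
  k      <- length(mu)
  mu_w   <- sum(n_i * mu) / N                      # weighted grand mean
  lambda <- sum(n_i * (mu - mu_w)^2) / sigma^2     # noncentrality parameter
  f_crit <- qf(alpha, k - 1, N - k, lower.tail = FALSE)
  pf(f_crit, k - 1, N - k, ncp = lambda, lower.tail = FALSE)
}

mu    <- c(2.8, 3.5, 3.9, 4.1)
sigma <- 0.55

designs <- list(
  full             = c(6, 6, 6, 6),   # all 24 pots survive
  lose_one_each    = c(5, 5, 5, 5),   # four pots lost, one per treatment
  lose_four_in_one = c(2, 6, 6, 6)    # four pots lost, all in the 0 N control
)

power_by_design <- sapply(designs, oneway_power, mu = mu, sigma = sigma)
print(round(power_by_design, 3))
```

The same four lost pots cost more when they all come from the control, because that group's mean sits furthest from the others and carries most of the treatment contrast. Where the damage falls matters as well as how much there is.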

The Pre-Planting Checklist

The discipline of running a power analysis is not the calculation itself. It's the discipline of making all four decisions explicitly, before any seeds go into soil, instead of justifying them after harvest. Here is what that looks like as a concrete sequence:

1. Decide the minimum meaningful effect size. From the published literature on similar crops, management practices, and measurement variables, estimate what treatment means you realistically expect. Write them down. The standard deviation of those means is your f numerator.
2. Estimate within-group variability. Look for CV or SD values reported in similar pot experiments. As a rule of thumb, pot experiments in controlled environments typically have a CV of 8–15% for grain yield. Apply that to your expected grand mean. This is σ, your f denominator.
3. Set your significance level and power. α = 0.05 and power = 0.80 are the field standards, and departing from them requires a specific reason. If your experiment has serious resource constraints, document that the study is powered for large effects only.
4. Run pwr.anova.test(). Get n. Round up. Multiply by k. That's your baseline total pot count.
5. Add 15–20% buffer pots. Procure soil, fertilizer, and materials for this inflated total. Prepare them all. You may not need the extras, and that's fine.
6. If fixed constraints apply, run the reverse calculation. Check whether the minimum detectable f is realistic for your system. If it isn't (if your available pots can only detect effects so large they'd be visible to the naked eye at harvest), revise the experimental question before planting.
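Steps 1 to 5 can be chained into one small function so that every assumption is an explicit argument. This is a sketch under stated assumptions, not a replacement for the pwr workflow above: it uses base R's power.anova.test(), takes σ from a CV applied to the expected grand mean, and rounds the buffer up per group. The function name plan_pots, its arguments, and the 15% CV in the example call are all choices made here for illustration.

```r
# Pre-planting plan: expected means + CV -> replicates and pot counts.
plan_pots <- function(mu, cv, alpha = 0.05, power = 0.80, buffer_pct = 20) {
  k     <- length(mu)
  sigma <- cv * mean(mu)                              # step 2: SD from CV
  f     <- sqrt(mean((mu - mean(mu))^2)) / sigma      # step 1: Cohen's f
  n_exact <- power.anova.test(groups      = k,        # steps 3 and 4
                              between.var = var(mu),
                              within.var  = sigma^2,
                              sig.level   = alpha,
                              power       = power)$n
  n_min  <- ceiling(n_exact)
  n_plan <- ceiling(n_min * (100 + buffer_pct) / 100) # step 5: buffer pots
  list(sigma = sigma, f = f,
       n_min = n_min,   total_min  = n_min * k,
       n_plan = n_plan, total_plan = n_plan * k)
}

# Example call: running-example means with an assumed CV of 15%
plan <- plan_pots(mu = c(2.8, 3.5, 3.9, 4.1), cv = 0.15)
str(plan)
```

For step 6, the fixed pot count goes in the other direction: supply n and solve for the detectable effect, as in the reverse calculation earlier.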

None of this is mathematically difficult. What it requires is the willingness to make explicit commitments before the data exist — to say, "I believe this treatment will produce roughly this much difference, with this much within-group variation, and I'm going to design around that belief." That is the scientific discipline that separates an experiment from an experience.

If you find that your available resources can only detect very large effects, that's useful information too. Write a thesis chapter that uses a controlled, well-powered comparison of two treatments instead of an underpowered comparison of six. The quality of evidence you produce matters more than the number of treatments you test.

The p-value post explains what α and power mean in terms of the errors they control — false positives and false negatives. The ANOVA tutorial covers what you do with the data once you've collected it. This post is what goes in between: the moment before planting when you commit to a design that gives you a fair chance of finding what you're looking for.

✦ ✦ ✦

If you're designing a pot experiment and want the power analysis run for your specific crop, variable, and constraint, that's the kind of pre-experiment consultation the data analysis service covers — including the experimental design, not just the statistics at the end.

Sajjadur Rahman

MSc Researcher · University of Dhaka · DUNTC Field Experiments

NST Fellow who has run CRD and RCBD pot experiments on BARI Gom 33 and other varieties under controlled and semi-controlled conditions. Developer of SPADE. Available for pre-experiment design consultation — power analysis, replicate planning, and CRD/RCBD layout — as well as post-harvest data analysis.
