The Real Stakes of Getting This Wrong
Before my 24-pot CRD experiment on BARI Gom 33 wheat at DUNTC, the most useful piece of experimental design advice I received was: "use three or four." Three or four replicates. That was it. The reasoning, such as it was, involved tradition, space constraints, and the fact that everyone else in the department also used three or four. A power analysis was not mentioned.
Then a hailstorm hit mid-season and destroyed four pots. You don't get a do-over in a greenhouse experiment. You work with what survives, you run your ANOVA on reduced data, and you spend a long time wondering whether the non-significant result you got means the treatment didn't work or means you simply didn't have enough pots to see that it did.
That uncertainty — between "the effect is not there" and "the effect is there but I couldn't detect it" — is exactly what a power analysis is designed to prevent. It's the difference between planning your experiment and hoping your experiment. The cost of getting it wrong is measured in pots, soil, fertilizer, greenhouse time, months of your degree, and the specific kind of frustration that comes from results you can't trust in either direction.
So. How many replicates do you actually need?
The answer depends on four numbers, and you have to supply all four before the experiment starts — not after.
The Four Levers
Understanding these conceptually before touching any formula will save you from running the calculation without knowing what you're asking.
The effect size lever (f) deserves the most attention because it's the one researchers most often set badly. Cohen's f — the conventional measure for ANOVA — is the standard deviation of the treatment group means divided by the within-group standard deviation. You calculate it by deciding what means you realistically expect to see and what the background variability is. Both of those come from your brain and your literature search, not from the data you haven't yet collected.
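Written out, the definition above looks like this. It is a sketch for the balanced case (equal pots per treatment), with k treatments, expected treatment means μ_i, and within-group standard deviation σ; the symbol σ_m for the SD of the means is a notational choice for this post:

$$
f = \frac{\sigma_m}{\sigma},
\qquad
\sigma_m = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(\mu_i - \bar{\mu}\right)^2},
\qquad
\bar{\mu} = \frac{1}{k}\sum_{i=1}^{k}\mu_i
$$

Note the divisor is k, not k − 1: σ_m describes the spread of the means you expect, not a sample estimate.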
Cohen's conventional thresholds are f = 0.10 (small), f = 0.25 (medium), and f = 0.40 (large). In controlled greenhouse and pot conditions, agricultural effects tend to produce large f values because the within-pot variability is lower than field variability. This is one reason why pot experiments work at all with only 4–6 replicates. But "tends to produce large effects" is not a power analysis.
Before calculating n, ask yourself: if the treatment worked perfectly as expected, what would the group means look like, and what within-group SD would I expect? Write those numbers down. Your f is derived from them. If you can't answer those questions from the literature, the power analysis will tell you nothing useful — and that is information worth having before you plant.
Running the Power Analysis in R
The pwr package handles power calculations for common statistical tests, including one-way ANOVA. Install it once, then use pwr.anova.test() to solve for any one of its four parameters (n, f, sig.level, power) given the other three and the number of groups k.
The running example: a 4-treatment CRD on BARI Gom 33 wheat, testing 0, 40, 80, and 120 kg N/ha as urea. From published studies on similar wheat varieties in Bangladesh, you estimate that grain yield will follow a diminishing-returns pattern with N: means approximately 2.8, 3.5, 3.9, 4.1 t/ha. You also estimate that within any single nitrogen treatment, pot-to-pot variation — due to minor soil differences, pot placement effects, and measurement error — will produce a standard deviation of about 0.55 t/ha. That SD is your estimate of σ.
```r
# Install once, then load
# install.packages("pwr")
library(pwr)

# Step 1: calculate Cohen's f from expected means and within-group SD
k <- 4                                    # number of treatments
mu <- c(2.8, 3.5, 3.9, 4.1)               # expected group means (t/ha)
sigma <- 0.55                             # within-group SD (t/ha) from literature
mu_grand <- mean(mu)
sigma_m <- sqrt(mean((mu - mu_grand)^2))  # SD of the group means
f <- sigma_m / sigma

cat("Grand mean: ", round(mu_grand, 2), "t/ha\n")
cat("sigma_m: ", round(sigma_m, 3), "\n")
cat("Cohen's f: ", round(f, 3), "\n\n")

# Step 2: find required n per group
pwr.anova.test(k = k, f = f, sig.level = 0.05, power = 0.80)
```

```
Grand mean: 3.57 t/ha
sigma_m: 0.497
Cohen's f: 0.903

     Balanced one-way analysis of variance power calculation

              k = 4
              n = 4.449
              f = 0.903
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group
```
Round up to 5 pots per group, giving 20 pots total. With the expected effect size in this scenario (f = 0.90, which is large — a consequence of the strong N response and controlled pot conditions), you need fewer replicates than you might expect. Your design is efficient precisely because you're working in a controlled environment.
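If you want to see what rounding up buys you, you can compute the power at n = 4 and n = 5 directly. The sketch below uses only base R and the noncentral F distribution, with noncentrality λ = k·n·f², which is my understanding of the calculation behind a balanced one-way ANOVA power analysis. The helper name anova_power is hypothetical, not part of pwr.

```r
# Power of a balanced one-way ANOVA via the noncentral F distribution (base R only).
# anova_power() is an illustrative helper, not a pwr function.
anova_power <- function(k, n, f, alpha = 0.05) {
  df1 <- k - 1                        # treatment degrees of freedom
  df2 <- k * (n - 1)                  # residual degrees of freedom
  lambda <- k * n * f^2               # noncentrality parameter
  f_crit <- qf(1 - alpha, df1, df2)   # rejection threshold under the null
  pf(f_crit, df1, df2, ncp = lambda, lower.tail = FALSE)
}

f <- 0.903   # Cohen's f from the running example
power_n4 <- anova_power(k = 4, n = 4, f = f)
power_n5 <- anova_power(k = 4, n = 5, f = f)

cat(sprintf("n = 4 per group: power = %.3f\n", power_n4))
cat(sprintf("n = 5 per group: power = %.3f\n", power_n5))
```

Four pots per group falls short of the 80% target and five clears it, which is all that "round up" means here.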
But notice how sensitive this is to σ. If the within-group variability is higher than you estimated — say σ = 0.75 t/ha instead of 0.55, because pot placement near the greenhouse wall or edge effects turn out to matter more than you thought:
```r
# Same means, higher within-group SD
sigma_high <- 0.75
f_high <- sigma_m / sigma_high
cat("Cohen's f (higher variability):", round(f_high, 3), "\n")
pwr.anova.test(k = k, f = f_high, sig.level = 0.05, power = 0.80)
```

```
Cohen's f (higher variability): 0.662
n = 7.265
NOTE: n is number in each group
```
Now you need 8 pots per group — 32 total. The same expected treatment effect, but noisier conditions, and the required n jumps by 60%. This is why your estimate of σ matters as much as your estimate of the treatment means, and why a small pilot run of 3–4 pots is worth the investment if you have no good literature estimate.
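Since σ is usually the shakiest input, it can help to sweep it across a plausible range instead of trusting one value. A minimal sketch, reusing the expected means from the running example; the grid of σ values is an illustrative assumption, not something taken from the literature:

```r
library(pwr)

mu <- c(2.8, 3.5, 3.9, 4.1)                # expected group means (t/ha)
sigma_m <- sqrt(mean((mu - mean(mu))^2))   # SD of the group means

# Candidate within-group SDs (t/ha); an illustrative grid, adjust to your system
sigma_grid <- c(0.45, 0.55, 0.65, 0.75)

# Required pots per group at alpha = 0.05 and 80% power for each sigma
n_required <- sapply(sigma_grid, function(s) {
  res <- pwr.anova.test(k = length(mu), f = sigma_m / s,
                        sig.level = 0.05, power = 0.80)
  ceiling(res$n)
})

for (i in seq_along(sigma_grid)) {
  cat(sprintf("sigma = %.2f | f = %.3f | n = %d per group\n",
              sigma_grid[i], sigma_m / sigma_grid[i], n_required[i]))
}
```

If the required n barely moves across your plausible range, a rough σ is good enough. If it doubles, that is the argument for a pilot run.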
How Required n Responds to Effect Size
The most instructive thing you can do with a power analysis is run it across a range of f values to see how quickly the required sample size grows as you try to detect smaller effects. The table below spans from a modest medium effect to a large one, at k = 4 treatments with 80% power and α = 0.05:
```r
# Sensitivity: required n as Cohen's f changes
f_values <- c(0.25, 0.35, 0.50, 0.65, 0.80, 1.00)
for (f_val in f_values) {
  n_req <- ceiling(pwr.anova.test(k = 4, f = f_val, sig.level = 0.05, power = 0.80)$n)
  cat(sprintf("f = %.2f | n = %2d per group | %3d pots total\n", f_val, n_req, n_req * 4))
}
```

```
f = 0.25 | n = 45 per group | 180 pots total
f = 0.35 | n = 24 per group |  96 pots total
f = 0.50 | n = 12 per group |  48 pots total
f = 0.65 | n =  8 per group |  32 pots total
f = 0.80 | n =  6 per group |  24 pots total
f = 1.00 | n =  4 per group |  16 pots total
```
| Cohen's f | Effect Category | n per Group | Total Pots (k=4) | Feasibility |
|---|---|---|---|---|
| 0.25 | Medium | 45 | 180 | Impractical for most setups |
| 0.35 | Medium–Large | 24 | 96 | Very challenging |
| 0.50 | Large | 12 | 48 | Challenging but possible |
| 0.65 | Large | 8 | 32 | Feasible for most labs |
| 0.80 | Very Large | 6 | 24 | Standard pot experiment range |
| 1.00 | Very Large | 4 | 16 | Minimum for large expected effects |
The table makes the underlying logic visible. The 4–6 replicates that tradition prescribes are only appropriate when you expect a very large effect size (f ≥ 0.80). In highly controlled pot experiments with strong N response curves, that expectation is often justified — which is why the tradition holds. But if you're testing a subtle micronutrient interaction or a soil amendment that produces modest yield responses, 4–6 replicates may leave you fundamentally unable to see the effect you're looking for.
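You can also read the same relationship from the other side: fix n at the traditional values and ask what power you actually get. A rough sketch, assuming k = 4 and α = 0.05; the three f values are chosen for illustration (Cohen's medium and large thresholds plus one very large effect):

```r
library(pwr)

f_grid <- c(0.25, 0.40, 0.80)   # medium, large, very large (illustrative choices)
n_grid <- c(4, 6)               # traditional replicate counts

# Rows are effect sizes, columns are replicates per group
power_tab <- sapply(n_grid, function(n) {
  sapply(f_grid, function(f) {
    pwr.anova.test(k = 4, n = n, f = f, sig.level = 0.05)$power
  })
})
dimnames(power_tab) <- list(paste0("f=", f_grid), paste0("n=", n_grid))

print(round(power_tab, 2))
```

The first row is the one to look at if your treatment is subtle: the power there is the probability of detecting a real medium-sized effect with a traditional design.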
The Reverse Calculation — Working Back from Fixed Pots
Most graduate students don't choose their pot count from scratch. They inherit a greenhouse bay with space for 24 pots, or a supervisor who says "you can have 30 pots maximum." The more useful question then is not "how many do I need?" but "given what I have, what can I actually detect?"
You run the same function, supplying n instead of leaving it empty, and solve for f:
```r
# Reverse: minimum detectable f at different n values
n_values <- c(3, 4, 5, 6, 8, 10)
for (n_val in n_values) {
  f_min <- pwr.anova.test(k = 4, n = n_val, sig.level = 0.05, power = 0.80)$f
  cat(sprintf("n = %2d per group (%2d pots): min detectable f = %.2f\n", n_val, n_val * 4, f_min))
}
```

```
n =  3 per group (12 pots): min detectable f = 1.22
n =  4 per group (16 pots): min detectable f = 0.97
n =  5 per group (20 pots): min detectable f = 0.84
n =  6 per group (24 pots): min detectable f = 0.74
n =  8 per group (32 pots): min detectable f = 0.63
n = 10 per group (40 pots): min detectable f = 0.55
```
Read this table as a contract between you and your design. With 6 pots per treatment (24 pots total), you can reliably detect effects where f ≥ 0.74. To know whether that threshold is achievable in your experiment, you translate it back into yield units. If σ = 0.55 t/ha, then an f of 0.74 means the standard deviation of the group means needs to be at least 0.74 × 0.55 = 0.41 t/ha. In the BARI Gom 33 context, with a full N response curve from 0 to 120 kg/ha, that's a reasonable expectation. If you were testing two slow-release fertilizer formulations that differ only slightly in their N release profile, f = 0.74 would be far too optimistic and 24 pots would be a recipe for undetectable effects.
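The translation back into yield units is plain arithmetic, but it is worth scripting so you can rerun it when σ changes. A small sketch using the numbers from this example; the minimum detectable f of 0.74 is copied from the n = 6 row above, not recomputed:

```r
sigma <- 0.55   # assumed within-group SD (t/ha)
f_min <- 0.74   # minimum detectable f at n = 6 per group (from the table above)

# Spread of treatment means the design needs in order to reach 80% power
sigma_m_required <- f_min * sigma

# Spread of treatment means you actually expect from the N response curve
mu <- c(2.8, 3.5, 3.9, 4.1)
sigma_m_expected <- sqrt(mean((mu - mean(mu))^2))

cat(sprintf("Required SD of group means: %.2f t/ha\n", sigma_m_required))
cat(sprintf("Expected SD of group means: %.2f t/ha\n", sigma_m_expected))
cat("Expected effect clears the threshold:", sigma_m_expected >= sigma_m_required, "\n")
```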
Knowing this ahead of planting is more useful than discovering it after harvest. If your reverse calculation shows your available pots can only detect very large effects, you have three options: reduce the number of treatments (fewer groups → more pots per group), constrain the scope of the comparison (test your two most contrasting treatments rather than five), or be honest in your thesis that the study is exploratory and powered to detect only large effects.
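To see what the first two options buy you, hold the pot count fixed and vary the number of treatments. The sketch below assumes a 24-pot bay and σ = 0.55 t/ha, and uses treatment counts that divide 24 evenly. One caveat: f is defined over whichever treatments you keep, so the thresholds for different k are not directly comparable in yield terms; the last lines recompute the expected f for the two extreme N rates to make that concrete.

```r
library(pwr)

total_pots <- 24
sigma <- 0.55                 # assumed within-group SD (t/ha)
k_options <- c(2, 3, 4, 6)    # treatment counts that divide 24 evenly

# Minimum detectable f at 80% power when the same 24 pots are split k ways
f_min <- sapply(k_options, function(k) {
  pwr.anova.test(k = k, n = total_pots / k, sig.level = 0.05, power = 0.80)$f
})

for (i in seq_along(k_options)) {
  cat(sprintf("k = %d treatments, n = %2d per group: min detectable f = %.2f\n",
              k_options[i], total_pots / k_options[i], f_min[i]))
}

# Expected f if only the two most contrasting rates (0 and 120 kg N/ha) are kept
mu_extremes <- c(2.8, 4.1)
f_extremes <- sqrt(mean((mu_extremes - mean(mu_extremes))^2)) / sigma
cat(sprintf("Expected f for 0 vs 120 kg N/ha only: %.2f\n", f_extremes))
```

Dropping treatments helps twice: the detection threshold falls, and the expected f rises because the remaining means are further apart.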
Buffer Pots and the Hailstorm Lesson
The power analysis gives you the minimum n for the experiment you're planning. It does not account for the experiment you're actually running.
Pots fail. A fungal infection wipes out two replicates in the highest-N treatment. A pipe burst in mid-January floods one end of the greenhouse bay. A hailstorm — and yes, this happens to covered and semi-covered structures at DUNTC — dislodges several pots and compromises data quality in a way you only discover during harvest. Any experiment that runs from planting to harvest over 90–120 days is exposed to a hundred small disasters that your power calculation didn't model.
Plan for 15–20% more pots than your power analysis requires. You're not padding unnecessarily — you're buying a minimum viable experiment even if things go wrong.
In practice: if your power analysis says 6 pots per group, plan for 7 or 8. If it says 20 total, prepare soil and procure materials for 23–24; with four treatments, 24 gives you one spare pot in every group. The incremental cost of an extra pot at the planning stage is trivial compared to the cost of running a 90-day experiment and arriving at a result you can't defend.
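The buffer is easiest to apply per group, so the extras stay balanced across treatments. A minimal sketch; buffered_n is a hypothetical helper name, and the percentage is passed as a whole number to keep the rounding exact:

```r
# Pots per group after adding a percentage buffer, rounded up to whole pots.
# buffered_n() is an illustrative helper, not from any package.
buffered_n <- function(n_per_group, buffer_pct) {
  ceiling(n_per_group * (100 + buffer_pct) / 100)
}

k <- 4   # number of treatments
for (n in c(5, 6)) {
  low  <- buffered_n(n, 15)
  high <- buffered_n(n, 20)
  cat(sprintf("n = %d per group: plan %d to %d per group (%d to %d pots total)\n",
              n, low, high, low * k, high * k))
}
```

Because you can't plant a fraction of a pot, rounding up per group often lands above the nominal 15–20%. That is the right direction to err in.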
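For CRD experiments specifically, losing even one or two pots from one treatment group creates an unbalanced design. That's not catastrophic: a one-way ANOVA with unequal group sizes runs in R with aov() exactly as the balanced case does, and with a single treatment factor the choice between Type I, II, and III sums of squares makes no difference. (aov() itself reports sequential, Type I sums of squares; the choice becomes a real decision only in multi-factor designs, where car::Anova() is the usual tool.) But every lost pot lowers your power, and losses concentrated in one treatment hurt the comparisons involving that treatment most, which is exactly the opposite direction from where you want to be going.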
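Here is roughly what the analysis looks like after losses. The yields are simulated from the running example's assumed means and SD, so the numbers are hypothetical; the point is only that aov() accepts unequal group sizes without any special handling.

```r
set.seed(42)   # simulated, hypothetical data for illustration only

k <- 4
n <- 6
mu <- c(2.8, 3.5, 3.9, 4.1)   # assumed treatment means (t/ha)
levels_n <- c("N0", "N40", "N80", "N120")

pots <- data.frame(
  treatment = factor(rep(levels_n, each = n), levels = levels_n),
  yield     = rnorm(k * n, mean = rep(mu, each = n), sd = 0.55)
)

# Simulate losing four pots: one from N0, one from N80, two from N120
lost <- c(1, 13, 19, 20)
survivors <- pots[-lost, ]

print(table(survivors$treatment))   # unequal group sizes

# One-way ANOVA on the unbalanced data
fit <- aov(yield ~ treatment, data = survivors)
print(summary(fit))
```

The residual degrees of freedom drop from 20 to 16, which is where the loss of power shows up in the ANOVA table.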
The Pre-Planting Checklist
The discipline of running a power analysis is not the calculation itself. It's the discipline of making all four decisions explicitly, before any seeds go into soil, instead of justifying them after harvest. Here is what that looks like as a concrete sequence:
1. Decide the minimum meaningful effect size. From the published literature on similar crops, management practices, and measurement variables, estimate what treatment means you realistically expect. Write them down. This is your f numerator.
2. Estimate within-group variability. Look for CV or SD values reported in similar pot experiments. As a rule of thumb, pot experiments in controlled environments typically have CV of 8–15% for grain yield. Apply that to your expected grand mean. This is σ, your f denominator.
3. Set your significance level and power. α = 0.05 and power = 0.80 are the field standards, and departing from them requires a specific reason. If your experiment has serious resource constraints, document that the study is powered for large effects only.
4. Run pwr.anova.test(). Get n. Round up. Multiply by k. That's your baseline total pot count.
5. Add 15–20% buffer pots. Procure soil, fertilizer, and materials for this inflated total. Prepare them all. You may not need the extras, and that's fine.
6. If fixed constraints apply, run the reverse calculation. Check whether the minimum detectable f is realistic for your system. If it isn't, meaning your available pots can only detect effects so large they'd be visible to the naked eye at harvest, revise the experimental question before planting.
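Steps 1 through 5 can be wrapped in one small function so the whole plan is a single reproducible call. This is a sketch under the assumptions used throughout this post (balanced one-way design, buffer applied per group); plan_pots and its argument names are hypothetical, not part of pwr.

```r
library(pwr)

# plan_pots() is an illustrative helper: expected means and sigma in, pot counts out
plan_pots <- function(mu, sigma, alpha = 0.05, power = 0.80, buffer_pct = 20) {
  k <- length(mu)
  sigma_m <- sqrt(mean((mu - mean(mu))^2))   # step 1: spread of expected means
  f <- sigma_m / sigma                       # step 2: divide by within-group SD
  res <- pwr.anova.test(k = k, f = f,        # steps 3-4: alpha, power, solve for n
                        sig.level = alpha, power = power)
  n <- ceiling(res$n)
  n_buffered <- ceiling(n * (100 + buffer_pct) / 100)   # step 5: buffer pots
  list(f = f, n_per_group = n, total = n * k,
       n_buffered = n_buffered, total_buffered = n_buffered * k)
}

# Running example: four N rates, sigma = 0.55 t/ha
plan <- plan_pots(mu = c(2.8, 3.5, 3.9, 4.1), sigma = 0.55)
str(plan)
```

If all you have is a CV from the literature, multiply it by your expected grand mean to get sigma before calling the function.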
None of this is mathematically difficult. What it requires is the willingness to make explicit commitments before the data exist — to say, "I believe this treatment will produce roughly this much difference, with this much within-group variation, and I'm going to design around that belief." That is the scientific discipline that separates an experiment from an experience.
If you find that your available resources can only detect very large effects, that's useful information too. Write a thesis chapter that uses a controlled, well-powered comparison of two treatments instead of an underpowered comparison of six. The quality of evidence you produce matters more than the number of treatments you test.
The p-value post explains what α and power mean in terms of the errors they control — false positives and false negatives. The ANOVA tutorial covers what you do with the data once you've collected it. This post is what goes in between: the moment before planting when you commit to a design that gives you a fair chance of finding what you're looking for.
If you're designing a pot experiment and want the power analysis run for your specific crop, variable, and constraint, that's the kind of pre-experiment consultation the data analysis service covers — including the experimental design, not just the statistics at the end.
Sajjadur Rahman
MSc Researcher · University of Dhaka · DUNTC Field Experiments
NST Fellow who has run CRD and RCBD pot experiments on BARI Gom 33 and other varieties under controlled and semi-controlled conditions. Developer of SPADE. Available for pre-experiment design consultation (power analysis, replicate planning, and CRD/RCBD layout) as well as post-harvest data analysis.