The ANOVA tutorial covered one dependent variable against one categorical factor. The regression tutorial covered one outcome against one or more continuous predictors. Those tools handle most thesis analyses — but once you have ten soil parameters measured across twenty sites and two seasons, you need something different. You do not want ten separate ANOVAs. You want to know which variables move together, what patterns they form, and which sites or seasons drive those patterns. That is what PCA does.
PCA is the most commonly used multivariate method in soil and environmental science, and one of the most commonly misapplied. This tutorial walks through the complete procedure in R using a realistic paddy soil dataset from the Haor basin in Sylhet — ten chemical variables, twelve observations, two sampling seasons. By the end you will understand what each output means, how to interpret loadings and scores, and how to write the results for a manuscript.
When PCA Is the Right Tool
Use PCA when you have multiple correlated continuous variables and you want to reduce them to a smaller number of uncorrelated dimensions that capture most of the variance in the data. The key word is correlated. If your variables are completely independent of each other, PCA produces nothing useful. If they share underlying structure — which soil chemistry variables almost always do — PCA reveals it.
Common applications in environmental and agricultural research include: identifying which soil properties co-vary across sites (usually driven by parent material, drainage, or management); separating seasonal or land-use effects from site-level variation; generating input variables for cluster analysis or classification; and reducing multicollinearity before regression.
PCA is exploratory. It does not test a hypothesis. It does not tell you whether two groups differ significantly — that is what ANOVA does. It tells you about the structure of your data: which variables cluster together, and which observations are similar or different. The statistical test guide in the previous post covers when to use PCA versus when to use a test.
PCA works on a correlation or covariance matrix and requires that all variables be continuous and measured on a meaningful numerical scale. Categorical variables (soil texture class, land use type, treatment group) cannot go into a PCA directly. Convert them to binary dummy variables only if you have a specific reason to do so — and be aware that the interpretation becomes more complex.
What PCA Actually Does
PCA rotates your original variable axes to find the directions of maximum variance in the data. The first principal component (PC1) is the axis that explains the most variance. PC2 is perpendicular to PC1 and explains the next most variance. PC3 is perpendicular to both, and so on. Each PC is a linear combination of the original variables.
The mathematical machinery is eigendecomposition of the correlation matrix (or covariance matrix, but correlation is almost always better for environmental data with mixed units and scales). Each PC corresponds to an eigenvector, and the eigenvalue tells you how much variance that component explains. A component with an eigenvalue above 1.0 explains more variance than any single original variable — this is the Kaiser criterion for deciding how many components to retain.
The loadings are the correlations between each original variable and each PC. A high positive loading means the variable contributes strongly and positively to that component. A high negative loading means strong negative contribution. Loadings close to zero mean the variable is essentially unrelated to that component.
The scores are the coordinates of each observation (each soil sample, in our case) in the new PC coordinate system. They tell you where each sample sits relative to the patterns the PCA identified.
The Dataset — Haor Basin Paddy Soils
The Haor basin in greater Sylhet is one of Bangladesh's largest freshwater wetland systems. Paddy cultivation in the basin alternates between an aman rice crop during the monsoon season (waterlogged conditions, June–November) and a dry period (December–May) when fields are partially drained. This seasonal alternation between flooded and aerobic conditions drives dramatic changes in soil redox chemistry — particularly iron (Fe) and manganese (Mn), which become highly soluble under waterlogging.
The dataset below contains twelve soil samples from six sites, sampled once in the pre-monsoon period (January) and once post-monsoon (October). Ten variables were measured: pH, electrical conductivity (EC in dS/m), organic matter (OM in %), total nitrogen (TN in %), available phosphorus (Av-P in mg/kg), exchangeable potassium (Ex-K in cmol/kg), and four micronutrients — iron (Fe), manganese (Mn), zinc (Zn), and copper (Cu), all in mg/kg.
| Site | Season | pH | EC | OM% | TN% | Av-P | Ex-K | Fe | Mn | Zn | Cu |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | Pre | 6.2 | 0.24 | 2.1 | 0.09 | 8.4 | 0.18 | 195 | 18.2 | 1.2 | 2.1 |
| S1 | Post | 5.8 | 0.31 | 2.6 | 0.11 | 10.2 | 0.22 | 328 | 34.6 | 1.8 | 2.8 |
| S2 | Pre | 6.5 | 0.28 | 2.4 | 0.10 | 11.8 | 0.21 | 210 | 20.1 | 1.4 | 2.3 |
| S2 | Post | 6.0 | 0.38 | 3.0 | 0.13 | 14.6 | 0.28 | 362 | 38.4 | 2.0 | 3.1 |
| S3 | Pre | 6.8 | 0.22 | 1.9 | 0.08 | 7.2 | 0.16 | 182 | 15.8 | 1.1 | 1.9 |
| S3 | Post | 6.3 | 0.29 | 2.3 | 0.10 | 9.8 | 0.20 | 298 | 30.2 | 1.6 | 2.6 |
| S4 | Pre | 5.9 | 0.35 | 3.2 | 0.14 | 18.4 | 0.32 | 248 | 26.4 | 1.9 | 3.2 |
| S4 | Post | 5.5 | 0.48 | 4.0 | 0.18 | 22.6 | 0.40 | 418 | 46.8 | 2.8 | 4.2 |
| S5 | Pre | 6.6 | 0.26 | 2.2 | 0.09 | 9.2 | 0.19 | 204 | 19.4 | 1.3 | 2.2 |
| S5 | Post | 6.1 | 0.34 | 2.8 | 0.12 | 12.4 | 0.25 | 344 | 36.8 | 1.9 | 2.9 |
| S6 | Pre | 7.0 | 0.20 | 1.8 | 0.08 | 6.8 | 0.14 | 176 | 14.6 | 1.0 | 1.8 |
| S6 | Post | 6.5 | 0.28 | 2.2 | 0.10 | 9.4 | 0.18 | 286 | 28.4 | 1.5 | 2.4 |
S4 is the high-organic-matter site (rice straw incorporated for several seasons, lower pH from organic acid accumulation). S6 is sandier with lower nutrient status and higher pH. The post-monsoon samples consistently show lower pH, higher EC, higher Fe and Mn, and higher OM and nutrient values — the signature of waterlogging-driven reduction and organic matter decomposition.
01 Prepare and Scale the Data
Enter the data into R and check it before running PCA. Two things matter here: making sure there are no missing values, and deciding whether to scale.
Scaling (standardising to mean = 0, variance = 1) is almost always necessary in environmental science because variables are on different scales and units. Without scaling, Fe (in mg/kg, values 176–418) would dominate the PCA simply because it has the largest numerical values — not because it explains the most meaningful variance. With scaling, each variable contributes based on its relative variability.
# Enter the dataset soil <- data.frame( site = c("S1","S1","S2","S2","S3","S3","S4","S4","S5","S5","S6","S6"), season = c("Pre","Post","Pre","Post","Pre","Post","Pre","Post","Pre","Post","Pre","Post"), pH = c(6.2, 5.8, 6.5, 6.0, 6.8, 6.3, 5.9, 5.5, 6.6, 6.1, 7.0, 6.5), EC = c(0.24,0.31,0.28,0.38,0.22,0.29,0.35,0.48,0.26,0.34,0.20,0.28), OM = c(2.1, 2.6, 2.4, 3.0, 1.9, 2.3, 3.2, 4.0, 2.2, 2.8, 1.8, 2.2), TN = c(0.09,0.11,0.10,0.13,0.08,0.10,0.14,0.18,0.09,0.12,0.08,0.10), AvP = c(8.4, 10.2,11.8,14.6, 7.2, 9.8,18.4,22.6, 9.2,12.4, 6.8, 9.4), ExK = c(0.18,0.22,0.21,0.28,0.16,0.20,0.32,0.40,0.19,0.25,0.14,0.18), Fe = c(195, 328, 210, 362, 182, 298, 248, 418, 204, 344, 176, 286), Mn = c(18.2,34.6,20.1,38.4,15.8,30.2,26.4,46.8,19.4,36.8,14.6,28.4), Zn = c(1.2, 1.8, 1.4, 2.0, 1.1, 1.6, 1.9, 2.8, 1.3, 1.9, 1.0, 1.5), Cu = c(2.1, 2.8, 2.3, 3.1, 1.9, 2.6, 3.2, 4.2, 2.2, 2.9, 1.8, 2.4) ) # Check for missing values sum(is.na(soil)) # should return 0 # Select only the numeric variables for PCA soil_num <- soil[, c("pH","EC","OM","TN","AvP","ExK","Fe","Mn","Zn","Cu")] # Check correlations — high correlations mean PCA will find structure round(cor(soil_num), 2)
02 Run PCA with prcomp()
R has two PCA functions: prcomp() and princomp(). Use prcomp(). It uses singular value decomposition (SVD), which is numerically more stable than the eigenvalue decomposition that princomp() uses. Set scale. = TRUE to scale variables automatically — this is equivalent to running PCA on the correlation matrix.
# Run PCA — scale. = TRUE standardises variables pca <- prcomp(soil_num, center = TRUE, scale. = TRUE) # Summary: standard deviation, proportion of variance, cumulative proportion summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 3.0323 0.76705 0.43303 0.11347 0.08004 0.06760 0.05622
Proportion of Variance 0.9195 0.05884 0.01875 0.00129 0.00064 0.00046 0.00032
Cumulative Proportion 0.9195 0.97829 0.99704 0.99833 0.99897 0.99943 0.99974
PC8 PC9 PC10
Standard deviation 0.04512 0.01881 0.01281
Proportion of Variance 0.00020 0.00004 0.00002
Cumulative Proportion 0.99995 0.99998 1.00000
PC1 alone explains 91.95% of the total variance. PC2 explains another 5.88%. Together, two components capture 97.83% of the information in ten variables. This is an exceptionally clean PCA result — it tells you that the ten soil properties are essentially varying along one dominant axis, with a secondary axis capturing most of the remaining structure.
This makes biological sense. Waterlogging during the post-monsoon season drives Fe, Mn, EC upward and pH downward simultaneously — one process affecting multiple variables at once. That shared driver is what PC1 is capturing.
03 Scree Plot — How Many Components?
A scree plot graphs the eigenvalues (or variance proportions) for each component. You look for an "elbow" — the point where the curve flattens — and retain the components above it. The Kaiser criterion (eigenvalue > 1.0) is the most commonly used cutoff in environmental science.
# Scree plot — base R plot(pca, type = "l", main = "Scree Plot — Haor Basin Soil PCA") abline(h = 1, lty = 2, col = "red") # Kaiser criterion line # Or with ggplot2 for a publication-ready version library(ggplot2) var_exp <- pca$sdev^2 / sum(pca$sdev^2) * 100 scree_df <- data.frame(PC = paste0("PC", 1:length(var_exp)), Variance = var_exp) scree_df$PC <- factor(scree_df$PC, levels = scree_df$PC) ggplot(scree_df, aes(x = PC, y = Variance, group = 1)) + geom_col(fill = "#8b7e5a", alpha = 0.8, width = 0.6) + geom_line(colour = "#1e1c18", linewidth = 0.8) + geom_point(colour = "#1e1c18", size = 3) + geom_hline(yintercept = 10, linetype = "dashed", colour = "red") + labs(x = "Principal Component", y = "Variance Explained (%)", title = "Scree Plot — Haor Basin Paddy Soil PCA") + theme_classic(base_size = 13) ggsave("haor_scree_plot.png", dpi = 300, width = 7, height = 4.5)
In this dataset, the scree plot shows a sharp drop after PC1 (which explains ~86%) and a much smaller contribution from PC2 (~16%). Components PC3 and beyond contribute less than 6% each and fall below the eigenvalue = 1.0 criterion. Retain PC1 and PC2 — they together explain over 97% of the total variance, which is exceptional for environmental data (60–80% cumulative across 2–3 PCs is more typical).
04 Loadings — What Does Each Component Measure?
Loadings are the most important part of interpreting a PCA. They tell you how strongly each original variable contributes to each PC, and in which direction. A loading > 0.6 in absolute value is generally considered high; > 0.4 is moderate.
# Loadings (rotation matrix) round(pca$rotation[, 1:2], 3) # As a data frame for easier reading loadings_df <- data.frame( Variable = rownames(pca$rotation), PC1 = round(pca$rotation[, 1], 3), PC2 = round(pca$rotation[, 2], 3) ) print(loadings_df)
Variable PC1 PC2 pH pH 0.300 0.102 EC EC -0.328 -0.001 OM OM -0.325 0.211 TN TN -0.325 0.199 AvP AvP -0.309 0.439 ExK ExK -0.319 0.321 Fe Fe -0.293 -0.586 Mn Mn -0.303 -0.509 Zn Zn -0.328 -0.080 Cu Cu -0.329 0.036
The loading table reveals a clear story. PC1 — the waterlogging / redox axis: almost all variables load highly and negatively on PC1, except pH which loads positively (+0.300). This reflects the seasonal waterlogging effect: under post-monsoon flooding, Fe, Mn, EC, OM, TN, Av-P, and Ex-K all increase while pH drops — so post-monsoon samples score negatively on PC1, and pre-monsoon samples score positively. The sign of a PC is arbitrary in PCA; what matters is that PC1 cleanly separates the two seasons along a single redox-driven gradient.
PC2 — the site fertility axis: Av-P (+0.439), Ex-K (+0.321), OM (+0.211), and TN (+0.199) load positively on PC2, while Fe (−0.586) and Mn (−0.509) load strongly negatively. This separates high-nutrient sites (S4 pre-monsoon: high organic matter accumulation, elevated P and K) from sites where Fe and Mn dominate the chemistry (lower-nutrient, higher-reduction sites). PC2 is a site-level fertility signal independent of season.
| Variable | PC1 Loading | PC2 Loading | Interpretation |
|---|---|---|---|
| pH | +0.300 | +0.10 | Only variable to load positively on PC1; decreases under waterlogging |
| EC | −0.328 | −0.00 | Ions released under reduction; strongly seasonal, minimal PC2 signal |
| OM | −0.325 | +0.21 | Both seasonal and site fertility drivers |
| TN | −0.325 | +0.20 | Tightly coupled to OM; site + season signal |
| Av-P | −0.309 | +0.44 | Strongest positive PC2 loading — key site fertility indicator |
| Ex-K | −0.319 | +0.32 | Fertility axis; elevated in organic-rich sites |
| Fe | −0.293 | −0.586 | Strongest negative PC2 loading — pure redox indicator |
| Mn | −0.303 | −0.509 | Co-varies with Fe; second strongest PC2 loading |
| Zn | −0.328 | −0.08 | Mainly a seasonal signal; minimal PC2 contribution |
| Cu | −0.329 | +0.04 | Strong PC1 loading; essentially no PC2 signal |
05 Biplot
A biplot overlays the sample scores and the variable loadings on the same plot. Samples that are close together are similar in overall soil chemistry. Variables with arrows pointing in the same direction are positively correlated. Variables with arrows in opposite directions are negatively correlated. The length of an arrow indicates how well that variable is represented by the two displayed components.
# Base R biplot (quick check) biplot(pca, scale = 0, xlab = paste0("PC1 (", round(var_exp[1], 1), "% variance)"), ylab = paste0("PC2 (", round(var_exp[2], 1), "% variance)"), main = "Haor Basin Paddy Soil PCA Biplot") # Publication-ready biplot using ggplot2 only (no extra packages needed) scores_plot <- as.data.frame(pca$x[, 1:2]) scores_plot$site <- soil$site scores_plot$season <- soil$season # Scale loadings to overlay on score plot scale_factor <- 3 loadings_plot <- as.data.frame(pca$rotation[, 1:2] * scale_factor) loadings_plot$var <- rownames(loadings_plot) ggplot() + # Sample scores geom_point(data = scores_plot, aes(x = PC1, y = PC2, colour = season, shape = site), size = 3.5) + geom_text(data = scores_plot, aes(x = PC1, y = PC2, label = paste0(site, "-", substr(season, 1, 2))), nudge_y = 0.22, size = 3) + # Loading arrows geom_segment(data = loadings_plot, aes(x = 0, y = 0, xend = PC1, yend = PC2), arrow = arrow(length = unit(0.2, "cm")), colour = "#8b7e5a", linewidth = 0.7) + geom_text(data = loadings_plot, aes(x = PC1 * 1.12, y = PC2 * 1.12, label = var), colour = "#8b7e5a", size = 3.2, fontface = "bold") + scale_colour_manual(values = c("Pre" = "#1e1c18", "Post" = "#b44028")) + labs(x = paste0("PC1 (", round(var_exp[1], 1), "% variance)"), y = paste0("PC2 (", round(var_exp[2], 1), "% variance)"), colour = "Season", title = "PCA Biplot — Haor Basin Paddy Soils") + theme_classic(base_size = 13) + theme(legend.position = "bottom") ggsave("haor_pca_biplot.png", dpi = 300, width = 7, height = 6)
The biplot from this dataset shows a clear separation between pre-monsoon samples (positive PC1, right side) and post-monsoon samples (negative PC1, left side) along the horizontal axis. Because pH is the only variable with a positive loading on PC1, its arrow points right (pre-monsoon = higher pH), while Fe, Mn, EC, and all nutrient arrows point left (post-monsoon = more waterlogging, higher Fe/Mn/EC). Along the vertical axis (PC2), S4 pre-monsoon sits at the top (high Av-P, Ex-K, OM relative to Fe/Mn) — the fertility signal. S1 post-monsoon sits at the bottom of PC2 (Fe and Mn elevated relative to nutrients). This visual structure confirms what the loadings table told us arithmetically.
06 Extract Scores for Downstream Analysis
PC scores are the coordinates of each observation in the PC space. You can extract them and use them as new variables — for example, as inputs to cluster analysis, as predictors in a regression, or simply to label which samples are most extreme on each axis.
# Extract scores scores <- as.data.frame(pca$x) scores$site <- soil$site scores$season <- soil$season # View PC1 and PC2 scores with site/season labels print(scores[, c("site","season","PC1","PC2")]) # Identify the most extreme samples on each axis cat("Highest PC1 (most 'post-monsoon' character):\n") scores[which.max(scores$PC1), ] cat("Highest PC2 (highest fertility):\n") scores[which.max(scores$PC2), ] # Use PC scores as input to k-means clustering set.seed(42) km <- kmeans(scores[, c("PC1","PC2")], centers = 2) scores$cluster <- as.factor(km$cluster) table(scores$season, scores$cluster)
site season PC1 PC2 1 S1 Pre 2.186 0.240 2 S1 Post -0.872 -1.055 ← lowest PC2 3 S2 Pre 1.262 0.700 4 S2 Post -2.526 -0.522 5 S3 Pre 3.348 0.267 6 S3 Post 0.408 -0.771 7 S4 Pre -2.315 1.568 ← highest PC2 8 S4 Post -6.936 0.405 ← lowest PC1 (most post-monsoon) 9 S5 Pre 2.046 0.349 10 S5 Post -1.521 -0.752 11 S6 Pre 3.907 0.271 ← highest PC1 (most pre-monsoon) 12 S6 Post 1.013 -0.699
S4 post-monsoon (row 8) has the most negative PC1 score (−6.94), making it the most waterlogging-affected sample in the dataset. S6 pre-monsoon has the highest PC1 score (+3.91), representing the most aerobic, pre-flood state. On PC2, S4 pre-monsoon scores highest (+1.57) — this site has high OM and nutrients accumulated from organic matter inputs, but measured before the onset of waterlogging so Fe and Mn are still low. S1 post-monsoon scores most negatively on PC2 (−1.06), reflecting Fe and Mn elevated relative to its nutrient status.
The k-means clustering on PC scores mostly separates seasons (5 Pre in one cluster, 4 Post in another), though S4 Pre clusters with the post-monsoon group because its negative PC1 score reflects its characteristically higher OM and nutrients even before flooding. This imperfect separation confirms that site-level differences (PC2) add meaningful variation beyond the dominant seasonal signal.
Writing Up PCA Results
PCA results in a manuscript typically go in the Results section, with a brief description of the method in Materials and Methods. Here is a template based on the Haor basin dataset:
PCA of ten soil chemical variables across twelve Haor basin paddy soil samples (6 sites × 2 seasons) reduced data dimensionality to two principal components (PCs) that together explained 97.83% of total variance (PC1: 91.95%; PC2: 5.88%). The scree plot indicated a sharp break after PC1, and the first component substantially exceeded the Kaiser eigenvalue criterion (>1.0) while PC2 was retained based on the scree plot elbow.
PC1 (91.95% variance) represented a seasonal waterlogging–redox gradient, characterised by strong negative loadings for EC (−0.328), Zn (−0.328), Cu (−0.329), OM (−0.325), TN (−0.325), and positive loading for pH (+0.300). Post-monsoon samples consistently scored negatively on PC1, reflecting reduction-driven release of Fe²⁺ and Mn²⁺ and associated pH decline, while pre-monsoon samples scored positively. PC2 (5.88% variance) represented a site fertility gradient, with the highest positive loadings for Av-P (+0.439), Ex-K (+0.321), and OM (+0.211), and the strongest negative loadings for Fe (−0.586) and Mn (−0.509). Site S4 pre-monsoon scored highest on PC2, reflecting its history of organic matter inputs and elevated plant-available nutrients independent of seasonal flooding.
Mistakes to Avoid
Running PCA without scaling when variables have different units (Fe in mg/kg vs pH in log units) produces a PCA driven entirely by whichever variable has the largest numerical values. Always set scale. = TRUE in prcomp() unless you have a specific reason to use the covariance matrix and all variables are in the same units.
Keeping only PC1 when PC2 explains meaningful variance (say, >10%) discards important structure. Keeping six components when only two exceed eigenvalue 1.0 produces uninterpretable results. Use the scree plot and the Kaiser criterion together — and always check that the cumulative variance of retained components is at least 60–70%.
The terminology in PCA gets confused with factor analysis in some textbooks. In prcomp(): $rotation contains the loadings (contribution of each variable to each PC); $x contains the scores (position of each observation on each PC). If you mix these up, your interpretation will be inverted.
A biplot without a variance explanation table is incomplete. Always report the proportion of variance explained by each retained component, and the cumulative proportion. Readers need to know whether your PC1 explains 86% or 32% of total variance — the interpretation is very different.
PCA tells you about structure, not about significance. If you want to test whether two groups (pre-monsoon vs post-monsoon) differ significantly in their multivariate soil chemistry, use MANOVA or PERMANOVA on the PC scores — not PCA alone. PCA is an exploratory first step; the test comes after.
PCA is one of the most useful tools in environmental data analysis once you understand what it is actually doing. If you have a multi-variable environmental or agricultural dataset and want PCA run properly — with interpretation, a publication-ready biplot, and a written results section — the data analysis service covers exactly this.
Sajjadur Rahman
MSc Researcher · Data Analyst · University of DhakaNST Fellow and active researcher using R, SPSS, and Python across soil science and environmental experiments. Developer of SPADE. Experienced with PCA, cluster analysis, ANOVA, and multivariate ordination on environmental datasets from Bangladesh. Available for data analysis, biplot interpretation, and manuscript writing.