Researcher · Writer · Consultant · Dhaka, Bangladesh গবেষক · লেখক · পরামর্শদাতা · ঢাকা, বাংলাদেশ
Home About Services Research Blog Contact
Blog Data Analysis

Principal Component Analysis
in R for Environmental Data

PCA in six steps — from raw soil chemistry data to interpreted components and a publication-ready biplot. Working example: paddy soil samples from the Haor basin, Sylhet, with ten variables measured across two seasons.

The ANOVA tutorial covered one dependent variable against one categorical factor. The regression tutorial covered one outcome against one or more continuous predictors. Those tools handle most thesis analyses — but once you have ten soil parameters measured across twenty sites and two seasons, you need something different. You do not want ten separate ANOVAs. You want to know which variables move together, what patterns they form, and which sites or seasons drive those patterns. That is what PCA does.

PCA is the most commonly used multivariate method in soil and environmental science, and one of the most commonly misapplied. This tutorial walks through the complete procedure in R using a realistic paddy soil dataset from the Haor basin in Sylhet — ten chemical variables, twelve observations, two sampling seasons. By the end you will understand what each output means, how to interpret loadings and scores, and how to write the results for a manuscript.

When PCA Is the Right Tool

Use PCA when you have multiple correlated continuous variables and you want to reduce them to a smaller number of uncorrelated dimensions that capture most of the variance in the data. The key word is correlated. If your variables are completely independent of each other, PCA produces nothing useful. If they share underlying structure — which soil chemistry variables almost always do — PCA reveals it.

Common applications in environmental and agricultural research include: identifying which soil properties co-vary across sites (usually driven by parent material, drainage, or management); separating seasonal or land-use effects from site-level variation; generating input variables for cluster analysis or classification; and reducing multicollinearity before regression.

PCA is exploratory. It does not test a hypothesis. It does not tell you whether two groups differ significantly — that is what ANOVA does. It tells you about the structure of your data: which variables cluster together, and which observations are similar or different. The statistical test guide in the previous post covers when to use PCA versus when to use a test.

PCA requires continuous variables

PCA works on a correlation or covariance matrix and requires that all variables be continuous and measured on a meaningful numerical scale. Categorical variables (soil texture class, land use type, treatment group) cannot go into a PCA directly. Convert them to binary dummy variables only if you have a specific reason to do so — and be aware that the interpretation becomes more complex.

What PCA Actually Does

PCA rotates your original variable axes to find the directions of maximum variance in the data. The first principal component (PC1) is the axis that explains the most variance. PC2 is perpendicular to PC1 and explains the next most variance. PC3 is perpendicular to both, and so on. Each PC is a linear combination of the original variables.

The mathematical machinery is eigendecomposition of the correlation matrix (or covariance matrix, but correlation is almost always better for environmental data with mixed units and scales). Each PC corresponds to an eigenvector, and the eigenvalue tells you how much variance that component explains. A component with an eigenvalue above 1.0 explains more variance than any single original variable — this is the Kaiser criterion for deciding how many components to retain.

The loadings are the correlations between each original variable and each PC. A high positive loading means the variable contributes strongly and positively to that component. A high negative loading means strong negative contribution. Loadings close to zero mean the variable is essentially unrelated to that component.

The scores are the coordinates of each observation (each soil sample, in our case) in the new PC coordinate system. They tell you where each sample sits relative to the patterns the PCA identified.

The Dataset — Haor Basin Paddy Soils

The Haor basin in greater Sylhet is one of Bangladesh's largest freshwater wetland systems. Paddy cultivation in the basin alternates between an aman rice crop during the monsoon season (waterlogged conditions, June–November) and a dry period (December–May) when fields are partially drained. This seasonal alternation between flooded and aerobic conditions drives dramatic changes in soil redox chemistry — particularly iron (Fe) and manganese (Mn), which become highly soluble under waterlogging.

The dataset below contains twelve soil samples from six sites, sampled once in the pre-monsoon period (January) and once post-monsoon (October). Ten variables were measured: pH, electrical conductivity (EC in dS/m), organic matter (OM in %), total nitrogen (TN in %), available phosphorus (Av-P in mg/kg), exchangeable potassium (Ex-K in cmol/kg), and four micronutrients — iron (Fe), manganese (Mn), zinc (Zn), and copper (Cu), all in mg/kg.

Site Season pH EC OM% TN% Av-P Ex-K Fe Mn Zn Cu
S1 Pre 6.20.242.1 0.098.40.18 19518.21.22.1
S1 Post 5.80.312.6 0.1110.20.22 32834.61.82.8
S2 Pre 6.50.282.4 0.1011.80.21 21020.11.42.3
S2 Post 6.00.383.0 0.1314.60.28 36238.42.03.1
S3 Pre 6.80.221.9 0.087.20.16 18215.81.11.9
S3 Post 6.30.292.3 0.109.80.20 29830.21.62.6
S4 Pre 5.90.353.2 0.1418.40.32 24826.41.93.2
S4 Post 5.50.484.0 0.1822.60.40 41846.82.84.2
S5 Pre 6.60.262.2 0.099.20.19 20419.41.32.2
S5 Post 6.10.342.8 0.1212.40.25 34436.81.92.9
S6 Pre 7.00.201.8 0.086.80.14 17614.61.01.8
S6 Post 6.50.282.2 0.109.40.18 28628.41.52.4

S4 is the high-organic-matter site (rice straw incorporated for several seasons, lower pH from organic acid accumulation). S6 is sandier with lower nutrient status and higher pH. The post-monsoon samples consistently show lower pH, higher EC, higher Fe and Mn, and higher OM and nutrient values — the signature of waterlogging-driven reduction and organic matter decomposition.

01 Prepare and Scale the Data

Enter the data into R and check it before running PCA. Two things matter here: making sure there are no missing values, and deciding whether to scale.

Scaling (standardising to mean = 0, variance = 1) is almost always necessary in environmental science because variables are on different scales and units. Without scaling, Fe (in mg/kg, values 176–418) would dominate the PCA simply because it has the largest numerical values — not because it explains the most meaningful variance. With scaling, each variable contributes based on its relative variability.

R
# Enter the dataset
soil <- data.frame(
  site    = c("S1","S1","S2","S2","S3","S3","S4","S4","S5","S5","S6","S6"),
  season  = c("Pre","Post","Pre","Post","Pre","Post","Pre","Post","Pre","Post","Pre","Post"),
  pH      = c(6.2, 5.8, 6.5, 6.0, 6.8, 6.3, 5.9, 5.5, 6.6, 6.1, 7.0, 6.5),
  EC      = c(0.24,0.31,0.28,0.38,0.22,0.29,0.35,0.48,0.26,0.34,0.20,0.28),
  OM      = c(2.1, 2.6, 2.4, 3.0, 1.9, 2.3, 3.2, 4.0, 2.2, 2.8, 1.8, 2.2),
  TN      = c(0.09,0.11,0.10,0.13,0.08,0.10,0.14,0.18,0.09,0.12,0.08,0.10),
  AvP     = c(8.4, 10.2,11.8,14.6, 7.2, 9.8,18.4,22.6, 9.2,12.4, 6.8, 9.4),
  ExK     = c(0.18,0.22,0.21,0.28,0.16,0.20,0.32,0.40,0.19,0.25,0.14,0.18),
  Fe      = c(195, 328, 210, 362, 182, 298, 248, 418, 204, 344, 176, 286),
  Mn      = c(18.2,34.6,20.1,38.4,15.8,30.2,26.4,46.8,19.4,36.8,14.6,28.4),
  Zn      = c(1.2, 1.8, 1.4, 2.0, 1.1, 1.6, 1.9, 2.8, 1.3, 1.9, 1.0, 1.5),
  Cu      = c(2.1, 2.8, 2.3, 3.1, 1.9, 2.6, 3.2, 4.2, 2.2, 2.9, 1.8, 2.4)
)

# Check for missing values
sum(is.na(soil))  # should return 0

# Select only the numeric variables for PCA
soil_num <- soil[, c("pH","EC","OM","TN","AvP","ExK","Fe","Mn","Zn","Cu")]

# Check correlations — high correlations mean PCA will find structure
round(cor(soil_num), 2)

02 Run PCA with prcomp()

R has two PCA functions: prcomp() and princomp(). Use prcomp(). It uses singular value decomposition (SVD), which is numerically more stable than the eigenvalue decomposition that princomp() uses. Set scale. = TRUE to scale variables automatically — this is equivalent to running PCA on the correlation matrix.

R
# Run PCA — scale. = TRUE standardises variables
pca <- prcomp(soil_num, center = TRUE, scale. = TRUE)

# Summary: standard deviation, proportion of variance, cumulative proportion
summary(pca)
Output
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     3.0323 0.76705 0.43303 0.11347 0.08004 0.06760 0.05622
Proportion of Variance 0.9195 0.05884 0.01875 0.00129 0.00064 0.00046 0.00032
Cumulative Proportion  0.9195 0.97829 0.99704 0.99833 0.99897 0.99943 0.99974
                           PC8     PC9    PC10
Standard deviation     0.04512 0.01881 0.01281
Proportion of Variance 0.00020 0.00004 0.00002
Cumulative Proportion  0.99995 0.99998 1.00000

PC1 alone explains 91.95% of the total variance. PC2 explains another 5.88%. Together, two components capture 97.83% of the information in ten variables. This is an exceptionally clean PCA result — it tells you that the ten soil properties are essentially varying along one dominant axis, with a secondary axis capturing most of the remaining structure.

This makes biological sense. Waterlogging during the post-monsoon season drives Fe, Mn, EC upward and pH downward simultaneously — one process affecting multiple variables at once. That shared driver is what PC1 is capturing.

03 Scree Plot — How Many Components?

A scree plot graphs the eigenvalues (or variance proportions) for each component. You look for an "elbow" — the point where the curve flattens — and retain the components above it. The Kaiser criterion (eigenvalue > 1.0) is the most commonly used cutoff in environmental science.

R
# Scree plot — base R
plot(pca, type = "l", main = "Scree Plot — Haor Basin Soil PCA")
abline(h = 1, lty = 2, col = "red")  # Kaiser criterion line

# Or with ggplot2 for a publication-ready version
library(ggplot2)
var_exp <- pca$sdev^2 / sum(pca$sdev^2) * 100
scree_df <- data.frame(PC = paste0("PC", 1:length(var_exp)), Variance = var_exp)
scree_df$PC <- factor(scree_df$PC, levels = scree_df$PC)

ggplot(scree_df, aes(x = PC, y = Variance, group = 1)) +
  geom_col(fill = "#8b7e5a", alpha = 0.8, width = 0.6) +
  geom_line(colour = "#1e1c18", linewidth = 0.8) +
  geom_point(colour = "#1e1c18", size = 3) +
  geom_hline(yintercept = 10, linetype = "dashed", colour = "red") +
  labs(x = "Principal Component",
       y = "Variance Explained (%)",
       title = "Scree Plot — Haor Basin Paddy Soil PCA") +
  theme_classic(base_size = 13)

ggsave("haor_scree_plot.png", dpi = 300, width = 7, height = 4.5)

In this dataset, the scree plot shows a sharp drop after PC1 (which explains ~86%) and a much smaller contribution from PC2 (~16%). Components PC3 and beyond contribute less than 6% each and fall below the eigenvalue = 1.0 criterion. Retain PC1 and PC2 — they together explain over 97% of the total variance, which is exceptional for environmental data (60–80% cumulative across 2–3 PCs is more typical).

04 Loadings — What Does Each Component Measure?

Loadings are the most important part of interpreting a PCA. They tell you how strongly each original variable contributes to each PC, and in which direction. A loading > 0.6 in absolute value is generally considered high; > 0.4 is moderate.

R
# Loadings (rotation matrix)
round(pca$rotation[, 1:2], 3)

# As a data frame for easier reading
loadings_df <- data.frame(
  Variable = rownames(pca$rotation),
  PC1      = round(pca$rotation[, 1], 3),
  PC2      = round(pca$rotation[, 2], 3)
)
print(loadings_df)
Output — PC1 and PC2 Loadings
    Variable    PC1    PC2
pH        pH  0.300   0.102
EC        EC -0.328  -0.001
OM        OM -0.325   0.211
TN        TN -0.325   0.199
AvP      AvP -0.309   0.439
ExK      ExK -0.319   0.321
Fe        Fe -0.293  -0.586
Mn        Mn -0.303  -0.509
Zn        Zn -0.328  -0.080
Cu        Cu -0.329   0.036

The loading table reveals a clear story. PC1 — the waterlogging / redox axis: almost all variables load highly and negatively on PC1, except pH which loads positively (+0.300). This reflects the seasonal waterlogging effect: under post-monsoon flooding, Fe, Mn, EC, OM, TN, Av-P, and Ex-K all increase while pH drops — so post-monsoon samples score negatively on PC1, and pre-monsoon samples score positively. The sign of a PC is arbitrary in PCA; what matters is that PC1 cleanly separates the two seasons along a single redox-driven gradient.

PC2 — the site fertility axis: Av-P (+0.439), Ex-K (+0.321), OM (+0.211), and TN (+0.199) load positively on PC2, while Fe (−0.586) and Mn (−0.509) load strongly negatively. This separates high-nutrient sites (S4 pre-monsoon: high organic matter accumulation, elevated P and K) from sites where Fe and Mn dominate the chemistry (lower-nutrient, higher-reduction sites). PC2 is a site-level fertility signal independent of season.

Variable PC1 Loading PC2 Loading Interpretation
pH +0.300 +0.10 Only variable to load positively on PC1; decreases under waterlogging
EC −0.328 −0.00 Ions released under reduction; strongly seasonal, minimal PC2 signal
OM −0.325 +0.21 Both seasonal and site fertility drivers
TN −0.325 +0.20 Tightly coupled to OM; site + season signal
Av-P −0.309 +0.44 Strongest positive PC2 loading — key site fertility indicator
Ex-K −0.319 +0.32 Fertility axis; elevated in organic-rich sites
Fe −0.293 −0.586 Strongest negative PC2 loading — pure redox indicator
Mn −0.303 −0.509 Co-varies with Fe; second strongest PC2 loading
Zn −0.328 −0.08 Mainly a seasonal signal; minimal PC2 contribution
Cu −0.329 +0.04 Strong PC1 loading; essentially no PC2 signal

05 Biplot

A biplot overlays the sample scores and the variable loadings on the same plot. Samples that are close together are similar in overall soil chemistry. Variables with arrows pointing in the same direction are positively correlated. Variables with arrows in opposite directions are negatively correlated. The length of an arrow indicates how well that variable is represented by the two displayed components.

R
# Base R biplot (quick check)
biplot(pca, scale = 0,
       xlab = paste0("PC1 (", round(var_exp[1], 1), "% variance)"),
       ylab = paste0("PC2 (", round(var_exp[2], 1), "% variance)"),
       main = "Haor Basin Paddy Soil PCA Biplot")

# Publication-ready biplot using ggplot2 only (no extra packages needed)
scores_plot <- as.data.frame(pca$x[, 1:2])
scores_plot$site   <- soil$site
scores_plot$season <- soil$season

# Scale loadings to overlay on score plot
scale_factor <- 3
loadings_plot <- as.data.frame(pca$rotation[, 1:2] * scale_factor)
loadings_plot$var <- rownames(loadings_plot)

ggplot() +
  # Sample scores
  geom_point(data = scores_plot,
             aes(x = PC1, y = PC2, colour = season, shape = site), size = 3.5) +
  geom_text(data = scores_plot,
            aes(x = PC1, y = PC2,
                label = paste0(site, "-", substr(season, 1, 2))),
            nudge_y = 0.22, size = 3) +
  # Loading arrows
  geom_segment(data = loadings_plot,
               aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.2, "cm")),
               colour = "#8b7e5a", linewidth = 0.7) +
  geom_text(data = loadings_plot,
            aes(x = PC1 * 1.12, y = PC2 * 1.12, label = var),
            colour = "#8b7e5a", size = 3.2, fontface = "bold") +
  scale_colour_manual(values = c("Pre" = "#1e1c18", "Post" = "#b44028")) +
  labs(x = paste0("PC1 (", round(var_exp[1], 1), "% variance)"),
       y = paste0("PC2 (", round(var_exp[2], 1), "% variance)"),
       colour = "Season",
       title = "PCA Biplot — Haor Basin Paddy Soils") +
  theme_classic(base_size = 13) +
  theme(legend.position = "bottom")

ggsave("haor_pca_biplot.png", dpi = 300, width = 7, height = 6)

The biplot from this dataset shows a clear separation between pre-monsoon samples (positive PC1, right side) and post-monsoon samples (negative PC1, left side) along the horizontal axis. Because pH is the only variable with a positive loading on PC1, its arrow points right (pre-monsoon = higher pH), while Fe, Mn, EC, and all nutrient arrows point left (post-monsoon = more waterlogging, higher Fe/Mn/EC). Along the vertical axis (PC2), S4 pre-monsoon sits at the top (high Av-P, Ex-K, OM relative to Fe/Mn) — the fertility signal. S1 post-monsoon sits at the bottom of PC2 (Fe and Mn elevated relative to nutrients). This visual structure confirms what the loadings table told us arithmetically.

06 Extract Scores for Downstream Analysis

PC scores are the coordinates of each observation in the PC space. You can extract them and use them as new variables — for example, as inputs to cluster analysis, as predictors in a regression, or simply to label which samples are most extreme on each axis.

R
# Extract scores
scores <- as.data.frame(pca$x)
scores$site   <- soil$site
scores$season <- soil$season

# View PC1 and PC2 scores with site/season labels
print(scores[, c("site","season","PC1","PC2")])

# Identify the most extreme samples on each axis
cat("Highest PC1 (most 'post-monsoon' character):\n")
scores[which.max(scores$PC1), ]

cat("Highest PC2 (highest fertility):\n")
scores[which.max(scores$PC2), ]

# Use PC scores as input to k-means clustering
set.seed(42)
km <- kmeans(scores[, c("PC1","PC2")], centers = 2)
scores$cluster <- as.factor(km$cluster)
table(scores$season, scores$cluster)
Output — Scores (abbreviated)
   site season        PC1        PC2
1    S1    Pre   2.186   0.240
2    S1   Post  -0.872  -1.055  ← lowest PC2
3    S2    Pre   1.262   0.700
4    S2   Post  -2.526  -0.522
5    S3    Pre   3.348   0.267
6    S3   Post   0.408  -0.771
7    S4    Pre  -2.315   1.568  ← highest PC2
8    S4   Post  -6.936   0.405  ← lowest PC1 (most post-monsoon)
9    S5    Pre   2.046   0.349
10   S5   Post  -1.521  -0.752
11   S6    Pre   3.907   0.271  ← highest PC1 (most pre-monsoon)
12   S6   Post   1.013  -0.699

S4 post-monsoon (row 8) has the most negative PC1 score (−6.94), making it the most waterlogging-affected sample in the dataset. S6 pre-monsoon has the highest PC1 score (+3.91), representing the most aerobic, pre-flood state. On PC2, S4 pre-monsoon scores highest (+1.57) — this site has high OM and nutrients accumulated from organic matter inputs, but measured before the onset of waterlogging so Fe and Mn are still low. S1 post-monsoon scores most negatively on PC2 (−1.06), reflecting Fe and Mn elevated relative to its nutrient status.

The k-means clustering on PC scores mostly separates seasons (5 Pre in one cluster, 4 Post in another), though S4 Pre clusters with the post-monsoon group because its negative PC1 score reflects its characteristically higher OM and nutrients even before flooding. This imperfect separation confirms that site-level differences (PC2) add meaningful variation beyond the dominant seasonal signal.

Writing Up PCA Results

PCA results in a manuscript typically go in the Results section, with a brief description of the method in Materials and Methods. Here is a template based on the Haor basin dataset:

Results template

PCA of ten soil chemical variables across twelve Haor basin paddy soil samples (6 sites × 2 seasons) reduced data dimensionality to two principal components (PCs) that together explained 97.83% of total variance (PC1: 91.95%; PC2: 5.88%). The scree plot indicated a sharp break after PC1, and the first component substantially exceeded the Kaiser eigenvalue criterion (>1.0) while PC2 was retained based on the scree plot elbow.

PC1 (91.95% variance) represented a seasonal waterlogging–redox gradient, characterised by strong negative loadings for EC (−0.328), Zn (−0.328), Cu (−0.329), OM (−0.325), TN (−0.325), and positive loading for pH (+0.300). Post-monsoon samples consistently scored negatively on PC1, reflecting reduction-driven release of Fe²⁺ and Mn²⁺ and associated pH decline, while pre-monsoon samples scored positively. PC2 (5.88% variance) represented a site fertility gradient, with the highest positive loadings for Av-P (+0.439), Ex-K (+0.321), and OM (+0.211), and the strongest negative loadings for Fe (−0.586) and Mn (−0.509). Site S4 pre-monsoon scored highest on PC2, reflecting its history of organic matter inputs and elevated plant-available nutrients independent of seasonal flooding.

Mistakes to Avoid

Mistake 1 — Not scaling the data

Running PCA without scaling when variables have different units (Fe in mg/kg vs pH in log units) produces a PCA driven entirely by whichever variable has the largest numerical values. Always set scale. = TRUE in prcomp() unless you have a specific reason to use the covariance matrix and all variables are in the same units.

Mistake 2 — Retaining too many or too few components

Keeping only PC1 when PC2 explains meaningful variance (say, >10%) discards important structure. Keeping six components when only two exceed eigenvalue 1.0 produces uninterpretable results. Use the scree plot and the Kaiser criterion together — and always check that the cumulative variance of retained components is at least 60–70%.

Mistake 3 — Calling loadings "factors" and scores "loadings"

The terminology in PCA gets confused with factor analysis in some textbooks. In prcomp(): $rotation contains the loadings (contribution of each variable to each PC); $x contains the scores (position of each observation on each PC). If you mix these up, your interpretation will be inverted.

Mistake 4 — Reporting PCA without the variance table

A biplot without a variance explanation table is incomplete. Always report the proportion of variance explained by each retained component, and the cumulative proportion. Readers need to know whether your PC1 explains 86% or 32% of total variance — the interpretation is very different.

Mistake 5 — Treating PCA as a hypothesis test

PCA tells you about structure, not about significance. If you want to test whether two groups (pre-monsoon vs post-monsoon) differ significantly in their multivariate soil chemistry, use MANOVA or PERMANOVA on the PC scores — not PCA alone. PCA is an exploratory first step; the test comes after.

✦ ✦ ✦

PCA is one of the most useful tools in environmental data analysis once you understand what it is actually doing. If you have a multi-variable environmental or agricultural dataset and want PCA run properly — with interpretation, a publication-ready biplot, and a written results section — the data analysis service covers exactly this.

← Statistical Test Guide Multiple Regression →
SR

Sajjadur Rahman

MSc Researcher · Data Analyst · University of Dhaka

NST Fellow and active researcher using R, SPSS, and Python across soil science and environmental experiments. Developer of SPADE. Experienced with PCA, cluster analysis, ANOVA, and multivariate ordination on environmental datasets from Bangladesh. Available for data analysis, biplot interpretation, and manuscript writing.

💬