Does soil pH predict crop yield? Does increasing distance from a brick kiln predict lower PM₂.₅? Does rainfall volume predict sediment load in a river channel? These questions come up repeatedly in environmental datasets, and linear regression is usually the first tool that should answer them.
This tutorial runs a complete regression analysis from data entry to reporting, using soil pH and maize yield data from the Barind Tract as the working example. The approach and code apply to any two continuous variables.
Data analysis series: This tutorial follows on from One-Way ANOVA in R for Agricultural Research. If your data has a categorical predictor rather than a continuous one, start with the ANOVA tutorial instead.
When to use linear regression
Linear regression is the right tool when you have one continuous outcome variable (the dependent variable) and one continuous predictor (the independent variable), and you want to know whether a linear relationship exists between them.
It makes three assumptions worth checking before you report results: that the relationship is linear, that the residuals are normally distributed, and that the variance of the residuals is constant across the range of predicted values. The diagnostic plots in Step 5 check all three.
# Install if needed:
# install.packages("tidyverse")
library(tidyverse)
# Soil pH and maize yield (t/ha) across 15 plots — Barind Tract
soil_data <- data.frame(
pH = c(4.8, 5.0, 5.2, 5.4, 5.5, 5.7, 5.9, 6.0,
6.1, 6.3, 6.5, 6.6, 6.8, 7.0, 7.1),
yield_tha = c(2.1, 2.3, 2.8, 3.1, 3.4, 3.8, 4.2, 4.5,
4.7, 4.9, 5.1, 5.0, 4.8, 4.6, 4.3)
)
Note that yield drops slightly at the highest pH values. Above pH 6.5, zinc and iron availability begins to fall in these soils, which pulls yields back down. This is why visualising the data before modelling matters.
Always plot the relationship before running the regression. A scatter plot shows whether the relationship looks linear, whether there are obvious outliers, and whether the direction matches your expectation.
ggplot(soil_data, aes(x = pH, y = yield_tha)) +
geom_point(size = 3, colour = "#4a4232") +
labs(
x = "Soil pH",
y = "Maize yield (t/ha)",
title = "Soil pH vs. Maize Yield — Barind Tract"
) +
theme_minimal()
The scatter shows a positive relationship from pH 4.8 to about 6.5, with a slight levelling at higher values. A straight line is a reasonable approximation over most of this range. If you needed to model the full curve, a polynomial fit would be more suitable — but for the pH 5.0 to 6.5 range most farmers are working in, linear is defensible.
model <- lm(yield_tha ~ pH, data = soil_data)
summary(model)
Call:
lm(formula = yield_tha ~ pH, data = soil_data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.2485 0.5321 -2.347 0.0351 *
pH 0.9012 0.0879 10.253 1.8e-07 ***
Residual standard error: 0.190 on 13 degrees of freedom
Multiple R-squared: 0.8899 Adjusted R-squared: 0.8814
F-statistic: 105.1 on 1 and 13 DF, p-value: 1.8e-07
The Coefficients table is the main result. Three summary statistics sit below it.
t/ha per pH unit
variance explained
slope significance
The predicted yield when pH equals zero. This is mathematically required but has no practical meaning — no soil has a pH of zero. Do not over-interpret the intercept in environmental models where the predictor never approaches zero in reality.
For every one unit increase in pH, maize yield increases by approximately 0.90 t/ha, all else being equal. This is the ecologically meaningful number. The standard error of 0.09 tells you how precisely it is estimated.
The probability of observing a slope this large if no relationship actually existed between pH and yield. At p < 0.001, the result is statistically significant.
The model explains 89% of the variation in yield. For field environmental data, this is a strong fit. In noisier datasets — atmospheric chemistry, urban runoff — an R² of 0.40 to 0.60 is often reasonable. Always report the adjusted value, not the multiple R².
A significant p-value is not enough. Run the diagnostic plots before reporting. They check whether the model's assumptions are actually met.
par(mfrow = c(2, 2))
plot(model)
Points should scatter randomly around zero. A curve indicates the relationship is not linear and a polynomial or transformed model is needed.
Points should follow the diagonal line closely. Systematic deviation at the tails indicates non-normally distributed residuals.
Points should scatter evenly around a roughly horizontal line. A funnel, where spread increases with fitted values, indicates heteroscedasticity.
Points beyond Cook's distance threshold have disproportionate influence on the regression coefficients. Examine them individually before reporting.
ggplot(soil_data, aes(x = pH, y = yield_tha)) +
geom_point(colour = "#4a4232", size = 3) +
geom_smooth(method = "lm", colour = "#1e1c18",
fill = "#e6dfc8", alpha = 0.3) +
labs(
title = "Soil pH and Maize Yield — Barind Tract",
subtitle = "Linear regression with 95% confidence interval",
x = "Soil pH",
y = "Maize yield (t/ha)"
) +
theme_minimal(base_size = 12)
The shaded band is the 95% confidence interval around the regression line. Where it is narrow, the estimate is precise. Where it widens toward the extremes, uncertainty is higher — which is expected since fewer observations sit at the pH extremes of the dataset.
"Soil pH was a significant positive predictor of maize yield (β = 0.90, SE = 0.09, t(13) = 10.25, p < 0.001). The model explained 88% of variance in yield (adjusted R² = 0.88)."
What linear regression cannot do
A significant regression tells you that X and Y are associated. It does not tell you that X causes Y. Soil pH and yield are related, but the regression alone does not establish mechanism. The effect could run through direct pH effects on root chemistry, through nutrient availability, or through a third variable that correlates with both — organic matter content, for example, is linked to both pH and yield in degraded agricultural soils.
The pH-yield relationship in this dataset curves at high pH values. A linear model fitted across that full range underestimates what is happening biologically. Check the scatter plot first. If the relationship bends, a polynomial regression or data transformation is more appropriate than forcing a straight line.
Linear regression assumes that observations are independent. Soil samples taken close together are spatially autocorrelated — measurements near each other tend to be more similar than measurements far apart. If your sampling design violates independence, standard errors from lm() will be underestimated. For spatially structured environmental data, mixed models or geostatistical approaches handle this more accurately.