Does soil pH predict crop yield? Does increasing distance from a brick kiln predict lower PM₂.₅? Does rainfall volume predict sediment load in a river channel? These questions come up repeatedly in environmental datasets, and linear regression is usually the first tool that should answer them.

This tutorial runs a complete regression analysis from data entry to reporting, using soil pH and maize yield data from the Barind Tract as the working example. The approach and code apply to any two continuous variables.

📊

Data analysis series: This tutorial follows on from One-Way ANOVA in R for Agricultural Research. If your data has a categorical predictor rather than a continuous one, start with the ANOVA tutorial instead.

When to use linear regression

Linear regression is the right tool when you have one continuous outcome variable (the dependent variable) and one continuous predictor (the independent variable), and you want to know whether a linear relationship exists between them.

It makes three assumptions worth checking before you report results: that the relationship is linear, that the residuals are normally distributed, and that the variance of the residuals is constant across the range of predicted values. The diagnostic plots in Step 5 check all three.

1 Load packages and enter your data
R
# Install if needed:
# install.packages("tidyverse")

library(tidyverse)

# Soil pH and maize yield (t/ha) across 15 plots — Barind Tract
soil_data <- data.frame(
  pH        = c(4.8, 5.0, 5.2, 5.4, 5.5, 5.7, 5.9, 6.0,
               6.1, 6.3, 6.5, 6.6, 6.8, 7.0, 7.1),
  yield_tha = c(2.1, 2.3, 2.8, 3.1, 3.4, 3.8, 4.2, 4.5,
               4.7, 4.9, 5.1, 5.0, 4.8, 4.6, 4.3)
)

Note that yield drops slightly at the highest pH values. Above pH 6.5, zinc and iron availability begins to fall in these soils, which pulls yields back down. This is why visualising the data before modelling matters.

2 Visualise before modelling

Always plot the relationship before running the regression. A scatter plot shows whether the relationship looks linear, whether there are obvious outliers, and whether the direction matches your expectation.

R
ggplot(soil_data, aes(x = pH, y = yield_tha)) +
  geom_point(size = 3, colour = "#4a4232") +
  labs(
    x     = "Soil pH",
    y     = "Maize yield (t/ha)",
    title = "Soil pH vs. Maize Yield — Barind Tract"
  ) +
  theme_minimal()

The scatter shows a positive relationship from pH 4.8 to about 6.5, with a slight levelling at higher values. A straight line is a reasonable approximation over most of this range. If you needed to model the full curve, a polynomial fit would be more suitable — but for the pH 5.0 to 6.5 range most farmers are working in, linear is defensible.

3 Fit the model
R
model <- lm(yield_tha ~ pH, data = soil_data)
summary(model)
R output
Call:
lm(formula = yield_tha ~ pH, data = soil_data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.2485     0.5321  -2.347   0.0351 *
pH            0.9012     0.0879  10.253  1.8e-07 ***

Residual standard error: 0.190 on 13 degrees of freedom
Multiple R-squared:  0.8899    Adjusted R-squared: 0.8814
F-statistic: 105.1 on 1 and 13 DF,  p-value: 1.8e-07
4 Read the output

The Coefficients table is the main result. Three summary statistics sit below it.

0.90 Slope (β)
t/ha per pH unit
0.88 Adjusted R²
variance explained
<0.001 p-value
slope significance
Intercept: −1.2485

The predicted yield when pH equals zero. This is mathematically required but has no practical meaning — no soil has a pH of zero. Do not over-interpret the intercept in environmental models where the predictor never approaches zero in reality.

pH slope: 0.9012

For every one unit increase in pH, maize yield increases by approximately 0.90 t/ha, all else being equal. This is the ecologically meaningful number. The standard error of 0.09 tells you how precisely it is estimated.

p-value for pH: 1.8e-07

The probability of observing a slope this large if no relationship actually existed between pH and yield. At p < 0.001, the result is statistically significant.

R-squared: 0.89 (adjusted: 0.88)

The model explains 89% of the variation in yield. For field environmental data, this is a strong fit. In noisier datasets — atmospheric chemistry, urban runoff — an R² of 0.40 to 0.60 is often reasonable. Always report the adjusted value, not the multiple R².

5 Check model diagnostics

A significant p-value is not enough. Run the diagnostic plots before reporting. They check whether the model's assumptions are actually met.

R
par(mfrow = c(2, 2))
plot(model)
Residuals vs Fitted
Look for: no curve

Points should scatter randomly around zero. A curve indicates the relationship is not linear and a polynomial or transformed model is needed.

Normal Q-Q
Look for: points on the line

Points should follow the diagonal line closely. Systematic deviation at the tails indicates non-normally distributed residuals.

Scale-Location
Watch for: funnel shape

Points should scatter evenly around a roughly horizontal line. A funnel, where spread increases with fitted values, indicates heteroscedasticity.

Residuals vs Leverage
Watch for: points beyond dashed lines

Points beyond Cook's distance threshold have disproportionate influence on the regression coefficients. Examine them individually before reporting.

Rule of thumb If the diagnostic plots look clean, the regression output is reliable. If they show a curve in Residuals vs Fitted or a severe funnel in Scale-Location, address those problems before publishing the result. Reporting a statistically significant regression built on violated assumptions is a common peer review failure mode.
6 Visualise the regression line
R
ggplot(soil_data, aes(x = pH, y = yield_tha)) +
  geom_point(colour = "#4a4232", size = 3) +
  geom_smooth(method = "lm", colour = "#1e1c18",
              fill = "#e6dfc8", alpha = 0.3) +
  labs(
    title    = "Soil pH and Maize Yield — Barind Tract",
    subtitle = "Linear regression with 95% confidence interval",
    x        = "Soil pH",
    y        = "Maize yield (t/ha)"
  ) +
  theme_minimal(base_size = 12)

The shaded band is the 95% confidence interval around the regression line. Where it is narrow, the estimate is precise. Where it widens toward the extremes, uncertainty is higher — which is expected since fewer observations sit at the pH extremes of the dataset.

7 Report the result
How to write it in a manuscript or thesis

"Soil pH was a significant positive predictor of maize yield (β = 0.90, SE = 0.09, t(13) = 10.25, p < 0.001). The model explained 88% of variance in yield (adjusted R² = 0.88)."

Report adjusted R², not multiple R². The adjusted value accounts for the number of predictors and is the correct figure for simple and multiple regression results.

What linear regression cannot do

Establish causation

A significant regression tells you that X and Y are associated. It does not tell you that X causes Y. Soil pH and yield are related, but the regression alone does not establish mechanism. The effect could run through direct pH effects on root chemistry, through nutrient availability, or through a third variable that correlates with both — organic matter content, for example, is linked to both pH and yield in degraded agricultural soils.

Handle non-linear relationships

The pH-yield relationship in this dataset curves at high pH values. A linear model fitted across that full range underestimates what is happening biologically. Check the scatter plot first. If the relationship bends, a polynomial regression or data transformation is more appropriate than forcing a straight line.

Work with spatially autocorrelated samples

Linear regression assumes that observations are independent. Soil samples taken close together are spatially autocorrelated — measurements near each other tend to be more similar than measurements far apart. If your sampling design violates independence, standard errors from lm() will be underestimated. For spatially structured environmental data, mixed models or geostatistical approaches handle this more accurately.

All Posts Read the ANOVA Tutorial