Crop Yield Prediction with an LSTM Autoencoder Architecture on Historical Climate and Crop Management Data

This work builds a crop yield prediction model that takes weather information into account. The model takes as input the cultivation features of the current year (e.g., the total cultivated area, the irrigation percentage and class, and the cultivated species) together with the sequence of crop features, yields and weather from a number of previous years, and predicts the crop yield of the current year.

The importance of the model lies in its ability to measure the effects of climate change, encoded as weather information along a time series of years, on the production of specific cultivated species (durum wheat, common wheat, soy, hybrid corn, common wine grape, quality wine grape).

Combined with a suitable climate model capable of predicting future weather, this model could be used to estimate the impact of such changes on future production.

The work was carried out in collaboration with the University of Padua and the University of Tuscia.

Dataset

The input dataset contains yield data from farms, aggregated by agricultural region. Each region includes a varying number of farms that frequently enter and exit the monitoring process over time, which required in-depth investigation and several filtering and preprocessing steps to ensure data consistency and reliability.

Starting from the raw data, time series were built for each region–crop species combination, integrating historical yields, agronomic characteristics (such as cultivated areas and irrigation indicators), and monthly weather variables over 12 months for each year.
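As an illustration of this step, the following minimal sketch groups yearly records by region and crop species and slices them into windows of previous years. The record keys, the window length of 3 and the flat 12-value weather list are assumptions made for the example, not details taken from the original pipeline.

```python
from collections import defaultdict


def build_samples(records, window=3):
    """Build one sample per (region, species, year) from yearly records.

    Each record is a dict with hypothetical keys: region, species, year,
    yield, area, irrigation, weather (a list of 12 monthly values).
    """
    # Group the records into one time series per region-crop combination.
    series = defaultdict(list)
    for rec in records:
        series[(rec["region"], rec["species"])].append(rec)

    samples = []
    for key, rows in series.items():
        rows.sort(key=lambda r: r["year"])
        for i in range(window, len(rows)):
            past, current = rows[i - window:i], rows[i]
            # Keep only windows of consecutive years (no gaps in monitoring).
            years = [r["year"] for r in past] + [current["year"]]
            if years != list(range(years[0], years[0] + window + 1)):
                continue
            samples.append({
                "key": key,
                "year": current["year"],
                # Previous years: weather, yield and agronomic features.
                "c_seq": [r["weather"] + [r["yield"], r["area"], r["irrigation"]]
                          for r in past],
                # Current year: weather and agronomic features, without the yield.
                "c_quant": current["weather"] + [current["area"], current["irrigation"]],
                "target": current["yield"],
            })
    return samples
```

Dropping windows with missing years is one possible way to handle regions whose data are not continuous; the actual filtering rules may differ.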

These enriched time series were then used to define the training and test sets on which the model was trained and validated by analysing the regression residuals.

Methodology

To model the time series and the features, an LSTM-based model was designed.

The following figure shows the architecture of the proposed model:

The model takes the following parameters as inputs:

C_seq: the sequences of previous years, containing all weather variables (12 months per timestep), the yield and the agricultural features, such as the cultivated area and the mean irrigation percentage;
C_quant: all numerical features of the current year, i.e. the year whose yield is to be predicted. These features include the 12 monthly weather variables, the cultivated area sum and the irrigation percentage mean;
C_cat: all categorical features, i.e. the irrigation class and the cultivated species.

All numerical features, i.e. C_seq and C_quant, are passed to Normalization layers adapted on the training set.
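The idea of adapting the normalization on the training set only can be sketched in plain Python as follows. The helper names are hypothetical and the code does not reproduce any specific library layer; it only shows that the statistics come from the training rows and are then reused unchanged on other data.

```python
from statistics import fmean, pstdev


def fit_normalizer(train_rows):
    """Compute per-feature mean and standard deviation on the training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Constant features get a unit scale to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(rows, means, stds):
    """Apply previously fitted statistics to any set of rows (train or test)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

Fitting on the training set alone keeps information from the test years out of the preprocessing.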

The time series are modelled with an LSTM autoencoder, and a stack of fully-connected layers with ReLU activations and dropout regresses the yield.

To regress the yield, the model concatenates the categorical features and the normalized numerical features with the latent representation (L) of the time series learned by the LSTM autoencoder.

The outputs of the model are:

C_seq’: the output of the LSTM autoencoder, i.e. the reconstructed C_seq input;

Y: the regressed yield value.
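The data flow described above can be summarised compactly as follows. The symbols Enc, Dec, Norm and f are notation introduced here for the LSTM encoder, the LSTM decoder, the normalization layers and the fully-connected regression stack; how C_cat is encoded before concatenation is not specified in the description and is left abstract.

```latex
\begin{aligned}
L        &= \mathrm{Enc}\big(\mathrm{Norm}(C_{seq})\big) \\
C_{seq}' &= \mathrm{Dec}(L) \\
Y        &= f\big(\,[\,L \,;\, \mathrm{Norm}(C_{quant}) \,;\, C_{cat}\,]\,\big)
\end{aligned}
```

Here [ · ; · ; · ] denotes concatenation of the latent vector with the current-year features.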

For both outputs, the Mean Squared Error (MSE) is used as the loss function.
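A minimal sketch of such a two-output objective is shown below. The way the two MSE terms are combined is an assumption: the weights w_rec and w_yield are hypothetical parameters, set to 1.0 here, and are not stated in the description.

```python
def mse(pred, target):
    """Mean squared error over two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def flatten(seq):
    """Flatten a sequence of timesteps into a single list of values."""
    return [x for step in seq for x in step]


def total_loss(c_seq, c_seq_rec, y_true, y_pred, w_rec=1.0, w_yield=1.0):
    """Weighted sum of the reconstruction MSE and the yield-regression MSE."""
    reconstruction = mse(flatten(c_seq_rec), flatten(c_seq))
    regression = mse(y_pred, y_true)
    return w_rec * reconstruction + w_yield * regression
```

With both weights equal, the autoencoder and the regressor contribute equally to the training signal; changing the ratio would shift the emphasis between reconstruction and prediction.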

Experiments

The experiments were designed starting from the construction of historical time series and the definition of training and test sets, explicitly accounting for the variability in the number of farms per region and their dynamic entry and exit from the monitoring process. After the preprocessing and aggregation steps, the LSTM‑based model was trained on a dedicated training set and evaluated on a separate validation/test set, tracking the behaviour of Mean Squared Error (MSE) and Mean Absolute Error (MAE) for both the autoencoder and the yield regressor over time.

The reported training curves show the evolution of MSE and MAE during training; they indicate that the model does not overfit and maintains stable performance across species.

Particular attention was devoted to the analysis of regression residuals by crop species, in order to detect potential systematic biases and performance differences among crops.
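A per-species residual summary of this kind can be computed as in the sketch below. It assumes that a residual is defined as the actual yield minus the predicted yield and that the population variance is used; both choices are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def residual_stats(species, y_true, y_pred):
    """Return {species: (mean residual, residual variance)}."""
    groups = defaultdict(list)
    for name, actual, predicted in zip(species, y_true, y_pred):
        # Residual convention assumed here: actual minus predicted.
        groups[name].append(actual - predicted)
    return {name: (fmean(res), pvariance(res)) for name, res in groups.items()}
```

A mean residual far from zero for one species would point to a systematic over- or under-estimation for that crop, while a large variance would point to less reliable predictions.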

Results

The trained model does not show signs of overfitting, as confirmed by the training histories, where MSE and MAE evolve consistently on both the training and validation sets for the autoencoder and the yield regressor.

The table reporting the mean and variance of prediction residuals by crop species shows very low values, indicating a good predictive capability for common wheat, durum wheat, soy, hybrid corn, common wine grape and quality wine grape.

The “Actual vs Predicted” plots highlight a strong alignment between observed and estimated yields and suggest that the distribution of points is influenced by the cultivated species, confirming that the model captures the specific production patterns of each crop.