Predictive Tools for Crop Yields Using Econometric and Machine Learning Approaches

In this work, a machine learning–based methodology was developed to estimate expected crop yields and assess the economic impact of extreme weather events on farms.

The predictive models combine three main dimensions:

  • Structural and economic characteristics of the farm,
  • Planting and crop management practices,
  • Climatic conditions over the production cycle.

The work was carried out in collaboration with the University of Padua and the University of Tuscia.

 

Dataset

The dataset combines farm economic and structural information with climatic and meteorological indicators over the 2008–2022 period.

The main components are:

  • FADN (Farm Accountancy Data Network) data (2008–2022): farm productivity and characteristics (crop type, region/agricultural area, farm size).

  • Quantitative variables (C_quant): cultivated and irrigated area, labour and machinery hours, water use over the total area and per hectare, and the average number of irrigation days.

  • Qualitative variables (C_cat): crop species, cultivation method, crop succession, and the presence of intercropping practices.

  • Climate data and extreme event indicators (C_seq): daily, monthly and yearly time series of evapotranspiration, precipitation, and maximum and minimum temperatures, later harmonised at daily resolution and aligned with individual crops.

The study focuses on six representative crops: durum wheat, soft wheat, hybrid corn, soybean, quality wine grape and common wine grape.

The target variable is the Yield Index (Y_class), defined as the ratio between harvested quantity and cultivated area, providing a normalised productivity measure that is comparable across crops, regions and farm sizes.
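As an illustration of the definition above, the following minimal sketch computes the Yield Index and bins it into three classes. The farm figures are invented, and the use of sample terciles as cut points is an assumption made here for illustration; the source does not state how the class boundaries were chosen.

```python
from statistics import quantiles


def yield_index(harvested_quantity, cultivated_area):
    """Yield Index: harvested quantity divided by cultivated area."""
    if cultivated_area <= 0:
        raise ValueError("cultivated area must be positive")
    return harvested_quantity / cultivated_area


def tercile_classes(values):
    """Label each value low/medium/high using sample terciles (assumed cut points)."""
    lower, upper = quantiles(values, n=3)  # the two tercile boundaries
    labels = []
    for v in values:
        if v <= lower:
            labels.append("low")
        elif v <= upper:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Hypothetical farms: (harvested tonnes, cultivated hectares)
farms = [(30.0, 10.0), (45.0, 10.0), (12.0, 3.0),
         (70.0, 10.0), (11.0, 2.0), (60.0, 12.0)]
indices = [yield_index(q, a) for q, a in farms]
classes = tercile_classes(indices)
```

In practice the cut points could be computed separately for each crop, so that a "high" class means high relative to that crop.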

Methodology

The methodological goal is to design deep learning models capable of predicting expected yields and capturing the effect of extreme weather events by combining structural, management and climate information. 

The approach integrates:

  • Neural networks as predictors: multilayer models able to learn non-linear relationships between a large set of explanatory variables and agricultural yield.
  • Autoencoders for climate series: neural networks that compress weather sequences (C_seq) into compact embeddings, reducing noise and highlighting patterns associated with regular conditions or extreme events.

The pipeline extracts climate embeddings through autoencoders and integrates them with structural and economic variables in the yield prediction model.
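The integration step can be sketched as a simple feature concatenation. This is a hedged illustration, not the project's code: the function names, vocabularies and example values are hypothetical, and the climate embedding is a hard-coded placeholder standing in for the output of a trained autoencoder.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def assemble_features(c_quant, c_cat, vocabularies, climate_embedding):
    """Concatenate quantitative, one-hot categorical and climate-embedding features."""
    features = list(c_quant)
    for name, vocabulary in vocabularies.items():
        features.extend(one_hot(c_cat[name], vocabulary))
    features.extend(climate_embedding)
    return features


# Hypothetical vocabularies for two of the qualitative variables (C_cat)
vocabularies = {
    "species": ["durum wheat", "soft wheat", "hybrid corn"],
    "intercropping": ["no", "yes"],
}
c_quant = [12.5, 4.0]  # e.g. cultivated area, irrigated area (invented values)
c_cat = {"species": "hybrid corn", "intercropping": "no"}
embedding = [0.1, -0.3, 0.7]  # placeholder for the autoencoder's output

x = assemble_features(c_quant, c_cat, vocabularies, embedding)
```

The resulting vector `x` is what a downstream yield predictor would receive as input; in a real pipeline the quantitative features would normally be scaled first.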

Two output directions are explored:

  • Classification of the Yield Index into discrete classes (low, medium, high),
  • Regression of the continuous value of the Yield Index.

 

Experiments

The experimental analysis evaluates different data configurations, architectures and modelling strategies.

Dataset configurations:

  • Monthly meteorological series combined with categorical and quantitative farm variables.

  • Daily meteorological series enriched with information on irrigation volumes and water use.

Tested architectures:

  • Autoencoder-based models to encode weather sequences over time.

  • Regression models trained on “flattened” sequences without autoencoding.

  • Architectures that include gradient feedback from the decoder to the encoder.

  • LSTM autoencoders and Transformer-based encoders, compared on their handling of time series.

Preliminary experiments show that models trained on monthly sequences are less stable and produce larger residuals, especially for crops with fewer observations (e.g. corn and soybean). Using daily sequences, together with more advanced encoders and the filtering of extreme yield values, markedly improves the stability and accuracy of predictions.
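The source does not say how extreme yield values were filtered. One common option, shown here purely as an assumption, is the interquartile-range (IQR) rule, which drops values lying more than k times the IQR outside the first and third quartiles. The yield values below are invented.

```python
from statistics import quantiles


def filter_extreme_yields(yields, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (IQR rule, assumed here)."""
    q1, _, q3 = quantiles(yields, n=4)  # quartile boundaries
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [y for y in yields if low <= y <= high]


# Hypothetical Yield Index values for one crop; 19.0 is an implausible outlier
yields = [3.1, 3.4, 3.6, 3.8, 4.0, 4.2, 4.5, 19.0]
kept = filter_extreme_yields(yields)
```

A filter of this kind would presumably be applied per crop, since yield scales differ widely between, say, wheat and wine grapes.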

Results

The final phase introduces Conditional Variational Autoencoders (CVAEs) that integrate daily weather data, water use indicators, crop management practices and geographical information.
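For reference, a CVAE is typically trained by maximising a conditional evidence lower bound. The form below is the standard textbook objective and not necessarily the exact loss used in this work. The mapping of symbols is an assumption: x stands for the input being modelled (for example the daily weather sequence), c for the conditioning information (such as crop, management practices and geography), and z for the latent variable.

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, c)}\!\left[ \log p_\theta(x \mid z, c) \right]
\;-\; \mathrm{KL}\!\left( q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c) \right)
```

The first term rewards accurate reconstruction given the condition, while the KL term keeps the approximate posterior close to the conditional prior.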


Four strategies are evaluated:

  • M1 – Per-species classification: a dedicated model for each crop that assigns the Yield Index to one of the yield classes.

  • M2 – Multi-species regression with CVAE: a single model to estimate continuous yield values across all crops.

  • M3 – Multi-species classification with CVAE: a shared model that classifies yields of different crops into three classes.

  • M4 – Multi-species classification with geographical context: an extension of M3 with an explicit encoding of territorial information.

The results show that the classification models effectively distinguish three yield levels (low, medium, high), while CVAE-based regression provides stable continuous predictions of the Yield Index. Adding the geographical component in M4 further improves accuracy and robustness, making it the best-performing of the four strategies.