DMINE Methodology – Step 6: Climate Algorithm Development

In Step 6 we refine our initial feature selection process through the development of a climate extraction algorithm.  This algorithm is an iterative work in progress: the current model manually implements climate selection based on known scientific relationships.  We are progressing towards a machine learning approach that will evaluate a range of climate variable combinations as they relate to agricultural commodity loss.


Wheat as an Example Algorithm Commodity

In order to develop a workable algorithm to extract climate data, we need to choose an example commodity.  For our purposes, we are using wheat across the Palouse region.

We first need to get a sense of how wheat phenology is broken down into stages. Below are the three main scales for describing wheat growth:

  • The Haun scale keys its growth stages on the rate of development of the main shoot. (Haun, 1973)
  • The Feekes scale recognizes eleven major growth stages, starting with seedling emergence and ending with grain ripening.  The Feekes scale is frequently used to identify optimum stages for chemical treatments, such as fungicide applications. (Large, 1954)
  • The Zadoks scale provides the most complete description of wheat plant growth stages.  It uses codes based on ten major stages that can be subdivided, making it particularly suited for computerization. (Zadoks, 1974)

Given that we are examining wheat, the number of growing degree days will give us a reasonable idea of how long the growing season is for a particular spring or winter seasonal time frame, and what stage a crop may have been in when a claim was filed.
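
As a rough illustration, growing degree days (GDD) can be accumulated from daily maximum and minimum temperatures.  The sketch below is written in Python for brevity and is not the project's own code.  The 0 °C base temperature is a commonly used value for wheat but is an assumption here, as are the function names and the sample temperatures.

```python
def daily_gdd(tmax_c, tmin_c, base_c=0.0):
    """Growing degree days for one day: mean temperature above the base, floored at zero."""
    mean_c = (tmax_c + tmin_c) / 2.0
    return max(mean_c - base_c, 0.0)


def accumulated_gdd(daily_temps, base_c=0.0):
    """Running GDD total over a sequence of (tmax_c, tmin_c) pairs."""
    total = 0.0
    running = []
    for tmax_c, tmin_c in daily_temps:
        total += daily_gdd(tmax_c, tmin_c, base_c)
        running.append(total)
    return running


# Made-up daily (tmax, tmin) values in degrees Celsius, for illustration only.
sample = [(10.0, 2.0), (4.0, -6.0), (15.0, 5.0)]
print(accumulated_gdd(sample))  # [6.0, 6.0, 16.0]
```

The running total could then be compared against published GDD thresholds for a growth scale to estimate the stage a crop had reached at the time of a claim; those thresholds are not given here.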



Zadoks cereal growth stage scale (Zadoks, 1974)



Wheat Algorithm Development

OK, so let's make some decisions:

  • Reduce our spatial area.  We have already done this.  Through our initial EDA, we determined that the 26-county Palouse region within Oregon, Washington, and Idaho is a homogeneous landscape that provides a good dryland wheat study area for crop loss analysis.
  • Let's organize our data by year.  Since crop loss is commodity-based, we want to group our monthly observations by season (Spring and Winter).  To simplify, we will aggregate all claims by year (still factored by county).
  • Let's focus only on wheat as a commodity.  Wheat is the most economically influential commodity in the Pacific Northwest, and has the largest crop loss in terms of both dollars and claim frequency.
  • Let's focus only on drought-associated claims.  While we could look at all damage causes, let's look only at drought for now.
  • Let's use climate data only for locations where wheat was grown.  Using the USDA’s annual cropland data layer (CDL), we can eliminate climate data from locations where wheat was not grown that year.
  • Let's further refine our data by choosing only climate data from October-June, and only claims data from March-October.
    • We do this because we are really interested in claims from March-October, which is the time frame when farmers are most affected by drought.
    • We choose climate data from October-June because that is the long-term water recharge cycle, when soil columns are slowly filling up with water in preparation for use in low-rainfall months.
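
The commodity, damage cause, and month-window decisions above can be sketched as simple filters.  This is a minimal Python illustration, not the project's implementation: the record layout, the field names (`county`, `month`, `commodity`, `cause`, `pr`), and the sample records are all assumptions made for the example.

```python
# Month windows taken from the decisions listed above.
CLIMATE_MONTHS = {10, 11, 12, 1, 2, 3, 4, 5, 6}  # October through June
CLAIM_MONTHS = {3, 4, 5, 6, 7, 8, 9, 10}         # March through October


def select_claims(claims, commodity="WHEAT", cause="Drought"):
    """Keep claims for one commodity and damage cause filed in the claim window."""
    return [
        c for c in claims
        if c["commodity"] == commodity
        and c["cause"] == cause
        and c["month"] in CLAIM_MONTHS
    ]


def select_climate(rows):
    """Keep monthly climate rows that fall in the October-June recharge window."""
    return [r for r in rows if r["month"] in CLIMATE_MONTHS]


# Made-up records, for illustration only.
claims = [
    {"county": "Whitman", "year": 2009, "month": 7, "commodity": "WHEAT", "cause": "Drought"},
    {"county": "Whitman", "year": 2009, "month": 1, "commodity": "WHEAT", "cause": "Drought"},
    {"county": "Latah", "year": 2009, "month": 6, "commodity": "WHEAT", "cause": "Hail"},
]
climate = [
    {"county": "Whitman", "year": 2008, "month": 11, "pr": 55.0},
    {"county": "Whitman", "year": 2009, "month": 8, "pr": 9.0},
]

print(select_claims(claims))    # only the July drought claim remains
print(select_climate(climate))  # only the November row remains
```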

Algorithm Steps

Given the above assumptions, we want to define our algorithm to extract and subset climate data in this manner.  If we describe our process linearly, and as specifically as possible, we might say:

For each insurance claim record, which is categorized by month, year, state, county, commodity (wheat), and damage cause (drought) – do the following:

  1. Extract commodity-specific map data by year for each county from the annual cropland data layer GIS raster map (USDA 2007-present).  This results in one annual commodity map for each year in the analysis period.
  2. Use this annual commodity map to extract daily climate data for the analysis period, for only the locations where the commodity was grown in that year.
  3. Summarize this geographically filtered climate data by month and county, and export to a tabular matrix.
  4. Extract crop insurance data for the analysis period and export to a tabular matrix.
  5. Construct a vector of climate data for EACH crop insurance claim record that appropriately reflects that claim's climate relationship.  For example, for a claim in Whitman County, in July 2009, for WHEAT:
    1. Take the monthly date of the crop insurance claim.  Determine if the claim is a SPRING claim or a WINTER claim.  Summarize all current-season climate observations for all days before the claim, back to the beginning of the current season. Weight these observations as x.  Summarize all previous-season climate observations for all days before the claim, back to the beginning of the previous season. Weight these observations as y.
    2. Summarize all weighted observations as a mean, for each climate variable.  That vector is associated with the individual insurance claim in question.
    3. Perform this process for each insurance claim.
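
Step 5 can be sketched as a weighted mean over two groups of climate observations.  The Python below is one possible reading, not the project's code: the text does not give values for the weights x and y or say how the mean is normalized, so dividing by the total weight is an assumption, as are the function name, the variable names `pr` and `pet`, and the sample values.

```python
def weighted_climate_vector(current_season, previous_season, x, y):
    """Weighted mean of each climate variable across two groups of observations.

    current_season, previous_season: lists of dicts mapping variable name -> value.
    Every observation is assumed to carry the same set of variables.
    x, y: weights applied to current-season and previous-season observations.
    """
    weighted = [(x, obs) for obs in current_season]
    weighted += [(y, obs) for obs in previous_season]
    total_weight = sum(w for w, _ in weighted)
    if total_weight == 0:
        raise ValueError("no weighted observations to summarize")
    variables = sorted({name for _, obs in weighted for name in obs})
    return {
        name: sum(w * obs[name] for w, obs in weighted) / total_weight
        for name in variables
    }


# Made-up observations for a single claim, for illustration only.
current = [{"pr": 10.0, "pet": 30.0}, {"pr": 20.0, "pet": 50.0}]
previous = [{"pr": 40.0, "pet": 20.0}]

vector = weighted_climate_vector(current, previous, x=2.0, y=1.0)
print(vector)  # {'pet': 36.0, 'pr': 20.0}
```

Repeating this call for every claim record, with the observation lists chosen according to whether the claim is a SPRING or WINTER claim, would yield one climate vector per claim.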

As a result of the above algorithm, a matrix of data was generated for the Palouse region for 2007-2015.

Updated data matrix after algorithm construction

Refined Exploratory Data Analysis

Research Notebook 1: Refined EDA

As noted previously, Exploratory Data Analysis (EDA) is a critical component of data mining and analysis, and not just once.  In this example, we return to EDA and perform another round of analysis on our refined dataset.  (Remember, in Step 2 we initially examined all crop claims for drought and wheat in general.  We also initially examined climate data from the same time frame in which the insurance claim was made.)

For this refined EDA, we have used R to perform a set of analyses on our 2007-2015 crop claim/climate data, including basic correlational relationships, regression analysis, analysis of variance, and collinearity.  You may review the analysis below or by clicking on the Research Notebook 1 image above.

Research Notebook 1: Exploratory Agricultural Data Analysis and Regression of Palouse Region, focusing on wheat and drought insurance claims
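
For readers who want a feel for the basic correlation check, here is a minimal Pearson correlation sketch.  The notebook itself was written in R; this Python version is only an illustration, and the precipitation and loss values are invented for the example and are not results from the analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values, for illustration only: loss falls as precipitation rises.
precipitation = [1.0, 2.0, 3.0, 4.0]
crop_loss = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(precipitation, crop_loss))  # -1.0
```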

Step 6 Conclusions

From our refined EDA, we see some expected significant relationships between crop loss and precipitation, as well as the Palmer Drought Severity Index (PDSI) and precipitation minus potential evapotranspiration (pr – pet).  These correlations bolster our view that the refined dataset and algorithm are doing a better job of choosing climate data for predictive crop loss efforts.  With this refined dataset, we move to Step 4 and model construction.