Methodology Step 1: General Hypotheses

Agricultural systems are an essential and growing aspect of our society. Agriculture is going through a data-centric revolution (Smith and Katz, 2013), and the methods and analysis approaches behind it are also becoming much more complex (Gray, 2006; Bell, 2009; Clarke, 2009; Maimon and Rokach, 2010). Precision agriculture systems, cloud-based data assembly for farmers, and machine learning algorithms for predictive analytics are all examples of areas that are pushing efficient agricultural systems forward. This research builds upon this data growth through the development of a modular data mining and machine learning methodology, initially focused on agricultural systems. The proposed methodology will be applied to 1) irrigated and 2) dryland agricultural systems in the Pacific Northwest region, stepping through the processes of data assembly and geographic characterization, feature transformation and engineering, classifier/regressor selection, optimization, tuning, and finally, incorporation into a custom application programming interface (API). Each model and API will use climate outcomes to predict agricultural crop loss, estimating how the influence of these conditional relationships changes over time (e.g., how influential is drought on crop loss for a particular county, and does that influence change into the future?). Finally, the API, models, and analytics will be integrated into a technology platform for access by land managers, farmers, or scientists, with the added capability of extending the methodology to other climate impact areas, such as health or land subsidence.

The research question focuses on agricultural systems and the impacts they face under a changing climate. In this methodology example, we explore crop insurance loss claims submitted to the USDA, and how these data might be associated with climate data for a related time period.

In terms of a hypothesis, we can frame our statement as follows:

“There is a predictive relationship between historical climate variables across the inland Pacific Northwest and agricultural crop loss insurance claims.”


Figure: The four-state region of Washington, Oregon, Idaho, and western Montana, where the example analysis of agricultural commodity loss is focused.

Agricultural systems and their products are essential components of our society. In 2014, the U.S. agricultural sector created a gross output of more than 835 billion dollars and had an employee base of approximately 750,000 people. Across roughly 2 million U.S. farms, with an average size of about 435 acres, total grain production alone was $436 million (USDA Economic Research Service, 2014). For the state of Washington, agricultural production value exceeded $10 billion, with over 160,000 jobs, making up 13 to 15 percent of the state’s economy each year (WSDA drought report, 2015). Similarly, the Washington forestry support industries generated over $1.8B in total economic impacts across the same time frame.

This hypothesis motivates the development of a spatially explicit machine learning model to better understand the relationships between climate outcomes and agricultural commodity loss. The USDA’s archive of agricultural commodity loss (1980-present), with associated climate damage causes, will be used as the data source for our initial predictor and as an estimate of agricultural climate impact.

Data Acquisition

Data acquisition, transformation, and integration will be an important step for the proposed model development.  Several key datasets have been initially identified, including:

  • The USDA’s agricultural crop loss data archive (1980-2016). The USDA’s Risk Management Agency has insurance claim records associated with commodity crop loss from 1980 to 2016. Specifically, we are using the cause of loss archive datasets, which are .csv files that summarize insurance claims by month and by county. These data are available for the entire United States.
  • Downscaled gridded climate datasets from 1980-2016. There are several excellent resources that provide gridded meteorological data for both historical and future scenarios (Abatzoglou, 2010; Thornton et al., 2014). For this analysis, we will use GRIDMET data, in combination with other relevant datasets that might assist in understanding crop loss variability and spatial patterns.

*Abatzoglou J.T. and Brown T.J. “A comparison of statistical downscaling methods suited for wildfire applications.” International Journal of Climatology (2012), doi: 10.1002/joc.2312. The dataset MACAv2-METDATA was produced with funding from the Regional Approaches to Climate Change (REACCH) project and the Southeast Climate Science Center (SECSC).
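
Because the claims are summarized by month and by county, one natural integration step is to attach a county-month climate summary to each claim record. The following is a minimal sketch of that join, not the project's actual pipeline. The column names (county, year, month, loss_usd, tmax_c, pdsi) and every data value are hypothetical and were invented for illustration; the real cause of loss files and gridded climate extracts use their own layouts.

```python
import csv
import io

# Hypothetical column names and invented values, for illustration only.
CLAIMS_CSV = """county,year,month,commodity,cause,loss_usd
Whitman,2015,7,WHEAT,Drought,120000
Whitman,2015,8,WHEAT,Heat,45000
Latah,2015,7,WHEAT,Drought,30000
"""

CLIMATE_CSV = """county,year,month,tmax_c,pdsi
Whitman,2015,7,31.2,-3.4
Whitman,2015,8,30.1,-3.9
Latah,2015,7,29.5,-2.8
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_claims_with_climate(claims, climate):
    """Attach county-month climate summaries to each claim record."""
    # Index climate rows by (county, year, month) for direct lookup.
    climate_by_key = {
        (row["county"], row["year"], row["month"]): row for row in climate
    }
    joined = []
    for claim in claims:
        key = (claim["county"], claim["year"], claim["month"])
        match = climate_by_key.get(key)
        if match is None:
            continue  # skip claims with no climate record for that county-month
        joined.append({
            **claim,
            "loss_usd": float(claim["loss_usd"]),
            "tmax_c": float(match["tmax_c"]),
            "pdsi": float(match["pdsi"]),
        })
    return joined


records = join_claims_with_climate(read_rows(CLAIMS_CSV), read_rows(CLIMATE_CSV))
for record in records:
    print(record["county"], record["month"], record["loss_usd"], record["pdsi"])
```

Keying both tables on the same (county, year, month) tuple keeps the join explicit, so any claim without a matching climate summary is dropped rather than silently filled.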

Statistical Modeling

For this supervised learning example, our goal is to develop a finely tuned predictor function h(x). Learning consists of using mathematical algorithms to optimize this function so that, given input data x about a certain domain (say, a county with a maximum temperature above a certain value), it will attempt to predict y (agricultural crop loss for a commodity, county, and season).

In practice, x almost always represents multiple data variables. So, for example, to predict agricultural crop loss, we want to include not just maximum temperature (x1) but also other climate variables:

  • solar radiation (x2),
  • Palmer Drought Severity Index (x3),
  • relative humidity (x4),
  • potential evapotranspiration (x5),
  • fuel moisture (x6),

and so forth.

Determining which feature variables to ultimately use (as well as their spatial and temporal constraints) is an important part of this design.

Let’s say our simple predictor has this form:

where Θ1 and Θ2 are constraints (such as a time period and spatial extent). Our goal is to find the best values of Θ1 and Θ2 to make our predictor work as well as possible (e.g., what temporal constraints should we place on our selection of climate data, given that we are exploring wheat? How many days prior to a loss should we include?).
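
The question of how many previous days to include can be explored empirically. The sketch below assumes that one constraint is a lookback window in days, and that candidate windows are compared by how strongly the windowed mean of daily maximum temperature correlates with loss. Correlation-based selection is a stand-in chosen for this illustration, not necessarily the method used in this research. The function names, candidate windows, and all data values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def window_mean(daily_values, lookback_days):
    """Mean of the last `lookback_days` values before the loss date."""
    window = daily_values[-lookback_days:]
    return sum(window) / len(window)


def best_lookback(daily_series, losses, candidates):
    """Return the candidate window whose mean correlates most with loss."""
    scores = {}
    for days in candidates:
        feature = [window_mean(series, days) for series in daily_series]
        scores[days] = abs(pearson(feature, losses))
    return max(scores, key=scores.get), scores


# Invented daily maximum temperatures (deg C) for four county-seasons,
# each ending on the day before the loss was reported.
daily_tmax = [
    [20, 35, 22, 30, 31, 32],
    [34, 21, 36, 25, 26, 24],
    [25, 30, 20, 33, 35, 34],
    [30, 24, 33, 27, 29, 28],
]
# Invented losses, constructed as 10 times the final three-day mean.
losses = [310, 250, 340, 280]

best, scores = best_lookback(daily_tmax, losses, candidates=[1, 3, 6])
print(best, scores)
```

On this invented data the three-day window scores highest by construction, since the losses were generated from it. With real claims, the same comparison would need to be made on held-out data to avoid choosing a window that merely fits noise.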

Optimizing the predictor h(x) is done using cross-validated training and testing. For each training example, we have an input value x_train, for which a corresponding output y (crop loss) is known in advance. For each example, we find the difference between the known, correct value y and our predicted value h(x_train). With enough training examples, these differences give us a useful way to measure the “wrongness” of h(x). We can then adjust h(x) by changing the values of Θ1 and Θ2 to make it “less wrong”. This process is repeated until the system has converged on the best values for Θ1 and Θ2. In this way, the predictor becomes trained and is ready to do some real-world predicting.
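
The train, measure, and adjust loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's model: it assumes a linear predictor with two parameters (an intercept and a slope, stored as theta[0] and theta[1]), uses mean squared error as the measure of “wrongness”, and adjusts the parameters by batch gradient descent. The learning rate, step count, fold scheme, and all data values are invented.

```python
def predict(theta, x):
    """Assumed linear predictor h(x) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def mean_squared_error(theta, xs, ys):
    """Average squared difference between predictions and known losses."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit both parameters by batch gradient descent on the squared error."""
    theta = [0.0, 0.0]
    n = len(xs)
    for _ in range(steps):
        errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to each parameter.
        grad0 = 2 * sum(errors) / n
        grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        theta[0] -= learning_rate * grad0
        theta[1] -= learning_rate * grad1
    return theta


def cross_validate(xs, ys, folds=4):
    """Return the held-out mean squared error for each of `folds` folds."""
    scores = []
    for k in range(folds):
        # Every `folds`-th example, starting at k, is held out for testing.
        test_idx = set(range(k, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [x for i, x in enumerate(xs) if i in test_idx]
        test_y = [y for i, y in enumerate(ys) if i in test_idx]
        theta = fit(train_x, train_y)
        scores.append(mean_squared_error(theta, test_x, test_y))
    return scores


# Invented, standardized temperature anomalies and loss values that follow
# y = 2 + 3x plus small fixed perturbations.
xs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
ys = [2 + 3 * x + e for x, e in zip(xs, noise)]

theta = fit(xs, ys)
scores = cross_validate(xs, ys)
print(theta, scores)
```

Scoring each fold only on examples the fit never saw is what separates cross-validated error from training error; a predictor that looks good only on its own training data would show large held-out scores here.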

Dashboard and API Construction

We are constructing two dashboards for agricultural analysis:

Agriculture Data Discovery Dashboard: Analytic dashboard that explores agricultural commodity systems data compared to a variety of related variables.

Agriculture Prediction Dashboard: Streamlined widget/dashboard that provides explicit predictions of insurance commodity loss claim counts, by commodity type.