Methodology Step 2: Data Assembly and Inventory

USDA Commodity Loss Insurance claim data structure (USDA RMA, 2016)

Assembling data and transforming it for use in a machine learning model is an important, and non-trivial, part of this overall methodology. Understanding which features are important, organizing those features into a matrix structure for analysis, and converting disparate variables to numeric or categorical form can all be difficult and time consuming.  In addition, feature engineering, the transforming or combining of existing variables into new predictors, can be a critical part of the data assembly process. When combined or organized in a particular way, engineered features can help improve the performance of a model.
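As a minimal sketch of this idea, the snippet below converts a text variable to categorical dummy columns and engineers a new predictor from two existing ones. The pipeline itself uses R; this Python/pandas example is for illustration only, and all column names and values (`pr`, `tmmx`, `damagecause`, the ratio) are hypothetical stand-ins, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical monthly county records (names and values are illustrative only)
df = pd.DataFrame({
    "county": ["Whitman", "Whitman", "Latah"],
    "month": [6, 7, 6],
    "pr": [25.0, 5.0, 30.0],       # precipitation (mm)
    "tmmx": [28.0, 34.0, 26.0],    # max temperature (C)
    "damagecause": ["Drought", "Drought", "Hail"],
})

# Convert a disparate text variable into categorical indicator columns
df = pd.get_dummies(df, columns=["damagecause"], prefix="cause")

# Engineer a new predictor by combining existing variables:
# a crude heat-to-moisture ratio (an assumption, not a claim about the real model)
df["heat_moisture_ratio"] = df["tmmx"] / (df["pr"] + 1.0)

# Organize the features into a matrix structure for analysis
feature_matrix = df.drop(columns=["county"])
print(feature_matrix.columns.tolist())
```

The same pattern scales to any number of categorical variables; the point is that the encoded and engineered columns all end up in one numeric matrix ready for modeling.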

Data Assembly

In Step 2, we begin assembling all of the datasets for this agricultural analysis, as described in Step 1, including:

  1. climate array data for 2001-2015;
  2. agricultural crop loss data for 2001-2015, aggregated by county and month.
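The two inventories above share county and month keys, so they can be joined into a single analysis table. Below is a minimal Python/pandas sketch of that join (the actual work is done in R; the column names and values here are hypothetical):

```python
import pandas as pd

# Hypothetical monthly climate summaries per county
climate = pd.DataFrame({
    "county": ["Whitman", "Whitman", "Latah"],
    "year": [2001, 2001, 2001],
    "month": [6, 7, 6],
    "pr": [25.0, 5.0, 30.0],  # precipitation (mm), illustrative values
})

# Hypothetical monthly crop loss claims per county
loss = pd.DataFrame({
    "county": ["Whitman", "Latah"],
    "year": [2001, 2001],
    "month": [7, 6],
    "loss_usd": [120000.0, 45000.0],
})

# Left join keeps every climate record; months with no claims get NaN losses
merged = pd.merge(climate, loss, on=["county", "year", "month"], how="left")
print(merged)
```

A left join is used so that county-months with climate data but no recorded loss are retained (useful later as zero-loss examples) rather than silently dropped.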


For Step 2, we do most of this transformation in R, using four scripts:

  1. DMINE-croploss-met.R: clips, extracts, and merges climate data with agricultural data; summarizes the resulting data files and creates PNG images for later reference.
  2. DMINE-croploss-timelapse.R: creates crop loss maps, bar charts, and time-lapse animations for individual states.
  3. DMINE-gridmet-monthly-rev6_threestate.R: does the same as above, except that it merges all of the states' data and creates a single three-state time-lapse animation.  Not as easy as it sounds!
  4. DMINE-palouse.R: extracts Palouse data for the 28-county region and prepares it for use.

Each of these scripts generates a data file or other output that is used in a dashboard.
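The "clip and extract" step in the first script amounts to averaging gridded climate values over the cells that fall inside a county boundary. A minimal NumPy sketch of that idea is below; the real scripts operate on actual gridMET rasters and county shapefiles in R, whereas the grid and mask here are tiny hypothetical arrays.

```python
import numpy as np

# Hypothetical 4x4 monthly precipitation grid (mm); values are illustrative
precip = np.array([
    [10., 12., 14., 16.],
    [11., 13., 15., 17.],
    [20., 22., 24., 26.],
    [21., 23., 25., 27.],
])

# Boolean mask marking the grid cells inside one county's boundary
# (in practice this comes from rasterizing the county polygon)
county_mask = np.zeros((4, 4), dtype=bool)
county_mask[0:2, 0:2] = True  # assume the county covers the upper-left cells

# "Clip and extract": average the climate values within the county
county_mean = precip[county_mask].mean()
print(county_mean)  # mean of 10, 12, 11, 13
```

Repeating this per county and per month yields the county-level monthly climate summaries that get merged with the crop loss records.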

In Step 3, we develop dynamic feature extraction and feature transformation approaches.  This step is important for organizing the best set of features (independent variables) alongside our associated predicted (dependent) variable.
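To preview what such a transformation step can look like, the sketch below standardizes a feature matrix and ranks features by their absolute correlation with the target. This is a generic illustration of feature selection on synthetic data, not the specific approach developed in Step 3.

```python
import numpy as np

# Synthetic data: 100 samples, 3 candidate features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# The target depends (by construction) only on the first feature
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Standardize features so they are on a common scale
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Rank features by absolute correlation with the dependent variable
corrs = [abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(Xs.shape[1])]
best = int(np.argmax(corrs))
print(corrs, best)
```

Simple correlation ranking is only a starting point; the dynamic approaches in Step 3 go further, but the standardize-then-score pattern shown here underlies many of them.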