After data assembly and initial EDA in Step 3, we have prepared our datasets for feature extraction and transformation. As part of that initial EDA, we refined our study area temporally and spatially, reducing our region to a set of 26 counties in Idaho, Washington, and Oregon over the period 2007-2015 (see inset map below).
Step 4 focuses on preparing data for model construction. This includes:
- feature selection, feature extraction, and feature transformation on the subsetted 26-county area;
- construction of an algorithm to select the best climate data (explanatory, or independent, variables) in association with our crop loss data (the response, or dependent, variable).
What do we mean by feature selection and feature extraction? Both involve reducing the amount of resources required to describe a large set of data. Feature extraction is a general term for methods of constructing combinations of variables that work around dimensionality problems while still describing the data with sufficient accuracy. More sophisticated approaches include:
- dimensionality reduction using a wide spectrum of statistical techniques (least squares, principal component analysis, multi-factor dimensionality reduction, nonlinear dimensionality reduction)
- feature engineering, where an expert in the topical discipline helps shape how features are extracted and transformed so that they best represent the prediction problem. In our case, we want to automate this feature extraction and engineering.
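As a minimal sketch of the dimensionality-reduction idea, the following computes principal components (PCA) over a randomly generated stand-in for a matrix of monthly climate variables; none of the data here is from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
climate = rng.normal(size=(100, 12))   # 100 county-months x 12 climate variables

# Center the columns, then obtain principal components via SVD
centered = climate - climate.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 3                                  # keep the 3 strongest components
reduced = centered @ Vt[:k].T          # projected feature matrix, shape (100, 3)
explained = (S[:k] ** 2).sum() / (S ** 2).sum()  # fraction of variance retained
```

The same reduction is available as `sklearn.decomposition.PCA`; the SVD form is shown here only to keep the sketch dependency-free.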
Feature Extraction, Agricultural Crop Loss and Climate Data
For our example of agricultural commodity loss and climate data, the issue of feature construction/extraction is essential. Our feature selection question zeroes in on this issue:
What are the appropriate spatial and temporal relationships between a particular commodity (wheat), a location (Whitman County, WA), and a point in time (July 2009)?
Each insurance claim is an individual record of crop loss. Each claim record has, at minimum:
- a loss value,
- an acreage,
- a commodity type, and
- a cause of loss.
Each record is at a monthly, county level. We want to be able to associate the most impactful/representative vector of climate data to each claim record.
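Conceptually, attaching a climate vector to each claim record is a join on (county, year, month). A minimal sketch with hypothetical field names and values (not the project's actual schema):

```python
# Hypothetical claim record and monthly climate summary; the field names
# and values are illustrative only.
claims = [
    {"county": "Whitman", "state": "WA", "year": 2009, "month": 7,
     "commodity": "Wheat", "loss_usd": 125000.0, "acres": 800,
     "cause": "Drought"},
]
climate = {
    ("Whitman", 2009, 7): {"tmax_c": 31.2, "precip_mm": 8.4},
}

def attach_climate(record, climate_lookup):
    """Join the county/month climate vector onto one claim record."""
    key = (record["county"], record["year"], record["month"])
    return {**record, **climate_lookup.get(key, {})}

enriched = [attach_climate(c, climate) for c in claims]
```

In practice this would be a `pandas` merge on the shared keys; the pure-Python version just makes the join logic explicit.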
Why do we want to do this? Our goal is to predict crop loss using climate data. To do that, we need to aggregate our climate data in the way that best captures its relationship to the commodity in question.
In our example, we are looking at wheat. To select the right climate variables spatially and temporally for this commodity, we need to know more about wheat growth.
- How long does wheat take to grow?
- Can we assume that the date a claim is made reflects the actual time the crop was in the field? (e.g., if a claim was made for December 2009, can we assume it is for a crop that was in the field in close temporal proximity to December 2009?)
- Are there differences between spring and fall wheat growth patterns?
- Are there weighted importance values for the wheat phenology cycle that we want to apply to our feature vector?
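One way to handle the last question is a weighted temporal aggregation: each month of the growing season contributes to the feature in proportion to a phenology weight. The weights and precipitation values below are placeholders, not agronomic estimates:

```python
# Hypothetical phenology weights over an assumed April-July wheat season;
# these numbers are placeholders, not agronomic values.
phenology_weights = {4: 0.10, 5: 0.25, 6: 0.35, 7: 0.30}

# Illustrative monthly precipitation (mm) for one county and year
monthly_precip = {4: 40.0, 5: 22.0, 6: 10.0, 7: 5.0}

# Collapse the season into a single phenology-weighted feature
weighted_precip = sum(w * monthly_precip[m]
                      for m, w in phenology_weights.items())
```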
In order to extract the most appropriate features for each record, we need to develop an algorithm that can extract climate data based on a defined seasonal model for each commodity.
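As a sketch, the seasonal part of such an algorithm could map each claim's (year, month) back to the window of months the crop was plausibly in the field. The default season length here is an assumption for illustration, not a modeled value:

```python
def season_window(claim_year, claim_month, growth_months=9):
    """Return (year, month) pairs for an assumed growing window ending at
    the claim month. growth_months=9 is a placeholder default; the real
    length would come from the commodity's phenology."""
    months = []
    y, m = claim_year, claim_month
    for _ in range(growth_months):
        months.append((y, m))
        m -= 1
        if m == 0:                     # step back across the year boundary
            y, m = y - 1, 12
    return list(reversed(months))

# e.g. a December 2009 wheat claim with a 3-month lookback
window = season_window(2009, 12, growth_months=3)
```

The climate vector for a claim would then be built only from the months this function returns, optionally combined with phenology weights as above.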