Title: " Exploring landslide severity across Washington using machine learning techniques"

Erich Seamon

erichs@uidaho.edu

http://webpages.uidaho.edu/erichs

Introduction and Overview of Dataset

The premise of this research will be to develop a set of models that effectively predict landslide severity, within the state of Washington, using geomorphic and landscape-driven feature variables, such as geology, soils, gradient, land-use, and previous landslide type.

Previous work by Ardizzone et al (2002), Ayalew and Yamagishi (2005), Ohlmacher and David (2003), all used logistical regression as a classifer method for landslide analysis. To expand upon this work, this project will evaluate several differing models to predict landslide severity:

  • logistical regression
  • naive bayes
  • knn
  • random forest
  • decision tree
  • bagged decision tree
  • bagged random forest

Under this approach, landslide explanatory variables will be trained on a training data set of an observed landslide locations, with thematic data such as morphometric attributes (gradient, slope morphology) as well as information on geology and landuse.

The eventual goal will be to create a landslide model that is capable of returning a predictive result in the form of a geographic mapspace (essentially a prediction for each pixel).

Data Sources

2007 and 2008 Landslide data for the state of Washington, containing over 52,000 observations, was accessed from the Washington Department of Natural Recreation (WA DNR). The data was provided as a downloadable .gdb file (geodatabase file).

http://www.dnr.wa.gov/publications/ger_portal_landslides_landforms.zip

Data Transformation Before Analysis

The data was imported as a csv file - that was derived from the provided geodatabase file. This csv file contained over 52,000 observations, with latitude and longitude included as well. The csv file was transformed into a a pandas data frame, with text based categorical fields transformed to numeric values. From this pandas data frame, our feature columns (X) and our response variable (y) were derived.

From the 52,000+ landslide observations, the number was reduced to:

12,241

After eliminating records with NANs and other missing values needed for the analysis.

In [2]:
import PIL
from scipy import stats
import os,sys
import numpy as np
from PIL import Image
from urllib2 import urlopen
In [3]:
basewidth = 1100
img = Image.open(urlopen("http://webpages.uidaho.edu/erichs/dmine-data/Landslide_data_flow.jpg"))
wpercent = (basewidth / float(img.size[0]))
hsize = int((float(img.size[1]) * float(wpercent)))
img = img.resize((basewidth, hsize), PIL.Image.ANTIALIAS)
img.save("landslide_information_flow_refined.jpg")
In [4]:
import os,sys
from PIL import Image
jpgfile = Image.open("landslide_information_flow_refined.jpg")
jpgfile
Out[4]:

Response and Feature Variables:

Morphometric feature variables:

1) Slope Morphology Shape (SLOPE_MORP) - Planar, Convex, Concave, etc.
2) Land Use (LAND_USE) - Forestry, Road/Rail/Trail, Undistubed, Urban Development
3) Landslide Type (LANDSLIDE1) - Debris Flows, Debris Slides and Avalanches, Shallow Undifferentiated, etc.
4) Gradient (GRADIENT_D - transformed to gradient_cat) - gradient of the landslide location, in degrees.
5) Geology (GEOLOGIC_1) - geologic unit
6) Soils (reacch_soi) - soils

Response Variable:

1) Landslide Severity (LANDSLID_3) - Low (0), Med (1) or High (2).

Multiclass classification means a classification task with more than two classes. Multiclass classification makes the assumption that each sample is assigned to one and only one label.

Preliminary Data Import and Analysis

In [23]:
import matplotlib
%matplotlib nbagg

from pyproj import Proj
import StringIO
from pandas import DataFrame
import pandas as pd
import seaborn as sns
import pydot
from IPython.display import Image

from urllib2 import Request, urlopen
import json
from pandas.io.json import json_normalize
import numpy
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import KFold
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
%load_ext memory_profiler
from sklearn.ensemble import RandomForestClassifier
The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler

Import csv file for all 52,000+WA landslides

In [5]:
import csv
In [21]:
walandslides_all = pd.read_csv('http://webpages.uidaho.edu/erichs/dmine-data/WALandslides_export4.csv')
In [22]:
walandslides = pd.read_csv('http://webpages.uidaho.edu/erichs/dmine-data/WALandslides.csv')

Table of all values for initial landslide dataset

In [9]:
walandslides_allxy = [walandslides_all.OBJECTID, walandslides_all.POINT_X, walandslides_all.POINT_Y]
walandslides_allxy = pd.DataFrame(walandslides_allxy)
walandslides_allxy = walandslides_allxy.transpose()
walandslides = pd.DataFrame(walandslides)
walandslides = pd.merge(walandslides_allxy, walandslides, on='OBJECTID', how='outer')

Set landslide type as numeric

In [10]:
stringh1 = set(walandslides.LANDSLIDE1)
J = list(stringh1)
J2 = list(range(1, 106))
In [11]:
i2 = iter(J)
j2 = iter(J2)
k2 = list(zip(i2, j2))
kdict2 = dict(k2)
kdict2.values()
walandslides['LANDSLIDE1'].replace(kdict2, inplace=True)
#print walandslides.LANDSLIDE1

Convert categorical text columns to categorical numerical

In [12]:
walandslides['DATA_CONFI'] = walandslides.DATA_CONFI.map({'Low':0, 'Moderate-High':1})
walandslides['SLOPE_MORP'] = walandslides.SLOPE_MORP.map({'Planar':0, 'Concave-Planar':1, 'Concave, convergent':2, 'Planar-Concave':3, 'Planar-convex':4})
walandslides['LANDSLID_3'] = walandslides.LANDSLID_3.map({'Questionable':0, 'Probable':1, 'Certain':2, 'Unknown':3})

Reduce dataset by eliminating NANs, set feature columns for X

In [13]:
walandslides = walandslides[np.isfinite(walandslides['SLOPE_MORP'])]
walandslides = walandslides[np.isfinite(walandslides['LANDSLIDE1'])]
walandslides = walandslides[walandslides.LANDSLID_3 != 3]
walandslides = walandslides[walandslides.GRADIENT_D != 0]
feature_cols = ['gradient_cat', 'GEOLOGIC_1', 'reacch_soi', 'LAND_USE', 'LANDSLIDE1', 'SLOPE_MORP']

Convert GRADIENT_D to categorical, using quantiles

In [14]:
labelz = ["1", "2", "3", "4", "5"]
gradient = pd.DataFrame(walandslides.GRADIENT_D)
gradient = pd.DataFrame(pd.qcut(walandslides.GRADIENT_D, 5, labels = labelz))
gradient.rename(columns={'GRADIENT_D':'gradient_cat'}, inplace=True)
walandslides = pd.concat([walandslides, gradient], axis=1, join_axes=[gradient.index])

Plot the refined WA landslides (~12000 landslide observations)

In [82]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
%matplotlib nbagg
my_map2 = Basemap(llcrnrlon=-125, llcrnrlat=45, urcrnrlon=-115, urcrnrlat=51, projection='tmerc', lat_1=33, lat_2=45,
     lon_0=-119, lat_0=45, resolution='h', area_thresh=10000)
my_map2.drawcoastlines()
my_map2.drawcountries()
my_map2.fillcontinents(color='coral')
lon2 = np.array(walandslides.POINT_X)
lat2 = np.array(walandslides.POINT_Y)
y2,x2 = my_map2(lon2, lat2)
my_map2.plot(y2,x2, 'ro', markersize=4, markeredgecolor = 'k')
plt.show()

Final X and y - reduced to a size of 12241 observations

In [16]:
walandslides['LAND_USE'] = walandslides.LAND_USE.map({'Forestry related activities':0, 'Undisturbed':1, 'Unknown':2, 'Urban development':3, 'Road, railroad, trail':4})
In [17]:
X = walandslides[feature_cols]
y = walandslides.LANDSLID_3
y = pd.concat([y], axis=1)
X[:5]
Out[17]:
gradient_cat GEOLOGIC_1 reacch_soi LAND_USE LANDSLIDE1 SLOPE_MORP
1 3 29787 55856 4 12 0
2 5 33552 55805 4 3 4
4 1 32784 55804 4 12 0
5 4 30800 55837 4 3 0
6 3 30800 55837 3 12 0

put X and y together for purpose of boxplots later

In [18]:
Xy = pd.concat([X, y], axis=1, join_axes=[X.index])

Data Exploration

Initial breakdown of gradient by degree (categorized)

In [19]:
gradient_pd = X['gradient_cat'].value_counts()
X['gradient_cat'].value_counts()
Out[19]:
3    3044
2    2507
1    2464
5    2239
4    1987
dtype: int64
In [83]:
import matplotlib
%matplotlib nbagg
gradient_pd.plot(kind='bar');

Initial breakdown of geology

In [21]:
geology_pd = X['GEOLOGIC_1'].value_counts()
geology_unique = pd.unique(X.GEOLOGIC_1.ravel())
geology_unique.shape
Out[21]:
(907,)
In [84]:
import matplotlib
%matplotlib nbagg
geology_pd.plot(kind='bar');

Initial breakdown of soils

In [23]:
soils_pd = X['reacch_soi'].value_counts()
soils_unique = pd.unique(X.reacch_soi.ravel())
soils_unique.shape
Out[23]:
(656,)
In [85]:
import matplotlib
%matplotlib nbagg
soils_pd.plot(kind='bar');

Initial breakdown of slope morphology

In [25]:
slopemorp_pd = X['SLOPE_MORP'].value_counts()
X['SLOPE_MORP'].value_counts()
Out[25]:
2    6526
0    5199
4     516
dtype: int64
In [86]:
import matplotlib
%matplotlib nbagg
slopemorp_pd.plot(kind='bar');

Initial breakdown of land use

In [27]:
landuse_pd = X['LAND_USE'].value_counts()
X['LAND_USE'].value_counts()
Out[27]:
0    5981
2    3199
4    2479
1     529
3      53
dtype: int64
In [87]:
import matplotlib
%matplotlib nbagg
landuse_pd.plot(kind='bar');

Initial breakdown of landslide type

In [29]:
landslidetype_pd = X['LANDSLIDE1'].value_counts()
X['LANDSLIDE1'].value_counts()
Out[29]:
12    5496
11    2262
10    1913
3     1542
13     434
9      284
6      125
1      109
2       33
5       28
8       15
dtype: int64
In [88]:
import matplotlib
%matplotlib nbagg
landslidetype_pd.plot(kind='bar');

Response Variable - LANDSLID_3 - Landslide severity (0=Low, 1=Med, 3=High)

In [31]:
landslid3_pd = y['LANDSLID_3'].value_counts()
y['LANDSLID_3'].value_counts()
Out[31]:
2    7501
1    3806
0     934
dtype: int64
In [89]:
import matplotlib
%matplotlib nbagg
landslid3_pd.plot(kind='bar');

run train test split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

Begin Model analysis

Calculate Null Accuracy

In [34]:
import time
start = time.time()

nulltime1 = %%timeit -o 1 + 2
null = y_test['LANDSLID_3'].value_counts().head(1) / len(y_test['LANDSLID_3'])
print null

end = time.time()
nulltime1 = end - start
print nulltime1
10000000 loops, best of 3: 29.7 ns per loop
2    0.613198
dtype: float64
1.66566896439

10-fold cross-validation with logistic regression

In [35]:
y = np.ravel(y)
In [36]:
import time
start = time.time()

logregtime1 = %%timeit -o 1 + 2
#%%memit
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X,y)
logreg_scores_mean = cross_val_score(logreg, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy').mean()

end = time.time()
logregtime1 = end - start
print logregtime1
The slowest run took 40.07 times longer than the fastest. This could mean that an intermediate result is being cached 
10000000 loops, best of 3: 29.8 ns per loop
3.05441808701
In [37]:
print logreg_scores_mean
0.612775243431

Naive Bayes Multinomial

In [38]:
import time
start = time.time()

from sklearn.naive_bayes import MultinomialNB
nbm = MultinomialNB()
nbm.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
nbm_scores = cross_val_score(nbm, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')

end = time.time()
nbmtime1 = end - start
print nbmtime1
0.212238788605
In [39]:
nbm_mean_score = nbm_scores.mean()
print nbm_mean_score
0.401841069761

10-fold cross-validation with K Nearest Neighbor

In [40]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=9)
#scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
knn_scores = cross_val_score(knn, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')
print knn_scores
[ 0.67918367  0.69444444  0.6870915   0.70179739  0.67973856  0.69689542
  0.69444444  0.69934641  0.69199346  0.70424837]
In [41]:
mean_score = knn_scores.mean()
print mean_score
0.692918367347

Search for an optimal value of K for KNN

In [42]:
# search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')
    k_scores.append(scores.mean())
print k_scores
[0.7251043083900226, 0.66154568494064292, 0.69234627184207009, 0.67706762705082035, 0.69920541549953319, 0.68678964919301055, 0.69577597705748961, 0.68662658396692022, 0.69259003601440572, 0.69160984393757508, 0.69528631452581036, 0.68932466319861274, 0.695858143257303, 0.69201880752300926, 0.69740949713218625, 0.68973042550353481, 0.69356989462451635, 0.69128384687208211, 0.69169187675070032, 0.69373436041083092, 0.68948599439775904, 0.68711698012538336, 0.68940589569160993, 0.68924229691876748, 0.68752320928371347, 0.68989455782312936, 0.68540302787781771, 0.68703561424569837, 0.68531966119781251, 0.68107036147792455]
In [43]:
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
Out[43]:
<matplotlib.text.Text at 0x7fb9459a7dd0>

Re-run KNN with an optimized k=1

In [44]:
import time
start = time.time()

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X,y)
knn_scores = cross_val_score(knn, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy').mean()

end = time.time()
knntime1 = end - start
print knntime1
0.402378082275
In [45]:
print knn_scores
0.724614712552

Decision Tree - Fit a classification tree with an initial max_depth=3 on all data

In [46]:
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
scores = cross_val_score(treeclf, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')
treeclf.fit(X, y)
print scores.mean()
0.665875083367

Decision Tree - search for an optimal gamma/depth for Decision Tree

In [47]:
from sklearn.tree import DecisionTreeClassifier
clf_range2 = range(1, 100)
clf_scores2 = []
for k in clf_range2:
    clf = DecisionTreeClassifier(max_depth=k)
    clfscores2 = cross_val_score(clf, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')
    clf_scores2.append(clfscores2.mean())
print clf_scores2
[0.64382072829131654, 0.65395044684540493, 0.66587741763372033, 0.67012525010004009, 0.67257469654528479, 0.6755991063091904, 0.67927571028411371, 0.6963478724823261, 0.69749059623849541, 0.70696938775510199, 0.70860250766973454, 0.71252347605708954, 0.72151007069494455, 0.71685347472322258, 0.72641116446578624, 0.72739155662264898, 0.72869821261838064, 0.72837148192610379, 0.73065986394557825, 0.73278611444577835, 0.73482626383886895, 0.7332734427104175, 0.73841990129385093, 0.73262218220621589, 0.73033159930638925, 0.73074223022542351, 0.73212931839402429, 0.72722689075630265, 0.7299238362011472, 0.72657469654528473, 0.73490529545151395, 0.7333548085901026, 0.73727390956382544, 0.72959837268240635, 0.73204748566093114, 0.73090409497132192, 0.73384573829531807, 0.73025196745364818, 0.7320498866213152, 0.7348259970654929, 0.7353155929038282, 0.73188482059490456, 0.73368240629585169, 0.73825603574763243, 0.73115059357076162, 0.73302894491129789, 0.73547819127651071, 0.73033373349339725, 0.73850146725356813, 0.73474142990529534, 0.73229338402027477, 0.72820794984660531, 0.73441663331999474, 0.72927117513672146, 0.73016866746698672, 0.73515159397092167, 0.73204748566093092, 0.73539775910364147, 0.73368133920234757, 0.7319652527677738, 0.73180405495531542, 0.72935314125650252, 0.73188281979458447, 0.72861711351207137, 0.72935267440309448, 0.7313944244364412, 0.73082286247832473, 0.72967973856209156, 0.73515086034413757, 0.73041436574629859, 0.73237388288648797, 0.73000660264105643, 0.73384647192210228, 0.73637941843404031, 0.73082139522475653, 0.7315564892623716, 0.72788028544751238, 0.73311024409763914, 0.73433653461384552, 0.72861858076563957, 0.73360070694944635, 0.73425176737361597, 0.73082186207816457, 0.73073976257169537, 0.73025050020008009, 0.73359983993597433, 0.73033413365346145, 0.73441629985327472, 0.73433653461384563, 0.73417120181405893, 0.72771748699479799, 0.73743964252367622, 0.73270208083233312, 0.72861858076563957, 0.72894371081766041, 0.7327833133253302, 0.73343597438975594, 0.73245658263305324, 0.73008636788048553]

Decision Tree - Plot accuracy of cross validation runs vs. values of depth

In [90]:
import matplotlib
%matplotlib nbagg
plt.plot(clf_range2, clf_scores2)
plt.xlabel('Value of Depth for Decision Tree')
plt.ylabel('Cross-Validated Accuracy')
Out[90]:
<matplotlib.text.Text at 0x7fb8ea5dfad0>

Decision Tree - Model optimization results:

After examining accuracy for a variety of depths, its appears that a value of ~41 for a max depth is optimal in terms of cross validation accuracy.

In [49]:
import time
start = time.time()

clftime1 = %%timeit -o 1 + 2
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=41, random_state=1)

treeclf_scores = cross_val_score(clf, X, y, cv=KFold(X.shape[0], n_folds=10, shuffle=True), scoring='accuracy')
treeclf.fit(X, y)

end = time.time()
clftime1 = end - start
print clftime1
10000000 loops, best of 3: 29.7 ns per loop
1.77246594429
In [50]:
treeclf_scores_mean = treeclf_scores.mean()
print treeclf_scores_mean
0.735886954782
In [51]:
from StringIO import StringIO
tree_landslide = StringIO()

from sklearn.tree import DecisionTreeClassifier, export_graphviz
export_graphviz(treeclf, out_file=tree_landslide)
In [52]:
graph = pydot.graph_from_dot_data(tree_landslide.getvalue())  
Image(graph.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.179177 to fit

Out[52]: