a year ago

March 18, 2016

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter *k*. A user of these algorithms is required to select this value, which raises the questions: what is the "best" value of *k* that one should select to solve their problem?

This mini-episode explores the appropriate value of *k* to use when trying to estimate the cost of a house in Los Angeles based on the closests sales in it's area.

In [9]:

```
%matplotlib inline
import os
import json
import pickle
import geocoder
import math
import random
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt
```

In [3]:

```
addresses = {}
```

In [4]:

```
root = '../data/'
allfiles = []
dirs = os.listdir(root)
for d in dirs:
files = os.listdir(root + d)
for f in files:
allfiles.append(root + d + '/' + f)
```

In [544]:

```
random.shuffle(allfiles)
len(allfiles)
```

Out[544]:

In [45]:

```
def persist(addresses):
save = {}
for key in addresses.keys():
addr = addresses[key]
val = addr.latlng
val.append(addr.accuracy)
save[key] = val
#
h = open('addresses.pkl', 'wb')
pickle.dump(save, h)
h.close()
```

In [120]:

```
def flatten(sarr):
if type(sarr) == str or type(sarr) == unicode:
return sarr
res = ''
for s in sarr:
s = s.strip()
if len(s) > 0:
if len(res) > 0:
res += s + ', '
else:
res = s
return res
```

In [314]:

```
def numClean(x):
if x is None:
return None
if x == '':
return None
if type(x) == float:
return x
try:
return float(x.replace('$', '').replace(',', ''))
except ValueError:
return None
```

In [306]:

```
def clean_details(details, blockLLMap):
c = 0
details['Legals'] = flatten(details['Legals'])
if details.has_key('Improvements'):
di = details['Improvements']
if len(di)>0:
imps = di[0]
for key in imps.keys():
val = imps[key]
details['improve_' + key] = val
del details['Improvements']
convertFields = ['SalePrice', 'FIXEVEV', 'FIXTVAL', 'HOEVAL', 'IMPROVAL', 'LANDVAL', 'PPEXEVAL', 'PPVALUE', 'REEXEVAL', 'improve_BDLBATH', 'improve_BDLBDRM', 'improve_BDLEFFYR', 'improve_BDLSQMAIN', 'improve_BDLUNITS', 'improve_BDLYRBLT']
for cf in convertFields:
if details.has_key(cf):
details[cf] = numClean(details[cf])
head = int(details['AIN'])/1000 * 1000 + 1
try:
ll = blockLLMap[head]
details['latitude'] = ll.lat
details['longitude'] = ll.lng
except KeyError:
return details
return details
```

In [152]:

```
errcount = 0
c = 0
blockLLMap = {}
alldetails = []
for f in allfiles:
try:
data = json.loads(pickle.load(open(f, 'r')).text)
details = data['results']['ParcelDetails']
if details:
if int(details['AIN']) % 1000 == 1:
addr = details['Address1'] + ' ' + details['Address2']
try:
loc = addresses[addr]
c += 1
except KeyError:
print 'Lookup', addr
loc = geocoder.google(addr)
addresses[addr] = loc
blockLLMap[int(details['AIN'])] = loc
if c % 1000 == 0:
print 'Completed', 100.0 * c / len(allfiles), errcount
alldetails.append(details)
except:
print 'Error on: ', f
errcount += 1
if errcount>10:
break
# Since I did a bunch of slow API lookups, let's cache this in case I want to use it again in the future.
persist(addresses)
```

In [214]:

```
# Do data cleaning
errcount = 0
c = 0
all2 = []
for details in alldetails:
details = clean_details(details, blockLLMap)
all2.append(details)
df = pd.DataFrame(all2)
```

In [254]:

```
# See how many latitudes are missing given my assumption
df['latitude'].apply(lambda x: math.isnan(x)).sum()
```

Out[254]:

In [ ]:

```
c=0
gaps = df[df['latitude'].apply(lambda x: math.isnan(x))]
print gaps.shape
for r in range(gaps.shape[0]):
row = gaps.iloc[r]
addr = row['Address1'] + ' ' + row['Address2']
try:
loc = addresses[addr]
c += 1
except KeyError:
print 'Lookup', addr
loc = geocoder.google(addr)
addresses[addr] = loc
blockLLMap[int(details['AIN'])] = loc
if c % 1000 == 0:
print 'Completed', 100.0 * r / gaps.shape[0], errcount
```

In [225]:

```
df['AIN'] = df['AIN'].astype(int)
```

In [226]:

```
# save df to csv
df.to_csv('details.csv', sep='\t', index=False, encoding='utf-8')
```

In [ ]:

```
# all on block are considered 0
```

In [545]:

```
# Sample of the data
df.head()
```

Out[545]:

In [266]:

```
df.shape
```

Out[266]:

In [546]:

```
# Filter down to those I have a price for
f1 = df['SalePrice'] > 0
f1 = df['SalePrice'] > 0
f2 = df['latitude'].apply(lambda x: not(math.isnan(x)))
df2 = df[f1 & f2]
print 'Remaining after filter:', 1.0 * df2.shape[0] / df.shape[0]
print df2.shape[0]
```

I have a disappointingly small subset, but let's see if we can make the most of it.

*nearest* neighbors, this error might be minimized, but I'm not sure if it will be trivial. Yet, I'm skipping this step for now, since the objective in this particular mini-episode is to talk about the Elbow Method, not about distance measures.

In [344]:

```
# load to k-d tree
lat = df2['latitude']
lng = df2['longitude']
df2.index = np.arange(df2.shape[0]) # Just to make sure the index is useful to us
k=2
kdt = cKDTree(zip(lat,lng))
```

In [540]:

```
maxk = 25
```

In [541]:

```
# iterate over k, calculate average price, store mean and accuracy
results = []
for k in range(1, maxk):
for r in np.arange(df2.shape[0]):
row = df2.iloc[r]
sp = row['SalePrice']
distances, neighbors = kdt.query((row['latitude'], row['longitude']), k=k)
if type(distances) == float:
distances = pd.Series(distances)
if type(neighbors) == int:
neighbors = pd.Series(neighbors)
if sum(neighbors==r):
selfidx = np.where(neighbors == r)[0][0]
neighbors = pd.Series(neighbors)
neighbors.drop(selfidx, inplace=True)
avg_dist = distances.mean()
#avg_sp = df2.iloc[neighbors]['SalePrice'].mean()
avg_sp = math.sqrt(sum((sp - df2.iloc[neighbors]['SalePrice']).apply(lambda x: math.pow(x, 2))))
result = [sp, k, avg_dist, avg_sp, len(neighbors)]
results.append(result)
```

In [542]:

```
rdf = pd.DataFrame(results)
rdf.columns = ['SalePrice', 'k', 'adist', 'avgNeighborSalePrice', 'numNeighbors']
# There are some extreme outliers, so let's trim the top and bottom
low, high = rdf['SalePrice'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['SalePrice'] > low]
rdf = rdf[rdf['SalePrice'] < high]
rdf['delta'] = rdf['avgNeighborSalePrice'] - rdf['SalePrice']
rdf['sq_err'] = rdf['delta'].apply(lambda x: abs(x))
low, high = rdf['delta'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['delta'] > low]
rdf = rdf[rdf['delta'] < high]
gb = rdf.groupby('k')['sq_err'].mean()
```

In [543]:

```
plt.figure(figsize=(10,5))
plt.bar(gb.index, gb)
plt.xlabel('k=')
plt.gca().xaxis.grid(False)
plt.title('Mean square error ')
plt.xticks(gb.index + 0.4, gb.index)
plt.xlim(.75, maxk)
plt.show()
```

The figure above plots the mean square error of the price of a home to it's k nearest neighbors. Naturally, this diverges as k increases, because it means we're searching further and further away from the home for comparables, which means homes that are less comparable in terms of location.

Yet, we find a small minimium with k=3, which would imply that the optimal number of nearest neighbors to use to estimate the price of a home based *strictly* on price alone, is three.

The reader should be a bit skeptical of this result, because we have some unstated assumptions about the data. First, we assume the data source is good. Yet, it's a questionable usage. I got it by crawling an unpublicized API of the assessor's office. It clearly has large gaps (homes without recent sale prices). I have no explanation for these gaps. If they were random ommisions, that would not be so bad, but I suspect their introducing a bias to the dataset.

I also don't account for trend or seasonality. Because my dataset got down to be rather small compared to the total population of homes in LA, I used all the data. In reality, more recent data would be better (perhaps the last 3 years?). Additionally, home sales have seasonal components that I didn't account for. Thus, the nearest neigbors might have all sold during a more favorable time in the market, and are thus, not exactly comparable to the home I'm estimating.

I'm also only comparing on price. Euclydian distance (which has it's own problems stated above) is the only metric. A one bedroom condo might be compared to a 4 bedroom single family dettached home by this method, if it happens to be closest. This is clearly not ideal and will introduce a significant amount of variance into my results.

So where is the elbow? Well, it's arguable at k=3. One might say this is not actually an "elbow method" problem because this method is typically used for monotonically decreasing trade-offs. In this way the more common use in k-means clustering would have been a better example. For a discussion of the more ideal use cases, please listen to this episode.

Lastly, consider these show notes an demonstration of technology and technique over actual result. There are many things to be skeptical of in this analysis, but I learned a few things a long the way that will go into our upcoming home purchase decisions.

Let's dust off the old iris dataset and see how The Elbow Method performs in it's more typical use case of identifying the k for k-means clustering.

In [1]:

```
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
```

In [60]:

```
X = iris.data
rsqs = []
ks = range(2,20)
for k in ks:
km = KMeans(n_clusters=k)
fit = km.fit(X)
# Calculate the R-squared for the cluster, which is explained (within variance) over total variance
labels = fit.predict(X)
y = pd.DataFrame(labels, columns=['cluster'])
x = pd.DataFrame(X, columns=['x0','x1','x2','x3'])
df = pd.concat([x, y], axis=1)
centroids = df.groupby(['cluster']).mean()
centroids.columns = ['c0', 'c1', 'c2', 'c3']
mcentroid = df[['x0', 'x1', 'x2', 'x3']].mean()
df.set_index('cluster', inplace=True)
df = df.join(centroids, how='outer')
# sum of squared error within each cluster
df['sse'] = (df['x0']-df['c0']).apply(lambda x: math.pow(x, 2)) \
+ (df['x1']-df['c1']).apply(lambda x: math.pow(x, 2)) \
+ (df['x2']-df['c2']).apply(lambda x: math.pow(x, 2)) \
+ (df['x3']-df['c3']).apply(lambda x: math.pow(x, 2))
# sum of squared error total
df['m0'] = mcentroid['x0']
df['m1'] = mcentroid['x1']
df['m2'] = mcentroid['x2']
df['m3'] = mcentroid['x3']
df['tse'] = (df['x0']-df['m0']).apply(lambda x: math.pow(x, 2)) \
+ (df['x1']-df['m1']).apply(lambda x: math.pow(x, 2)) \
+ (df['x2']-df['m2']).apply(lambda x: math.pow(x, 2)) \
+ (df['x3']-df['m3']).apply(lambda x: math.pow(x, 2))
rsq = 1 - df['sse'].sum() / df['tse'].sum()
rsqs.append(rsq)
```

In [61]:

```
plt.plot(ks, rsqs)
plt.xlabel('Number of clusters')
plt.ylabel('R-Squared')
plt.xlim(1,20)
plt.show()
```

Enjoy this post? Sign up for our mailing list and don't miss any updates.