
In [48]:

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
import requests
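try:
    import StringIO  # Python 2 module
except ImportError:
    import io as StringIO  # Python 3: the StringIO class lives in io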
```

Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding whether or not Yoshi the parrot will like a new chew toy. A few other everyday examples help us examine why entropy is a nice metric for constructing a decision tree.

If we consider the example of a coin toss that comes up heads with probability $p$, then we have

$$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

Thus...

In [41]:

```
prob_h = 1.0 * np.arange(200) / 200
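# Binary entropy in bits. Treat 0 * log2(0) as 0 so that p = 0 does not produce NaN.
safe_p = np.where(prob_h > 0, prob_h, 1.0)  # log2(1) = 0, so the p = 0 term vanishes
h_x = -prob_h * np.log2(safe_p) - (1 - prob_h) * np.log2(1 - prob_h)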
plt.plot(prob_h, h_x)
plt.xlabel("Probability of Heads")
plt.ylabel("Entropy")
plt.show()
```

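It's interesting to consider, instead, a six-sided die.

Let $X$ be the outcome of a fair die, so that each face has probability $1/6$. Then

$$H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$

In this way, we can see that there is more entropy in a fair die than in a fair coin (1 bit), because we require more bits of information to describe the outcome.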
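The comparison between a coin and a die can be checked numerically. The sketch below uses a hypothetical helper, `entropy_bits`, which is not part of the original notebook, and assumes Python 3 for `math.log2`.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (hypothetical helper)."""
    # Outcomes with probability 0 are skipped, following the convention 0 * log(0) = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1.0 / 6] * 6

print(entropy_bits(fair_coin))  # 1.0
print(entropy_bits(fair_die))   # about 2.585, which is log2(6)
```

A uniform distribution over more outcomes always has higher entropy, which is why the die needs more bits than the coin.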

In [43]:

```
r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data')
```

In [62]:

```
# r.text is the decoded response body (r.content is raw bytes)
handle = StringIO.StringIO(r.text)
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'car name']
df = pd.read_fwf(handle, colspecs='infer', names=names)
```

In [63]:

```
df.head()
```

Out[63]:

Next, we create a label column, `fuel_economy`, which considers anything above 25 miles per gallon good, and everything else bad.

In [67]:

```
df['fuel_economy'] = df['mpg'].apply(lambda x: 'good' if x > 25 else 'bad')
```

Let's first calculate the entropy of this metric before considering any of the features.

In [90]:

```
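# Count the cars in each fuel_economy class and convert the counts to proportions
agg = df.groupby(['fuel_economy']).agg({'mpg': ['count']})
norm = agg / agg.sum()
# math.log with one argument is the natural logarithm, so this entropy is in nats
entropy_initial = norm['mpg']['count'].apply(lambda x: -x * math.log(x)).sum()
print(entropy_initial)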
```

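We can calculate the initial entropy to be 0.67177. This value uses the natural logarithm, so it is measured in nats; in bits (log base 2, the unit used in the rest of this post) it is about 0.969.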
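The base of the logarithm only sets the unit of entropy: the natural log gives nats and base 2 gives bits. A minimal sketch of the conversion follows, using a hypothetical `entropy` helper and made-up class proportions rather than the ones from the dataset.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution in the given log base (hypothetical helper)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

probs = [0.6, 0.4]  # illustrative proportions, not taken from the auto-mpg data

h_nats = entropy(probs, base=math.e)
h_bits = entropy(probs, base=2)

# Dividing an entropy in nats by ln(2) expresses it in bits
print(h_nats, h_bits, h_nats / math.log(2))
```

Entropies are only comparable when they share a base, so it helps to pick one unit before comparing a split against the unsplit data.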

Now, let's consider the entropy of the dataset if we split it based on the number of cylinders in the car (which can be 3, 4, 5, 6, or 8).

In [210]:

```
# Count cars for every (cylinders, fuel_economy) combination
agg = df.groupby(['cylinders', 'fuel_economy']).agg({'mpg': ['count']})
agg.reset_index(inplace=True)
cylinders = []
entropies = []
for c in set(df.cylinders):
    # Class proportions within the subset of cars that have c cylinders
    aggc = agg[agg.cylinders == c].copy()
    aggc.reset_index(inplace=True)
    aggc['p'] = 1.0 * aggc['mpg']['count'] / aggc['mpg']['count'].sum()
    # Entropy of the subset in bits (log base 2)
    entropy = sum(-1 * (x * math.log(x, 2)) for x in aggc.p)
    cylinders.append(c)
    entropies.append(entropy)
result = pd.DataFrame({'cylinders': cylinders, 'entropy': entropies})
result.sort_values(by='cylinders', inplace=True)
result.set_index('cylinders', inplace=True)
# Number of observations per cylinder count, used to weight each subset's entropy
observations = pd.DataFrame(df.groupby(['cylinders'])['mpg'].count())
observations.columns = ['count']
df2 = pd.concat([result, observations], axis=1)
df2['freq'] = 1.0 * df2['count'] / df2['count'].sum()
df2['weighted_entropy'] = df2['freq'] * df2['entropy']
df2
```

Out[210]:

The values above are the entropies of the resulting subsets when you split on cylinders. Notice that the entropy for 3-cylinder cars is zero. This is because there are only four observations in the dataset with a 3-cylinder engine, and none of them gets more than 25 miles per gallon. If you know the cylinder count to be 3, you have no uncertainty.

You will notice that for each subgroup, the entropy given the cylinder count is less than the entropy before we did the split. This isn't to say that cylinders is the best feature to split on; it is just a demonstration of why entropy is useful in creating a tree-based model.
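One common way to turn this into a single number per feature is information gain: the entropy before the split minus the frequency-weighted entropy of the subsets after it. The sketch below assumes that definition and runs on a small made-up dataset, not the auto-mpg data. The helper names `entropy_bits` and `information_gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(labels):
    """Entropy in bits of a list of class labels (hypothetical helper)."""
    total = float(len(labels))
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Parent entropy minus the weighted entropy of the subsets split on `feature`."""
    parent = entropy_bits([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    # Weight each subset's entropy by the fraction of rows that fall into it
    weighted = sum(len(g) / float(len(rows)) * entropy_bits(g) for g in groups.values())
    return parent - weighted

# Made-up toy data for illustration only
toy_rows = [
    {'cylinders': c, 'fuel_economy': f}
    for c, f in [(4, 'good'), (4, 'good'), (4, 'good'), (4, 'bad'),
                 (8, 'bad'), (8, 'bad'), (8, 'bad'), (8, 'bad')]
]

gain = information_gain(toy_rows, 'cylinders', 'fuel_economy')
print(gain)  # about 0.549 bits
```

A tree builder that uses this criterion would compute the gain for every candidate feature and split on the one with the largest value.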
