January 1, 2016

Kill the Word Cloud

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.


Transcript:

Welcome to the first episode of Data Skeptic for 2016, and happy new year everyone. As I mentioned in last week's episode, shows that are released on major holidays have notably lower downloads. I thought it would be a disservice to my upcoming guests to have them on for a day when I know the show will get less exposure. Yet, I also feel compelled to keep up my weekly format. So going forward, on major holidays (Thanksgiving, Christmas, New Year's, and Arbor Day), I'll be releasing special episodes.

A while ago, I did a special episode called Proposing Annoyance Mining which featured me alone, standing on a soap box, which garnered some positive feedback. So I've decided to do something similar for this occasion, on a day when I presume only the truest of Data Skeptic fans are listening. So first off, thank you, for continuing to join me week after week, and I hope this break in format keeps your attention.

In fact, I'm actually glad that only my truest fans might be listening now, because I'm going to ask you all to become conspirators with me. I'm going to propose a murder of sorts, that I hope you'll help me execute. In fact, it's more than a murder, I'm actually hoping for an extinction-level event. Data Skeptic listeners, it's a new year, and a time that we use, culturally, to turn over new leaves, try new things, and put bad habits behind us. So I implore you to join me in committing homicide this year: let's kill the word cloud.

Just in case any listeners aren't clear on what a word cloud is, I'll give a brief description. A word cloud is a visualization of text data that shows a jumble of words or phrases in various sizes where the font size of the element indicates its relative importance. Generally these are just a soup of terms in a roughly oval assemblage.

So why do people make word clouds? I think the answer is perhaps because it's easy. There exist simple, free tools to automate it, and admittedly, word clouds could be considered sort of pretty. But the truth is, they're really lazy. In my opinion, they convey the illusion of insight without providing any actual insight. I think they're one of the worst forms of data visualization, possibly even less useful than a 3D donut chart.

Data Skeptic hasn't gotten too deep into data visualization topics. I'd like to recommend everyone check out the Data Stories podcast which focuses on data viz. It's one of my favorite shows, and complements Data Skeptic nicely. But let's take a moment and discuss at a high level what data visualization is all about. Data viz is the process of creating a graphic representation of data to convey a finding or communicate information more clearly and efficiently than presenting the data alone.

Why would a data scientist create a data visualization? Broadly, I would say there are two reasons. First, to help themself explore the data. Histograms and scatter plots quickly convey information that tabular data does not. When doing data exploration, knowing how to rapidly explore your data visually is the key to moving quickly. Second, data visualization is done to share a finding or insight with others. Thus, it's fair to say that the purpose of data visualization is to convey information more clearly, quickly, or efficiently.

For a long time, I thought I understood what data viz was all about. Then I discovered Edward Tufte's The Visual Display of Quantitative Information, from which I discovered that my understanding of data viz had only just begun. So how would I apply the things I learned from Tufte in criticism of word clouds?

One of the first tenets put forward is "above all else, show the data". Ok, you could show all the words in a document in a word cloud. But is that the data? No. The data being displayed are the selection of words and phrases, but also a weight, presumably indicating their relative importance, expressed via font size.

There's a seminal work by Cleveland and McGill titled Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. I think it might not be terribly controversial to declare that this paper is to data viz what Newton's Principia is to physics. In it, the authors describe an experiment in which they present data in a variety of forms, ask readers to report the data they viewed, and measure the difference between the actual data and reported observations. You really should read this paper, but to summarize, the order from most to least effective they identified was to present your data using position, then length, angle and slope, area, volume, and lastly color. That's right, color is least effective of all. You hear that, heat maps? You get a pass for this episode but you walk a fine line with me.

So how is the data conveyed in a word cloud? Well, it's definitely not positional. There are two lengths present: height and width. Width is a product of the phrase and is therefore actually a bit misleading. Height correlates with but does not necessarily match the font size, so I suppose word clouds rely on length, in the form of the height. But let's review the others to be sure. Next comes area. Yes, I think area is also relevant, but due to the width issue, could do as much to mislead as it does to inform. This ambiguity highlights another flaw. The use of non-monospace fonts, typical kerning, and even typographical thicknesses all contribute to the illusion of weight that may or may not be appropriate, being a product of the label, not the data.
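
To make the width problem concrete, here is a minimal sketch in Python. It rests on a crude model that is an assumption, not something from the episode: every character is about 0.6 times as wide as the font size, so a word's bounding box is roughly 0.6 x characters x size wide and one font size tall. The `approx_area` helper, the words, and the font sizes are all hypothetical.

```python
# Crude model (an assumption): each character is ~0.6 * font_size wide,
# and the word's bounding box is one font_size tall.
CHAR_ASPECT = 0.6


def approx_area(word, font_size):
    """Approximate bounding-box area of a rendered word, in square points."""
    width = CHAR_ASPECT * len(word) * font_size
    height = font_size
    return width * height


# Hypothetical pair: the short word has the larger font size (higher weight),
# the long word has the smaller font size (lower weight).
big_short = approx_area("AI", 30)
small_long = approx_area("statistics", 20)

print("higher-weight short word:", big_short)
print("lower-weight long word:  ", small_long)
```

Under this model the lower-weight word covers more than twice the area of the higher-weight one, purely because it has more letters.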

But getting back to Tufte, he also suggests that one maximize the "data-ink ratio". For every pixel that isn't part of the background, determine if it's providing the viewer with data, or not. The number of data-conveying pixels divided by the total number of non-background pixels is the data-ink ratio, and one would like this to be as near 1.0 as possible.
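
The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is represented as two boolean masks, and deciding which pixels count as "data" is a manual judgment that the code does not make for you. The `data_ink_ratio` function and the tiny 4x4 image are hypothetical.

```python
import numpy as np


def data_ink_ratio(ink_mask, data_mask):
    """Fraction of non-background pixels that carry data.

    ink_mask:  boolean array, True where a pixel is not background.
    data_mask: boolean array, True where a pixel encodes data.
    """
    ink = np.asarray(ink_mask, dtype=bool)
    data = np.asarray(data_mask, dtype=bool) & ink  # data pixels must also be ink
    total_ink = ink.sum()
    if total_ink == 0:
        raise ValueError("image contains no non-background pixels")
    return data.sum() / total_ink


# Hypothetical 4x4 image: a decorative border row plus a 2x2 bar of data.
ink = np.zeros((4, 4), dtype=bool)
ink[0, :] = True       # top border: 4 pixels of non-data ink
ink[2:4, 1:3] = True   # 2x2 bar: 4 pixels of data ink

data = np.zeros((4, 4), dtype=bool)
data[2:4, 1:3] = True

ratio = data_ink_ratio(ink, data)
print(ratio)
```

Here half of the ink is the border, so erasing the border would move the ratio from 0.5 to 1.0.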

There are some brilliant breakdowns in The Visual Display of Quantitative Information in which Tufte strips away the "chart junk" - elements that convey no data and often times only serve to distract, distort, or damage the data.

Tufte goes on to suggest one erase non-data ink and redundant data ink. That box around the legend? Useless, get rid of it. The legend itself, can we get rid of it? Maybe we could take those text labels and position them near the data they describe, removing the need for the legend altogether. Perhaps strategic positioning could further pull the reader's eye towards a specific aspect of a series you wish them to take notice of.

How does a word cloud score for the data-ink ratio? Hmm, I guess by this metric, it's fine. Every pixel does convey information. However, I argue that while there might not be much non-data ink, there's a tremendous amount of redundant data ink. Every word appearing in large type does seem wasteful. But I suppose the truest criticisms aren't necessarily rooted in Tufte's best practices. Instead, let's ask how effectively the reader of the word cloud is able to read and retain the underlying data.

For a moment, consider a photograph of a fulcrum weighing scales. That's the kind where two buckets or trays hang from a beam with a pole in the center of the beam. The image we often associate with courts. Picture on the left scale sits Yoshi, the lilac crowned amazon parrot and official Data Skeptic mascot. Picture on the right scale, a stack of apples, just enough to balance it. What data does this contain? It contains the weight of Yoshi as measured in units of apples. This is admittedly a unit no metrologist would approve of due to a lack of standard. It's also an unnecessarily complex way of sharing a single rational number. Yet, stay with me for the sake of argument.

Now, picture a similar setup, but instead of apples to balance the weight, there are a few handfuls of gravel and woodchips. This visualization is notably inferior. The reader can count apples and likely has an intuitive and approximate sense of the weight of an apple. The gravel and woodchips are not easily countable nor is their weight intuitive to most readers.

Someone interested in data visualization should familiarize themself with the Weber-Fechner Law. It describes the relationship between the magnitude of a stimulus and the amount it must change for a typical human to notice it. In other words, if you have two dumbbells, one weighing 10 kilograms and the other weighing 10 + X kilograms, how big does X have to be before you can distinguish which dumbbell is heavier?

This law states that the just noticeable difference between two stimuli is proportional to the magnitude itself. Thus, if the X in our scenario is 0.1 kilograms, then the second weight has to be at least 10.1 kg before one can notice the difference, and proportionally, we would expect to need a 0.2 kg change to notice the difference in a 20 kg weight.
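
The proportionality in the dumbbell example can be written out as a short Python sketch. The function name is hypothetical, and the constant (often called the Weber fraction) is simply derived from the 0.1 kg at 10 kg figure used above, not from any measurement.

```python
def just_noticeable_difference(magnitude, weber_fraction):
    """Smallest change a typical person would notice at a given magnitude,
    assuming the difference is proportional to the magnitude."""
    return weber_fraction * magnitude


# From the dumbbell example: a 0.1 kg difference is noticeable at 10 kg.
weber_fraction = 0.1 / 10.0

jnd_10 = just_noticeable_difference(10.0, weber_fraction)
jnd_20 = just_noticeable_difference(20.0, weber_fraction)

print("JND at 10 kg:", jnd_10)
print("JND at 20 kg:", jnd_20)
```

Doubling the weight doubles the change needed to notice a difference, which is where the 0.2 kg figure for a 20 kg dumbbell comes from.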

I am unaware of any empirical measurement of the just noticeable difference for font size. If such a study exists, please point me in the right direction! If you conduct such a study, this is an open invitation to be a guest on Data Skeptic to present your findings. For fonts, in their typical unit of points, the general range of sizes seems to be about 8 points to 72 points. Based on completely anecdotal findings in a non-blinded study of n=1 (i.e. testing on myself), I found I could not reliably distinguish between text of 68 point and 72 point type, implying that for 68 point type, the just noticeable difference is at least a 5 point change. Your mileage may vary.

Perhaps more significantly, the random positioning and rotation of values in a word cloud confound the problem of visual comparison, likely increasing the just noticeable difference and therefore reducing the practical resolution of the visualization due to the Weber-Fechner Law.
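
One way to put a rough number on "practical resolution" is to count how many just noticeable steps fit between the smallest and largest font size. The Python sketch below assumes that the anecdotal 5 point difference at 68 points holds as a constant fraction across the whole 8 to 72 point range, which is an extrapolation and not a measured result. The `distinguishable_steps` helper is hypothetical.

```python
import math


def distinguishable_steps(min_size, max_size, weber_fraction):
    """Count how many just noticeable steps fit between two sizes.

    Each step multiplies the size by (1 + weber_fraction), so the count is
    the number of times that factor fits into max_size / min_size.
    """
    ratio = max_size / min_size
    return math.floor(math.log(ratio) / math.log(1.0 + weber_fraction))


# Assumption: a 5 point difference at 68 points, applied proportionally.
weber_fraction = 5.0 / 68.0

steps = distinguishable_steps(8.0, 72.0, weber_fraction)
print("distinguishable font-size steps between 8pt and 72pt:", steps)
```

Under these assumptions the whole font-size range offers roughly 30 distinguishable levels, and that is before random placement and rotation make comparison harder.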

A good data visualization is not a picture, it is a sentence written in the language of data. We can therefore empirically compare two competing visualizations, asking readers to report the exact values they observe in the image. We can measure the precision and variance of their responses and quantitatively label a winner. And in my mind, the data one could render in a word cloud would be better served in just about any other honest presentation of the data by these metrics.
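
The comparison described above could be scored along these lines. This Python sketch is an assumption about how such an experiment might be tallied, not a description of any study: the `score_responses` function is hypothetical, and the reader responses are made-up numbers that only show the shape of the computation.

```python
import numpy as np


def score_responses(true_values, reported_values):
    """Return (mean absolute error, variance of errors) for reader responses.

    reported_values has one row per reader and one column per data point.
    """
    truth = np.asarray(true_values, dtype=float)
    errors = np.asarray(reported_values, dtype=float) - truth
    return np.abs(errors).mean(), errors.var()


# Made-up numbers for illustration only; these are NOT measurements.
true_values = [1.0, 0.8, 0.5]
reports_a = [[1.0, 0.8, 0.5], [1.0, 0.75, 0.5]]  # readers of visualization A
reports_b = [[1.0, 0.6, 0.3], [0.9, 0.9, 0.7]]   # readers of visualization B

a_mae, a_var = score_responses(true_values, reports_a)
b_mae, b_var = score_responses(true_values, reports_b)

print("A: mean abs error", a_mae, "variance", a_var)
print("B: mean abs error", b_mae, "variance", b_var)
```

Whichever visualization yields the lower error and variance on real reader responses would be the quantitative winner.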

Now, I'm open to one counterargument for the word cloud. Art. If you want to claim that the arrangement of words of various sizes and colored in various ways has artistic merit, I can be ambivalent. I may not understand or appreciate it as art, but despite Linhda and I being LACMA members, there's plenty of art I don't understand.

To me, the word cloud fails to achieve the desirable qualities of good data visualization I mentioned previously. But what can be done with the data? What's a more ideal way of representing this data? Let's consider what we have. We're given the output of some algorithm which returns a select list of words or phrases, and a weight of some kind on each value, let's presume normalized between 0 and 1. How did we arrive at this data? No idea. How do we interpret the weights? Ambiguous. This is a red flag for me, in general. If I don't know the precise meaning and provenance of data, I'm skeptical. But let's set that aside and talk about how to visualize the phrases and their weights.

I propose a simple histogram bar plot [as pointed out by a commenter, what I am describing is not actually a histogram]. We have phrases that the reader should read, so let's make it a horizontal bar chart, so the text is more natural to read. Let's order them descending so the visualization includes implicit ordinal comparisons, prioritizes the most noteworthy phrases for first reading, and conveys a sense of the underlying distribution of weight over phrases. Will the important phrases be normally distributed? Log normally distributed? With the histogram bar plot approach, the distribution is shown to the reader, but with the word cloud it's unnecessarily obscured.

Recognizing our histogram bar plot presentation might grow a bit too vertical, perhaps we chop off a long tail and report "everything else" as a final bar, preferably in a different shade or color to highlight its heterogeneous nature.

This simple visualization embeds the same textual data, a cleaner read of the associated weights, an ordinal dimension, a sense of the distribution, and a better description of the cardinality. All in all, I'm comfortably calling it a superior presentation in every possible way to a word cloud. So would you please consider this, or frankly, anything else, the next time you're presenting data you might have otherwise made into a word cloud? I've put some code samples in the show notes describing what I did above. If you'd like to use them, please feel free.

So in conclusion, what are my primary objections to word clouds? They convey much less data than a more straightforward presentation would. I believe they unnecessarily obfuscate the data by leveraging font size and random, meaningless positional data. Most importantly, they're difficult to read and convey the illusion of insight.

Ok Data Skeptic listeners. It's 2016. There are better ways. Say it with me now, everyone, out loud. Whether you're alone in your car, out hiking in the woods, or on a crowded subway car. Ok, well, maybe don't say it with me if you're in the subway car or a generally public place. But everyone else, all at once, shout it out with me: It's 2016, let's kill the word cloud.

Thanks for joining me listeners new and old. I'm extremely excited for what we've got coming up on the show this year. It's an election year in the United States, and we're going to cover topics related to the data of elections. We're going to explore an event that happened recently which sent my wife to the ER, and the data surrounding it. And for those of you that are counting, we'll hit episode 2^7. In episode 2^6 I attempted to calculate the probability, however small, that bigfoot might be real. Our next Data Skeptic investigation is already underway. Beyond that you can expect the same high caliber guests and mini-episodes in your podcast feed every week and perhaps even a few video segments and live events.

And if you've got feedback on the "Kill the Word Cloud" initiative, I invite everyone to share comments and criticisms in the discussion forum for this episode found at dataskeptic.com. Our regular programming resumes next week with a mini episode on gradient descent. Until then, this is Kyle Polich for dataskeptic.com reminding everyone to keep thinking skeptically of and with data throughout all of 2016 and beyond.

In [1]:
import random
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Create some data to visualize

tokens = [
    'data skeptic', 'machine learning', 'statistics', 'optimization', 'artificial intelligence',
    'podcast', 'interviews', 'mini-episodes', 'tutorials', 'expert perspectives',
    'logistic regression', 'random forest', 'support vector machines', 'deep learning', 'clustering'
]
weights = []
w = 1.0
for i in range(len(tokens)):
    discount = 1.0 - random.random()/15
    w *= math.pow(discount, i)
    weights.append(w)

df = pd.DataFrame({'token': tokens, 'weight': weights})
In [3]:
# Trim the list to a maximum

max_tokens = 10
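# DataFrame.sort and .ix were removed from pandas; use sort_values and .iloc/.loc
df = df.sort_values('weight', ascending=False).reset_index(drop=True)
df2 = df.copy()  # used as-is when there is nothing to trim
if df.shape[0] > max_tokens:
    # Collapse everything past the cutoff into a single "other terms" row
    tail = df['weight'].iloc[max_tokens:].sum()
    df2 = df.iloc[:max_tokens].copy()
    nrow = pd.Series({'token': '--[other terms]--', 'weight': tail})
    df2.loc[df2.shape[0]] = nrow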
In [4]:
# Review the data frame to display
df2
Out[4]:
                      token    weight
0              data skeptic  1.000000
1          machine learning  0.960604
2                statistics  0.900030
3              optimization  0.828250
4   artificial intelligence  0.729493
5                   podcast  0.623012
6                interviews  0.543877
7             mini-episodes  0.447382
8                 tutorials  0.277961
9       expert perspectives  0.150091
10        --[other terms]--  0.397052
In [5]:
# Let's plot something better than a word cloud!

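plt.figure(figsize=(10, 5))
# Dark bars for the listed phrases, a contrasting color for the aggregated tail
colors = ['#383838'] * (df2.shape[0] - 1) + ['#ebe728']
# Negate the positions so the largest weight is drawn at the top
positions = -1 * np.arange(df2.shape[0])
plt.barh(positions, df2['weight'], color=colors, align='center')
plt.gca().yaxis.grid(False)
# Bars are centered on their positions, so the labels need no offset
plt.yticks(positions, df2['token'])
plt.ylim(-1 * df2.shape[0] + 0.4, 0.6)
plt.xlabel('phrase weight')
plt.show()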