--:--

--:--

A follow up discussion of this result appears in episode #74: Shakespeare, Abiogenesis, and Exoplanets. I highly recommend you check that out in addition to or in place of this episode.

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers.

Many readers will be vaguely familar with the idea of “X number of monkeys banging on Y number of typewriters for Z number of years” - the idea being that such a setup would produce random sequences of letters. The origin of this idea was the mathemetician Borel who was interested in whether or not 1,000,000 monkeys working for 10 hours per day might eventually reproduce the works of shakespeare.

Let us define \(SH\) to be the “Shakespeare Hypothesis” - the assertion that our monkeys might coincidentally reproduced the complete works of shakespeare at some point. Let us further assume that the works appear in sequence in the order they were completed, lest we introduce an additional combinatorical problem of how likely the monkeys are to produce a the works in any order.

There are 884,421 total works in the complete works of shakespeare, and I’ll estimate that on average, words have 5.1 of the 26 characters of the alphabet. Thus…

\(Pr(SH) = \dfrac{1}{26^{884421 (5.1)}} = \dfrac{1}{26^{430547.1}} = \dfrac{1}{10^{10^{3.788584}}}\)

Let us first assume trained monkeys will output random keystrokes at a rake of 80 per minute. This is on the higher end of typical typists, although they’re trying to type deliberately which is probably slower than random, so this is probably conservative.

Multiply that by 5.1 characters per word, 60 minutes per hour, 10 hours per day, and 1 million monkeys gives us…

\(CharactersPerDay = 80 \cdot 120 \cdot 5.1 \cdot 60 \cdot 10 \cdot 1000000 = 2.9376 × 10^{13}\)

Wow, that’s alot of random text! You could fit the complete works into that \(3.3215 x 10^7\) times per day if you typed it out perfectly. However, the monkeys will not deliberately type it, we’re wondering if they might coincidentally do it. As we showed above (when defining \(Pr(SH)\)), we’re going to have to be exceptionally lucky, so we’re going to need alot of failed attempts.

In other words, if the entire works of Shakespeare were “abcde” (5 characters long) and the monkeys banged out “weiojskabfabcdadfjabcdeadfff” (28 characters long), there are 23 possible sequences of length 5 - one of which does happen to be “abcde” in my convoluted example.

So if we assume that work from previous days carries over in one long sequential stream of gibberish, then after the first 884421th character is recorded, every subsequence starting from the beginning is fair game to match the complete works.

I’ll now make an important assumption: monkeys banging on typewriters output each letter with equal probability. In other words, they don’t write “B” more than \(\frac{1}{26}\) of the time in show of their love of bananas. For simplicity, I’ll also assume we’re not worried about matching the case of the letters.

Given this important assumption, we can now express the problem using
the Binomial probability distribution where \(p=Pr(SH)=26 ^{-430547.1}\). I’m going to
approache the problem with a step that non-statisticians might find
counter intuitive. I’m going to ask the question “what’s is the
probability of __not__ generating the entire works of shakespeare
given enough monkey banging?”. This is easier to compute than summing
the probability they do it one, twice, thrice, etc.

Frustratingly, these extreme valued numbers cannot be handled by most statistical software and R (the language the show notes are written in) is no different. So, let’s go back to basics and see about computing this by hand. The binomial distriution has this form:

\(Pr(X=k) = {n \choose k} p^k \big(1 -p\big)^{(n-k)}\)

For our example, \(k=0\), \(n\) is the number of trials (which we will want to play with), and \(p\) is the probability of success on a single trial. Thus, we want to know:

$1 - Pr(X=0) = 1 - {n } p^0 (1 -p )^{(n-0)} \ = 1 - (1 - 26 ^{-430547.1} )^{n} \ = 1 - ( )^{n} $

An exact solution is still ellusive, but we can see that it is some very tiny value \(\epsilon\). Thus, if we are interested in finding the \(n\) required to generate Shakespeare’s works with \(q=0.5\) probability, then…

\(log_{\epsilon}(0.5) = n\)

So even, if we don’t have an exact solution, let’s get something with pretty good precision, perhaps \(\epsilon=0.999999999999\). We’ll then divide by the Characters Per Day previous calculated to determine the likelihood of success in a single day.

So given this overly generous \(\epsilon\), there’s a % chance that one million monkeys on a million typewriters would reproduce the works of shakespeare in a single day.