As announced in the recent update posted to the podcast feed, The Data Skeptic Podcast is launching a contest: not one of chance, but one of skill. Listeners are encouraged to put their data science skills to good use or, if all else fails, guess!

The contest works as follows. Below is some data about the cumulative number of downloads the podcast has achieved on a few given dates. Your job is to predict the date and time at which the podcast will receive download number 27,182. Why this arbitrary number? It’s as good as any other arbitrary number!

Use whatever means you want to formulate a prediction. Once you have it, wait until that time and then post a review of the Data Skeptic Podcast on iTunes. You don’t even have to leave a good review! The review posted closest to the actual time at which this download occurs will win a free copy of Matthew Russell’s “Mining the Social Web” courtesy of the Data Skeptic Podcast. “Price is Right” rules are in play: the winner is the person who posts their review closest to the actual time without going over.

The plot below reveals the four data points you are being given to help inform your prediction.

What type of curve do you think best fits this data? Linear?
Exponential? Logarithmic? Let’s try a linear fit as an example.
The plot above expresses the data with time on the x axis. Since we’re
actually trying to *predict* the time for a known number of
downloads (y axis), let’s invert this chart for the next few steps.
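As a rough illustration of the inverted linear fit, here is a minimal Python sketch. The observations in it are hypothetical placeholders, not the contest data (the four real points live in the plot), and the helper names `fit_line` and `predict_linear` are made up for this example. It regresses elapsed days on cumulative downloads, then reads off the time for download 27,182.

```python
from datetime import datetime, timedelta

# HYPOTHETICAL placeholder observations, NOT the contest data.
# Replace these with the four (date, cumulative downloads) points from the plot.
observations = [
    (datetime(2000, 1, 1), 1000),
    (datetime(2000, 2, 1), 2500),
    (datetime(2000, 3, 1), 4200),
    (datetime(2000, 4, 1), 6400),
]

TARGET = 27182  # the download number named in the contest


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_linear(observations, target):
    """Predict when `target` downloads is reached, assuming linear growth."""
    origin = observations[0][0]
    days = [(when - origin).total_seconds() / 86400 for when, _ in observations]
    downloads = [float(count) for _, count in observations]
    # Inverted axes: downloads on x, elapsed days on y.
    slope, intercept = fit_line(downloads, days)
    return origin + timedelta(days=slope * target + intercept)


print(predict_linear(observations, TARGET))
```

With only four points, a least squares line like this is easy to compute but says little about whether linear growth is the right assumption.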

If this linear fit is correct, it suggests the winning time to post a review is . But might some other curve fit better? To the eye, the data looks like it has some exponential growth. So perhaps we should take the logarithm of the number of downloads, so that the proposed exponential growth is transformed into a linear relationship.

By this analysis, we find the goal reached at . Not true, that date has already passed! What’s gone wrong here? Should you give up or try a different line of inquiry? This model is clearly a bad one, but it does teach us something important. The growth rate is bounded above by this log-linear regression, so a good model will be more conservative than this.
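The log transform can be sketched the same way. The observations below are again hypothetical placeholders rather than the contest data, and `predict_log_linear` is an illustrative name. The sketch fits elapsed days against the natural logarithm of cumulative downloads, which is a straight line if growth really is exponential.

```python
import math
from datetime import datetime, timedelta

# HYPOTHETICAL placeholder observations, NOT the contest data.
observations = [
    (datetime(2000, 1, 1), 1000),
    (datetime(2000, 2, 1), 2500),
    (datetime(2000, 3, 1), 4200),
    (datetime(2000, 4, 1), 6400),
]

TARGET = 27182  # the download number named in the contest


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_log_linear(observations, target):
    """Predict when `target` downloads is reached, assuming exponential growth."""
    origin = observations[0][0]
    days = [(when - origin).total_seconds() / 86400 for when, _ in observations]
    log_downloads = [math.log(count) for _, count in observations]
    # Exponential growth in time is a straight line in log(downloads).
    slope, intercept = fit_line(log_downloads, days)
    return origin + timedelta(days=slope * math.log(target) + intercept)


print(predict_log_linear(observations, TARGET))
```

Because an exponential curve eventually outruns any line fitted to the same points, this model reaches the target sooner than the linear one whenever the target lies well beyond the observed data, which matches its role here as an optimistic bound.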

On the podcast I suggested perhaps exploring another type of approach. The total number of downloads could be decomposed into two separate phenomena. First, there is always a spike in downloads on the day a new episode is released. Second, there is a “background noise” of daily downloads that are likely generated by new listeners, casual dabblers, and subscribers exploring the back catalog. Might an equation like this work?

\(D(t) = m_b \cdot t + m_r \cdot \Big\lfloor \frac{t}{7} \Big\rfloor + b\)

where \(D\) is the total downloads, \(t\) is the day, \(m_b\) is the rate of background downloads, \(m_r\) is the rate of downloads on release days, \(\lfloor t/7 \rfloor\) counts the weekly releases so far, and \(b\) is the y-intercept.
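To see how this equation would yield a prediction, here is a minimal Python sketch that evaluates \(D(t)\) and finds the earliest \(t\) at which it reaches a target. The parameter values are hypothetical, chosen only for illustration and not fitted to the contest data, and the function names are made up for this example. It assumes \(m_b > 0\) and \(m_r \ge 0\), so that \(D\) never decreases.

```python
import math


def downloads(t, m_b, m_r, b):
    """D(t) = m_b * t + m_r * floor(t / 7) + b."""
    return m_b * t + m_r * math.floor(t / 7) + b


def first_day_reaching(target, m_b, m_r, b):
    """Earliest t (in days) with D(t) >= target.

    Assumes m_b > 0 and m_r >= 0, so D(t) is non-decreasing.
    """
    if m_b <= 0 or m_r < 0:
        raise ValueError("need m_b > 0 and m_r >= 0")
    week = 0
    while True:
        week_start = 7 * week
        # The release-day jump at the start of the week may cross the target.
        if downloads(week_start, m_b, m_r, b) >= target:
            return float(week_start)
        # Otherwise D is linear within the week, so solve for t directly.
        t = (target - m_r * week - b) / m_b
        if t < week_start + 7:
            return t
        week += 1


# HYPOTHETICAL parameters, not fitted to the contest data.
M_B, M_R, B = 10.0, 100.0, 0.0
print(first_day_reaching(27182, M_B, M_R, B))
```

Estimating \(m_b\), \(m_r\), and \(b\) from the four published points is a separate step, for example a least squares fit over the two regressors \(t\) and \(\lfloor t/7 \rfloor\).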