December 30, 2016
The Library Problem
We close out 2016 with a discussion of a basic interview question you might be asked when applying for a data science job. Specifically, how a library might build a model to predict whether a book will be returned late.
December 23, 2016
2016 Holiday Special
Today's episode is a reading of Isaac Asimov's Franchise. As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement. Enjoy, and happy holidays!
December 16, 2016
[MINI] Entropy
Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding whether or not Yoshi the parrot will like a new chew toy. A few other everyday examples help us examine why entropy is a nice metric for constructing a decision tree.
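As a rough illustration of the idea, here is a minimal Python sketch (the probability for Yoshi's toy preferences is invented) showing how predictability lowers Shannon entropy:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: high when outcomes are unpredictable."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: 1 bit, maximally unpredictable
print(entropy([0.9, 0.1]))  # Yoshi likes 90% of toys: ~0.47 bits
print(entropy([1.0]))       # a certain outcome: 0 bits of entropy
```

A decision tree exploits exactly this: it prefers splits whose answers most reduce the entropy of the outcome.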
December 9, 2016
MS Connect Conference
Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQL Server, SQL Server on Linux, open source, and a few other cloud topics.
December 2, 2016
Causal Impact
Today's episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on apps, and we also chat with Karen Blakemore about a project she helped us build to explore the impact of a Saturday Night Live appearance on a musician's career.
November 25, 2016
[MINI] The Bootstrap
The bootstrap is a method of resampling a dataset with replacement to estimate the uncertainty of statistics computed from it. The bootstrap is a useful statistical technique and is leveraged in bagging (bootstrap aggregation) algorithms such as random forest. We discuss this technique as it relates to polling and surveys.
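A minimal sketch of the idea in Python, using a synthetic poll rather than real survey data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical poll: 1 = supports the candidate, 0 = does not.
poll = rng.binomial(1, 0.52, size=500)

# Resample the poll with replacement many times and recompute the mean.
boot_means = [rng.choice(poll, size=len(poll), replace=True).mean()
              for _ in range(10_000)]

# The spread of the bootstrap means estimates the sampling uncertainty.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {poll.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```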
November 18, 2016
[MINI] Gini Coefficients
The Gini coefficient (as it relates to decision trees) is one approach to choosing the optimal split to introduce when growing a decision tree. To pick the right feature to split on, it considers the frequency of the values of that feature and how well those values correlate with the outcomes you are trying to predict.
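Here is a small, self-contained Python sketch of Gini impurity as a split criterion (the labels are invented for illustration):

```python
def gini_impurity(labels):
    """1 minus the sum of squared class frequencies; 0 means a pure node."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_of_split(left, right):
    """Weighted average impurity of the two child nodes a split creates."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + \
           (len(right) / n) * gini_impurity(right)

# A split that separates the outcomes well scores lower than a useless one.
print(gini_of_split(["yes"] * 8, ["no"] * 8))               # 0.0 (perfect)
print(gini_of_split(["yes", "no"] * 4, ["no", "yes"] * 4))  # 0.5 (useless)
```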
November 11, 2016
Unstructured Data for Finance
Financial analysis techniques for studying numeric, well-structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions.
November 4, 2016
[MINI] AdaBoost
AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) might benefit from this technique.
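Since real restaurant data isn't available here, a hedged sketch using scikit-learn's AdaBoostClassifier on synthetic data shows the general shape of the technique:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for restaurant data (features might encode location,
# price point, inspection scores, and so on).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# AdaBoost builds an ensemble of weak learners (scikit-learn's default is a
# depth-1 decision stump), reweighting the training examples so that each
# new learner focuses on the cases its predecessors got wrong.
model = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())
```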
October 28, 2016
Stealing Models from the Cloud
Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered?
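A toy sketch of the attack's general shape, with a local scikit-learn model standing in for the cloud API (everything here is synthetic and illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Pretend this model lives behind a prediction API we can only query.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
cloud_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# The "attacker" sends synthetic queries and records the API's answers...
rng = np.random.default_rng(1)
queries = rng.normal(size=(5000, 5))
answers = cloud_model.predict(queries)

# ...then trains a local surrogate on the (query, answer) pairs.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0)
surrogate.fit(queries, answers)

# Agreement on fresh inputs measures how well the model was extracted.
test = rng.normal(size=(1000, 5))
print((surrogate.predict(test) == cloud_model.predict(test)).mean())
```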
October 21, 2016
[MINI] Calculating Feature Importance
For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.
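A minimal sketch of the permutation variant of this idea, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = model.score(X_test, y_test)

# Permutation importance: scramble one feature at a time and measure how
# much held-out accuracy drops; big drops indicate important features.
rng = np.random.default_rng(0)
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drop = baseline - model.score(X_perm, y_test)
    print(f"feature {j}: accuracy drop = {drop:.3f}")
```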
October 14, 2016
NYC Bike Share Rebalancing
As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to rebalance bike sharing systems.
October 7, 2016
[MINI] Random Forest
Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
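A minimal usage sketch with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree sees a bootstrap sample of the rows, and each split considers
# only a random subset of the features -- bagging in both dimensions.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```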
September 30, 2016
Election Predictions
Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest, a competition aimed at forecasting the results of the upcoming US presidential election. More details are available in Jo's blog post.
September 23, 2016
[MINI] F1 Score
The F1 score is a model diagnostic that combines precision and recall to provide a single metric for model comparison. In this episode we discuss how it applies to selecting an interior designer.
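The formula itself is compact; a small sketch with made-up precision and recall numbers:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A designer whose proposals are always tasteful (high precision) but who
# surfaces few of the options you'd have liked (low recall) scores poorly,
# as does the reverse; F1 rewards balancing the two.
print(f1_score(precision=0.9, recall=0.3))  # ~0.45
print(f1_score(precision=0.7, recall=0.7))  # 0.7
```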
September 16, 2016
Urban Congestion
Urban congestion affects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore how different pricing mechanisms affect congestion, as well as how data visualization can inform choices.
September 9, 2016
[MINI] Heteroskedasticity
Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range. For example, the variance in the length of a cat's tail almost certainly changes (grows) with age. On the other hand, the average amount of chewing gum a person consumes probably has a consistent variance over a wide range of human heights.
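A small simulation (with entirely invented numbers for cats' tails) makes the pattern visible:

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=5000)  # invented cat ages, in years

# Heteroskedastic by construction: the noise around the trend grows with age.
tail = 20 + 1.5 * age + rng.normal(0, 0.5 + 0.4 * age)

# The residual spread is visibly different for young and old cats.
residuals = tail - (20 + 1.5 * age)
print(f"std of residuals, age < 5:  {residuals[age < 5].std():.2f}")
print(f"std of residuals, age > 10: {residuals[age > 10].std():.2f}")
```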
September 2, 2016
Music21
Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today.
August 26, 2016
[MINI] Paxos
Paxos is a protocol for arriving at consensus in a distributed computing system which accounts for the unreliability of the nodes. We discuss how this might be used in the real world in the event of a massive disaster.
August 19, 2016
Trusting Machine Learning Models with LIME
Machine learning models are often criticized for being black boxes. If a human cannot determine why the model arrives at the decision it made, there's good cause for skepticism. Classic inspection approaches to model interpretability are only useful for simple models, which are likely to only cover simple problems.
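LIME's core idea can be sketched from scratch: perturb an instance, query the black box, and fit a simple weighted surrogate locally. The sketch below is a simplified homage to that idea, not the lime library's actual API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# A black-box model we want to explain locally.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def explain_locally(instance, predict_proba, n_samples=5000, width=1.0):
    """Fit a weighted linear surrogate around one instance, LIME-style."""
    rng = np.random.default_rng(0)
    # Perturb the instance, query the black box at the perturbations...
    perturbed = instance + rng.normal(0, 1, size=(n_samples, len(instance)))
    targets = predict_proba(perturbed)[:, 1]
    # ...weight perturbations by proximity to the instance being explained...
    distances = np.linalg.norm(perturbed - instance, axis=1)
    weights = np.exp(-(distances ** 2) / width ** 2)
    # ...and fit an interpretable model to the black box's local behavior.
    surrogate = Ridge().fit(perturbed, targets, sample_weight=weights)
    return surrogate.coef_  # per-feature local influence

print(explain_locally(X[0], black_box.predict_proba))
```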
August 12, 2016
[MINI] ANOVA
Analysis of variance is a method used to evaluate differences between two or more groups. It works by breaking down the total variance of the system into between-group variance and within-group variance. We discuss this method in the context of wait times getting coffee at Starbucks.
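A minimal sketch with invented wait times for three hypothetical locations:

```python
from scipy.stats import f_oneway

# Hypothetical wait times (minutes) at three Starbucks locations.
downtown = [3.1, 4.2, 2.8, 5.0, 3.7, 4.4]
campus = [5.9, 6.3, 5.1, 7.2, 6.0, 5.5]
suburb = [3.0, 3.9, 4.1, 3.3, 4.6, 3.5]

# The F statistic compares between-group variance to within-group variance;
# a small p-value suggests at least one location's mean wait time differs.
f_stat, p_value = f_oneway(downtown, campus, suburb)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```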
August 5, 2016
Machine Learning on Images with Noisy Human-centric Labels
When humans describe images, they have a reporting bias, in that they report only what they consider important. Thus, in addition to considering whether something is present in an image, one should consider whether it is also relevant to the image before labeling it.
July 29, 2016
[MINI] Survival Analysis
Survival analysis techniques are useful for studying the longevity of groups of elements or individuals, taking into account time considerations and right censorship. This episode explores how survival analysis can describe marriages, in particular, using the semi-parametric Cox proportional hazards model.
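As a complement, here is a sketch of the simpler non-parametric Kaplan-Meier product-limit estimator (a relative of the Cox model), on invented marriage-duration data with right censoring:

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function S(t); observed=False
    marks right-censored records (e.g., marriages intact at study end)."""
    durations = np.asarray(durations)
    observed = np.asarray(observed)
    survival = 1.0
    for t in np.unique(durations[observed]):
        at_risk = (durations >= t).sum()              # still followed at t
        events = ((durations == t) & observed).sum()  # events at exactly t
        survival *= 1 - events / at_risk
        print(f"t={t:>2}: S(t) = {survival:.3f}")

# Invented data: years of marriage; False = still married (censored).
years = [2, 5, 5, 8, 12, 15, 20, 20, 25, 30]
ended = [True, True, False, True, True, False, True, False, False, False]
kaplan_meier(years, ended)
```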
July 22, 2016
Predictive Models on Random Data
This week is an insightful discussion with Claudia Perlich about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our discussion covers some novel observations about ROC and AUC, as well as an informative discussion of leakage.
July 15, 2016
[MINI] Receiver Operating Characteristic (ROC) Curve
An ROC curve is a plot that compares the trade-off between true positives and false positives of a binary classifier under different thresholds. The area under the curve (AUC) is useful in determining how discriminating a model is. Together, ROC and AUC are very useful diagnostics for understanding the power of one's model and how to tune it.
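A minimal sketch with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Each point on the ROC curve is the (false positive rate, true positive
# rate) achieved at one classification threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"AUC = {roc_auc_score(y_test, scores):.3f}")  # 0.5 = chance, 1.0 = perfect
```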
July 8, 2016
Multiple Comparisons and Conversion Optimization
I'm joined by Chris Stucchio this week to discuss how deliberate or uninformed statistical practitioners can derive spurious and arbitrary results via multiple comparisons. We discuss p-hacking and a variety of other important lessons and tips for proper analysis.
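A quick simulation shows why uncorrected multiple comparisons are dangerous: run enough null experiments and some will look significant by chance alone:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# 100 A/B tests where the "variant" truly changes nothing.
false_positives = 0
for _ in range(100):
    control = rng.normal(0, 1, size=200)
    variant = rng.normal(0, 1, size=200)  # same distribution: no real effect
    if ttest_ind(control, variant).pvalue < 0.05:
        false_positives += 1

# At alpha = 0.05 we expect ~5 spurious "wins" purely by chance.
print(f"{false_positives} of 100 null experiments looked significant")
```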
July 1, 2016
[MINI] Leakage
If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the future in our training data when building machine learning models. More generally, any feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction can introduce leakage into your model.
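A small synthetic demonstration: the label below is pure noise, yet a single leaky feature makes the model look nearly perfect:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # pure noise: nothing should predict it

# A leaky feature: a noisy copy of the label, standing in for information
# that only becomes available after the outcome (e.g., a refund flag when
# predicting churn).
leaky = (y + rng.normal(0, 0.1, size=1000)).reshape(-1, 1)
X_leaky = np.hstack([X, leaky])

print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())        # ~0.5
print(cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean())  # ~1.0
```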
June 24, 2016
Predictive Policing
Kristian Lum (@KLdivergence) joins me this week to discuss her work at @hrdag on predictive policing. We also discuss Multiple Systems Estimation, a technique for inferring statistical information about a population from separate sources of observation.
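The two-list case of multiple systems estimation is the classic Lincoln-Petersen capture-recapture estimator; a sketch with invented list sizes:

```python
# Two independently collected lists of individuals from some population
# (sizes are made up for illustration).
list_a = 400    # names on source A
list_b = 300    # names on source B
overlap = 60    # names appearing on both lists

# Lincoln-Petersen estimator: if the lists are independent, the overlap
# rate reveals what fraction of the true population each list captured.
estimated_total = list_a * list_b / overlap
print(f"estimated population size: {estimated_total:.0f}")  # ~2000
```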
June 17, 2016
[MINI] The CAP Theorem
A distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh Da and Kyle discuss the CAP Theorem using the analogy of a phone tree for alerting people about a school snow day.
June 10, 2016
Detecting Terrorists with Facial Recognition?
A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.
June 3, 2016
[MINI] Goodhart's Law
Goodhart's law states that "When a measure becomes a target, it ceases to be a good measure". In this mini-episode we discuss how this affects SEO, call centers, and Scrum.
May 27, 2016
Data Science at eHarmony
I'm joined this week by Jon Morra, director of data science at eHarmony, to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long-term relationships.
May 20, 2016
[MINI] Stationarity and Differencing
Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one. If you have a stationary process, you get the benefits of many known statistical properties that can enable you to do a significant amount of inference and prediction.
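A minimal sketch: a random walk is non-stationary, but its first differences are stationary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A random walk is non-stationary: its variance depends on time.
walk = pd.Series(rng.normal(size=1000).cumsum())
diffed = walk.diff().dropna()  # first differencing recovers the increments

for name, s in [("raw walk", walk), ("differenced", diffed)]:
    half = len(s) // 2
    print(f"{name}: var(1st half) = {s.iloc[:half].var():.2f}, "
          f"var(2nd half) = {s.iloc[half:].var():.2f}")
```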
May 13, 2016
Feather
I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between languages. At the time of recording, libraries are available for R and Python, making it easy for data scientists working in these languages to quickly and effectively share datasets and collaborate.
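At the time of recording this went through the standalone feather packages; today pandas exposes the same format directly (via pyarrow), as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({"guest": ["Wes McKinney", "Hadley Wickham"],
                   "language": ["Python", "R"]})

df.to_feather("example.feather")           # write the data frame to disk...
print(pd.read_feather("example.feather"))  # ...and read it back (or from R)
```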
May 6, 2016
[MINI] Bargaining
Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction. Game theoretic approaches attempt to find two strategies from which neither party is motivated to deviate. These strategies are said to be in equilibrium with one another. The equilibria available in bargaining depend on the transaction mechanism and the information of the parties. Discounting (how long parties are willing to wait) has a significant effect in this process. This episode discusses some of the choices Kyle and Linh Da made in deciding what offer to make on a house.
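For the flavor of the math, a sketch of the first-mover's share in Rubinstein's classic alternating-offers model (one standard formalization of discounting in bargaining, not necessarily the exact model discussed in the episode):

```python
def rubinstein_split(delta_buyer, delta_seller):
    """First-mover's equilibrium share in Rubinstein's alternating-offers
    bargaining model; each delta is a party's per-round discount factor
    (patience). The more patient you are, the more you capture."""
    return (1 - delta_seller) / (1 - delta_buyer * delta_seller)

# A patient buyer facing an impatient seller claims most of the surplus.
print(rubinstein_split(delta_buyer=0.95, delta_seller=0.70))  # ~0.90
# Equally patient parties split the surplus nearly evenly.
print(rubinstein_split(delta_buyer=0.9, delta_seller=0.9))    # ~0.53
```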
April 29, 2016
deepjazz
Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow's project jazzml. Deepjazz is a computational music project that creates original jazz compositions using recurrent neural networks trained on Pat Metheny's "And Then I Knew". You can hear some of deepjazz's original compositions on soundcloud.
April 22, 2016
[MINI] Auto-correlative functions and correlograms
When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The auto-correlative function, plotted as a correlogram, helps explain how a given observation relates to recent preceding observations. A very random process (like lottery numbers) would show very low values, while temperature (our topic in this episode) correlates highly with recent days.

Below is a time series of the weather in Chapel Hill, NC every morning over a few years. You can clearly see an annual cyclic pattern, which should be no surprise to anyone. Yet, you can also see a fair amount of variance from day to day. Even if you de-trend the annual cycle, we can see that this would not be enough for yesterday's temperature to perfectly predict today's weather.

Below is a correlogram of the ACF (auto-correlative function). For very low values of lag (comparing the most recent temperature measurement to the values of previous days), we can see a quick drop-off. This tells us that weather correlates very highly, but decliningly so, with recent days.
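A minimal sketch computing an ACF by hand on synthetic daily temperatures (invented numbers, standing in for the Chapel Hill data):

```python
import numpy as np

def acf(series, max_lag):
    """Correlation of a series with lagged copies of itself."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return {lag: np.dot(x[:-lag], x[lag:]) / denom
            for lag in range(1, max_lag + 1)}

rng = np.random.default_rng(0)
days = np.arange(730)
# Synthetic daily temperatures: an annual cycle plus day-to-day noise.
temps = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 3, size=730)

# High correlation at short lags, decaying as the lag grows.
for lag, r in acf(temps, 30).items():
    if lag in (1, 2, 7, 30):
        print(f"lag {lag:>2}: {r:+.2f}")
```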
April 15, 2016
Early Identification of Violent Criminal Gang Members
This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine learning to provide early detection of known criminal offenders who are in a high risk group for committing violent crimes in the future. Their techniques outperform existing techniques used by the police. Elham and Paulo are part of the Cyber-Socio Intelligent Systems (CySIS) Lab.
April 8, 2016
[MINI] Fractional Factorial Design
A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.
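For a taste of the construction, the classic 2^(3-1) half fraction sets a third factor equal to the product of the first two (defining relation I = ABC), halving the number of experimental runs:

```python
from itertools import product

# A full 2^3 factorial needs 8 runs; this half fraction needs only 4.
# -1 and +1 denote the low and high settings of each factor.
print(" A  B  C=AB")
for a, b in product([-1, 1], repeat=2):
    print(f"{a:>2} {b:>2} {a * b:>4}")
```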
April 1, 2016
Machine Learning Done Wrong
Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog post Machine Learning Done Wrong.
March 25, 2016
Potholes
Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal.
March 18, 2016
[MINI] The Elbow Method
Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k that one should select to solve their problem?
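The elbow method addresses exactly this question; a sketch on synthetic clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) always decreases as k grows;
# the "elbow" is the k beyond which adding clusters stops paying off.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia = {km.inertia_:,.0f}")
```

Plotting inertia against k, the sharp bend (here at k=4, the true number of blobs) marks the candidate value.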
March 11, 2016
Too Good to be True
Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights the somewhat paradoxical, counterintuitive fact that unanimity is unexpected in cases where perfect measurements cannot be taken. With large enough data, some amount of error is expected.
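A back-of-the-envelope computation (with an assumed 5% per-witness error rate) shows why perfect unanimity becomes improbable as groups grow:

```python
# Even highly reliable witnesses rarely produce unanimity in large groups.
error_rate = 0.05  # assumed per-witness chance of an erroneous report
for n in [1, 5, 10, 20, 50]:
    p_unanimous = (1 - error_rate) ** n
    print(f"P(all {n:>2} witnesses agree) = {p_unanimous:.1%}")
```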
March 4, 2016
[MINI] R-squared
How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy; it's a measure of the power of one's model.
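A minimal computation on invented house prices:

```python
import numpy as np

# Hypothetical sale prices and a model's predictions for five houses.
actual = np.array([300_000, 450_000, 250_000, 600_000, 380_000])
predicted = np.array([320_000, 430_000, 270_000, 560_000, 400_000])

# R-squared: 1 minus (unexplained variance / total variance).
ss_residual = ((actual - predicted) ** 2).sum()
ss_total = ((actual - actual.mean()) ** 2).sum()
r_squared = 1 - ss_residual / ss_total
print(f"R^2 = {r_squared:.3f}")
```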
February 26, 2016
Models of Mental Simulation
Jessica Hamrick joins us this week to discuss her work studying mental simulation. Her research combines machine learning approaches with behavioral methods from cognitive science to help explain how people reason and predict outcomes. Her recent paper Think again? The amount of mental simulation tracks uncertainty in the outcome is the focus of our conversation in this episode.
February 19, 2016
[MINI] Multiple Regression
This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.
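A minimal sketch with invented home sales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each observation is a vector: [bedrooms, bathrooms, square footage].
X = np.array([[3, 2, 1500], [4, 3, 2200], [2, 1, 900],
              [5, 4, 3000], [3, 2, 1700], [4, 2, 2000]])
prices = np.array([300_000, 450_000, 180_000, 620_000, 340_000, 410_000])

model = LinearRegression().fit(X, prices)
print("per-feature effects:", model.coef_)
print("predicted price:", model.predict([[3, 2, 1600]])[0])
```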
February 12, 2016
Scientific Studies of People's Relationship to Music
Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths about music along the way.
February 5, 2016
[MINI] k-d trees
This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linh Da a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.
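A sketch of both halves of the episode: 1-D binary search, then its multidimensional generalization via scipy's KDTree:

```python
import numpy as np
from scipy.spatial import KDTree

# Binary search, the 1-D idea the episode builds from: halve the search
# space at every step.
def binary_search(sorted_words, target):
    lo, hi = 0, len(sorted_words)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo  # position where target belongs

print(binary_search(["ant", "bee", "cat", "dog", "emu"], "cat"))  # 2

# A k-d tree generalizes this to multiple dimensions, alternating the
# splitting dimension at each level of the tree.
points = np.random.default_rng(0).uniform(size=(1000, 3))
distance, index = KDTree(points).query([0.5, 0.5, 0.5])
print(f"nearest neighbor: point {index} at distance {distance:.3f}")
```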
January 29, 2016
Auditing Algorithms
Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside the intent of the system designers and programmers). Christian Sandvig joins us in this episode to talk about his work and the concept of auditing algorithms.
January 22, 2016
[MINI] The Bonferroni Correction
Today's episode begins by asking how many left-handed employees we should expect at a company before anyone should claim left-handedness discrimination. If not lefties, any number of characteristics (eye color, hair color, favorite ska band, most recent grocery store used) could be studied to look for deviations from the norm in a company.
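The arithmetic behind the episode's worry, plus the correction itself:

```python
# Testing many characteristics multiplies the chances of a fluke finding.
num_tests = 20
alpha = 0.05

# Probability at least one of 20 independent null tests crosses p < 0.05:
p_any_false_positive = 1 - (1 - alpha) ** num_tests
print(f"{p_any_false_positive:.0%} chance of a spurious finding")  # ~64%

# The Bonferroni correction divides the threshold by the number of tests.
corrected_alpha = alpha / num_tests
print(f"corrected per-test threshold: {corrected_alpha}")  # 0.0025
```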
January 15, 2016
Detecting Pseudo-profound BS
A recent paper in the journal Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a collection of buzzwords that fail to contain adequate meaning or truth. These statements are definitively different from lies and nonsense, as we discuss in the episode.
January 8, 2016
[MINI] Gradient Descent
Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking on a foggy hillside.
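A minimal sketch of the algorithm on a simple bowl-shaped function:

```python
# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2 by repeatedly stepping
# downhill -- like descending a foggy hillside by feeling the slope.
def gradient(x, y):
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0  # start somewhere in the fog
learning_rate = 0.1
for step in range(100):
    dx, dy = gradient(x, y)
    x, y = x - learning_rate * dx, y - learning_rate * dy

print(f"converged near ({x:.3f}, {y:.3f})")  # approaches the minimum (3, -1)
```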
January 1, 2016
Let's Kill the Word Cloud
This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.