Logistic Regression on Audio Data

Logistic regression is a popular classification algorithm. In this episode we discuss how it can be used to determine which of two given speakers an audio clip represents. It models the log-odds of a binary output variable (isLinhda) as a linear combination of the available features, which in this episode are spectral bands.

The model takes this form:

$\log \Bigg( \dfrac{p}{1-p} \Bigg) = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \dots + \beta_n \cdot X_n$

The algorithm uses maximum likelihood to find the optimal values for the parameters $\beta_i$. The left side of the equation is the log-odds (logit) of the probability $p$; applying the logistic function to the right side recovers $p$, and a threshold on $p$ defines the classification.
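To make the fitting and thresholding steps concrete, here is a minimal sketch, not the code used for the episode. It estimates the $\beta_i$ by gradient ascent on the log-likelihood and then applies a cutoff to the predicted probability. The toy data, the learning rate, the iteration count, and the function names `logistic`, `fit_logistic`, and `predict` are all assumptions made for illustration.

```python
import numpy as np

def logistic(z):
    """Inverse of the log-odds: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Estimate beta by gradient ascent on the mean log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])  # leading column of ones for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = logistic(Xb @ beta)
        # Gradient of the mean log-likelihood with respect to beta
        beta += lr * Xb.T @ (y - p) / len(y)
    return beta

def predict(beta, X, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return (logistic(Xb @ beta) >= threshold).astype(int)

# Made-up data: one feature, with label 1 more common at larger values
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

beta = fit_logistic(X, y)
labels = predict(beta, X)
```

The classes in the toy data overlap on purpose: with perfectly separable data the maximum likelihood estimate does not exist, because the coefficients grow without bound. A library implementation would normally use a faster optimizer than plain gradient ascent.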

The figures below are referenced during the episode.

Keep an eye on the dataskeptic.com blog this week as we post more details about this project.

Data Science Association

This episode was sponsored by the Data Science Association. Sign up for their Data Science Conference on Saturday, February 18, 2017 by visiting dallasdatascience.eventbrite.com.

Supplemental music for this episode is Chris Zabriskie's Air Hockey Saloon.

The figures below are referenced in this week's episode.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math
from scipy.fftpack import fft
import os
import scipy.io.wavfile as wav
x = np.arange(-6, 6, .01)
# Logistic (sigmoid) function, evaluated on the whole array at once
y = 1.0 / (1.0 + np.exp(-x))
plt.plot(x, y)
plt.title('Logistic curve', fontsize=18)
plt.xlabel('X value (any real number)', fontsize=16)
plt.ylabel('Logistic transformation [0, 1]', fontsize=16)
plt.show()
dname = '../../methods/2017/audio/who-speaking-raw/'
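def show_waveform(fname, offset):
    rate, data = wav.read(fname)
    # The rest of this function is missing from the source. Assumption: it
    # plotted the raw samples. The role of `offset` is unknown, so it is unused.
    plt.plot(data)
    plt.title(fname)
    plt.show()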


The two plots below should look familiar to most readers. They are waveforms from audio of Linh Da and Kyle speaking. In the episode, Kyle asked Linh Da to predict which is which. You can see from the filenames who the actual speaker is, but the test is a bit unfair: identifying a speaker from a waveform, while possible, is a formidable challenge for typical speech.

show_waveform(dname + 'linhda.wav', 0)
show_waveform(dname + 'kyle.wav', 1)

Frequency Spectra

The plots below are samples of the frequency spectra for hosts Linh Da and Kyle. While you might not be able to tell who is who from these plots, can you see that each comes from a distinct source? If so, your eye is discriminating enough to solve this problem, so we expect a logistic regression to be as well.

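def get_spectra(fname, start, stop, window):
    rate, data = wav.read(fname)
    seg_len = int(.5 * rate)  # each segment covers half a second of audio
    bands = []
    for i in np.arange(start, stop, window):
        a = data[int(i*rate):int(i*rate) + seg_len]
        if len(a) < seg_len:
            break  # ran past the end of the clip
        # Rescale samples to roughly [-1, 1]; this scaling assumes 8-bit unsigned audio
        b = (a / 2**8.) * 2 - 1
        c = fft(b)
        d = len(c) // 2
        f = abs(c[:(d-1)])  # magnitudes of the lower half of the spectrum
        bands.append(f)
    # The square root compresses the dynamic range so quiet bands stay visible
    return np.sqrt(np.array(bands))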
lbands = get_spectra(dname + 'linhda.wav', 0, 800, .25)
kbands = get_spectra(dname + 'kyle.wav', 0, 800, .25)
plt.imshow(lbands.transpose(), cmap='cool', interpolation='nearest')
plt.show()
plt.imshow(kbands.transpose(), cmap='cool', interpolation='nearest')
plt.show()

Feature Engineering

From the transformed frequency data, we can bin ranges of values and use those numeric inputs as our features for the logistic regression model. Essentially, we are asking it to find the best $\beta_i$ for each band $X_i$, such that when we take a linear combination (the sum of each band times its $\beta_i$ weight), apply the logistic transformation, and compare the result against a cutoff point, we get our classification.
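One way to picture the binning step is to average neighbouring frequency bins into a small number of bands. The sketch below is an assumption about how that could be done, not the project's code: the helper name `bin_bands`, the number of bands, and the use of the mean are all hypothetical, and the input is made-up data standing in for real spectra.

```python
import numpy as np

def bin_bands(spectra, n_bands=10):
    """Average each row of `spectra` into `n_bands` roughly equal-width bands."""
    spectra = np.asarray(spectra, dtype=float)
    # np.array_split tolerates widths that do not divide evenly
    chunks = np.array_split(spectra, n_bands, axis=1)
    return np.column_stack([chunk.mean(axis=1) for chunk in chunks])

# Made-up stand-in for two time windows of a 12-bin spectrum
demo = np.arange(24, dtype=float).reshape(2, 12)

# One row per time window, one column per band
features = bin_bands(demo, n_bands=4)
```

With real data, each row of a spectra array such as `lbands` or `kbands` would be reduced this way, the rows from both speakers stacked into one feature matrix, and each row paired with a 0/1 label such as isLinhda.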

Again, stay tuned to the dataskeptic.com blog this week for updates on this project and more details about these steps.