Natural Language Processing

This episode overviews some of the fundamental concepts of natural language processing including stemming, n-grams, part of speech tagging, and the bag of words approach.
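
For readers who want something concrete, here is a minimal sketch of two of the ideas named above, n-grams and the bag of words approach, using only the Python standard library. The helper names (`tokenize`, `ngrams`, `bag_of_words`) and the crude whitespace tokenizer are assumptions made for illustration; they are not taken from the episode, and real projects usually rely on a dedicated NLP library for tokenization and stemming.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation.

    This is a deliberately crude tokenizer (an assumption for this sketch).
    """
    stripped = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [token for token in stripped if token]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, discarding word order."""
    return Counter(tokens)


sentence = "The cat sat on the mat."
tokens = tokenize(sentence)
bigrams = ngrams(tokens, 2)   # pairs of adjacent words
bag = bag_of_words(tokens)    # word -> count

print(tokens)
print(bigrams)
print(bag)
```

The bigrams keep local word order, while the bag of words throws order away and keeps only counts. That trade-off is the main difference between the two representations.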

Feedback

I got some great feedback on this episode from Darryl McAdams (@psygnisfive). I wanted to include his comments in the show notes for future listeners.

For speech signal processing, certainly faster CPUs and even the emergence of GPU calculations have probably had the most impact. For natural language processing, however, I think memory and distributed computing have been more impactful. I'm reminded of the seminal work "Scaling to Very Very Large Corpora for Natural Language Disambiguation" by Banko and Brill at Microsoft Research, which showed that large training corpora were more effective than better algorithms for natural language understanding. While CPUs certainly made their training faster, offline training can be patient so long as online recognition is fast. You'll notice they don't even mention training time on the axis of any of their figures.

Great correction, thank you! There was a glimmer of doubt for me as I said this, so I'm glad to have more knowledgeable listeners to set the record straight.

Really interesting, too.
