This episode overviews some of the fundamental concepts of natural language processing, including stemming, n-grams, part-of-speech tagging, and the bag-of-words approach.
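As a quick illustration of two of these ideas, here is a minimal sketch of n-gram extraction and a bag-of-words count in plain Python. The function names are my own for illustration, not anything discussed in the episode, and real pipelines would tokenize more carefully than a simple `split()`:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Return a word -> count mapping, discarding word order entirely."""
    return Counter(tokens)

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# bigrams: ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ...
print(bag_of_words(tokens))
# 'the' appears twice; order information is gone
```

The bag-of-words representation throws away word order, which is exactly what the n-grams recover a little of: a bigram model keeps pairs, a trigram model keeps triples, and so on.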
I got some great feedback on this episode from Darryl McAdams (@psygnisfive). I wanted to include his comments in the show notes for future listeners.
@DataSkeptic i think in part it *is* the speed of the chips tho. if you can't process 10M hours of audio in reasonable time, you're hosed
— Darryl McAdams (@psygnisfive) April 18, 2015
For speech signal processing, faster CPUs and the emergence of GPU computation have probably had the most impact. For natural language processing, however, I think memory and distributed computing have been more impactful. I'm reminded of the seminal work Scaling to Very Very Large Corpora for Natural Language Disambiguation by Banko and Brill at Microsoft Research, which showed that larger training corpora were more effective than better algorithms for natural language disambiguation. Faster CPUs certainly sped up their training, but offline training can afford to be patient so long as online recognition is fast. You'll notice they don't even mention training time on the axes of any of their figures.
@DataSkeptic regarding "the", it's an article/determiner. regarding german, it's SOV (not OVS) by default, but has some extra stuff
— Darryl McAdams (@psygnisfive) April 18, 2015
Great correction, thank you! I had a glimmer of doubt as I said this, so I'm glad to have knowledgeable listeners to set the record straight.
@DataSkeptic additionally, OVS is one of the rarest word orders, and might not even exist, it's not clear
— Darryl McAdams (@psygnisfive) April 18, 2015
Really interesting, too.