April 17, 2015

Natural Language Processing

This episode gives an overview of some fundamental concepts of natural language processing, including stemming, n-grams, part-of-speech tagging, and the bag-of-words approach.
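As a rough illustration of two of the concepts named above, here is a minimal Python sketch of n-grams and the bag-of-words approach. It is not taken from the episode: the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for the example, and it uses only the standard library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


# Hypothetical sample sentence, chosen only for illustration.
tokens = tokenize("The cat sat on the mat")
bigrams = ngrams(tokens, 2)   # pairs of adjacent words, order preserved
bag = bag_of_words(tokens)    # word counts, order discarded

print(bigrams)
print(bag["the"])
```

The bigrams keep local word order, while the bag of words throws order away and keeps only counts. Stemming and part-of-speech tagging are left out here because they usually rely on a dedicated library or a trained model.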

Feedback

I got some great feedback on this episode from Darryl McAdams (@psygnisfive). I wanted to include his comments in the show notes for future listeners.

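@DataSkeptic i think in part it *is* the speed of the chips tho. if you can't process 10M hours of audio in reasonable time, you're hosed
(Darryl McAdams, @psygnisfive, April 18, 2015)

For speech signal processing, faster CPUs and the emergence of GPU computation have probably had the most impact. For natural language processing, however, I think memory and distributed computing have been more impactful. I'm reminded of the seminal work "Scaling to Very Very Large Corpora for Natural Language Disambiguation" by Banko and Brill at Microsoft Research, which showed that larger training corpora were more effective than better algorithms for natural language understanding. While faster CPUs certainly made their training faster, offline training can be patient so long as online recognition is fast. You'll notice they don't even mention training time on the axes of any of their figures.

@DataSkeptic regarding "the", it's an article/determiner. regarding german, it's SOV (not OVS) by default, but has some extra stuff
(Darryl McAdams, @psygnisfive, April 18, 2015)

Great correction, thank you! There was a glimmer of doubt for me as I said this, so I'm glad to have more knowledgeable listeners to set the record straight.

@DataSkeptic additionally, OVS is one of the rarest word orders, and might not even exist, it's not clear
(Darryl McAdams, @psygnisfive, April 18, 2015)

Really interesting, too.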
