Today's blog post is a hands-on demo and evaluation of the Google Cloud Natural Language API. Like most cherry-picked demos, their site shows impressive results for entity identification, sentiment analysis, and syntax extraction. How will it fare on a real-world dataset?
As I've done previously, I'm going to use the episode short descriptions found in the Data Skeptic RSS feed. For some episodes we've done extensive write-ups, while for others the descriptions are terse and were written at the last minute. We're striving to improve these over time, but the older ones are admittedly not as good. Let's keep that in mind as we evaluate.
This API allows a fairly generous free tier of 5k requests per month. Requests beyond this are cheap enough that a graduate student could easily afford to use it for their research. At the time of writing, it's $0.01 per request from 5k to 1M requests, and progressively discounted after that.
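For a rough sense of scale, here's a back-of-the-envelope estimate; the 50,000-request workload is a made-up example, while the free tier and per-request price are as quoted above.
# Back-of-the-envelope cost estimate for a hypothetical workload.
free_requests = 5000          # monthly free tier
price_per_request = 0.01      # dollars per request in the 5k-1M band
total_requests = 50000        # hypothetical workload
billable = max(0, total_requests - free_requests)
print("Estimated cost: ${:.2f}".format(billable * price_per_request))  # $450.00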
You will, however, need to take a few steps, including enabling billing, as described in the Quickstart.
While there are some Python wrapper libraries, I found them infuriatingly difficult to install and use. I spent more time trying to get them working than it would have taken to just call the RESTful API directly, so that's what I did in the code below.
%matplotlib inline
import matplotlib.pyplot as plt
import os
import requests
import xmltodict
import json
import pandas as pd
import pickle
from bs4 import BeautifulSoup
# Get all the descriptions
fname = 'feed.xml'
url = 'http://dataskeptic.com/feed.rss'
if not os.path.isfile(fname):
    r = requests.get(url)
    f = open(fname, 'wb')
    f.write(r.text.encode('utf-8'))
    f.close()
with open(fname) as fd:
    xml = xmltodict.parse(fd.read())
episodes = xml['rss']['channel']['item']
descriptions = []
descToTitle = {}
descToNum = {}
l = len(episodes)
for episode in episodes:
    enclosure = episode['enclosure']
    desc = episode['description']
    desc = BeautifulSoup(desc, "lxml").text
    descriptions.append(desc)
    descToTitle[desc] = episode['title']
    descToNum[desc] = l
    l = l - 1
print(len(descriptions))
sentence = 'Data Skeptic is a podcast hosted by Kyle Polich.'
api_key = json.load(open('gcloud-lang-api-key.json', 'r'))['api_key']
url = "https://language.googleapis.com/v1/documents:analyzeEntities?key={api_key}".format(api_key=api_key)
payload = {
    "encodingType": "UTF8",
    "document": {
        "type": "PLAIN_TEXT",
        "content": sentence
    }
}
r = requests.post(url, data=json.dumps(payload))
r.status_code
json.loads(r.content.decode('utf-8'))
Looks good! It identified two key ideas in this sentence. Granted, a simple rule looking at capitalization would have been good enough to recognize the two important entities here, so this easy test doesn't tell us much yet about how well the API identifies entities in general. Still, notice how Data Skeptic is labeled as a "Work of Art" (thanks Google!) and Kyle Polich is labeled as a person, which only a few people have doubted to be true.
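For comparison, the naive capitalization rule mentioned above might look something like the sketch below. It treats any run of capitalized words as a candidate entity, so it over-triggers on ordinary sentence-initial words and ignores punctuation; it's only meant as a baseline.
# Naive baseline: treat consecutive capitalized words as candidate entities.
def naive_entities(text):
    words = text.rstrip('.').split()
    candidates, current = [], []
    for w in words:
        if w[0].isupper():
            current.append(w)
        else:
            if current:
                candidates.append(' '.join(current))
            current = []
    if current:
        candidates.append(' '.join(current))
    return candidates

print(naive_entities(sentence))  # ['Data Skeptic', 'Kyle Polich']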
Now, let's retrieve all the results for the podcast descriptions.
results = {}
for desc in descriptions:
    if desc not in results:
        payload = {
            "encodingType": "UTF8",
            "document": {
                "type": "PLAIN_TEXT",
                "content": desc
            }
        }
        r = requests.post(url, data=json.dumps(payload))
        if r.status_code == 200:
            results[desc] = json.loads(r.content.decode('utf-8'))
        else:
            print("ERROR with desc: " + desc)
pickle.dump(results, open('results.pickle', 'wb'))
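Since these calls count against the monthly quota, the pickle above also makes it easy to resume from disk on a later run instead of re-querying the API. A small convenience sketch:
# Reload previously cached API responses rather than re-hitting the API.
if os.path.isfile('results.pickle'):
    results = pickle.load(open('results.pickle', 'rb'))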
# In the next section we'll use this function to extract data into a more useful data structure
def extract_entities_as_rows(r):
    rows = []
    entities = r['entities']
    for entity in entities:
        mentions = entity['mentions']
        if 'wikipedia_url' in entity['metadata']:
            wiki = entity['metadata']['wikipedia_url']
        else:
            wiki = ''
        row = {
            "num_mentions": len(mentions),
            "wiki_url": wiki,
            "name": entity['name'],
            "salience": entity['salience'],
            "type": entity['type']
        }
        rows.append(row)
    return rows
Let's take a totally arbitrary example and explore it. I picked the episode we did on PageRank. Below are the entities that were extracted, sorted by salience descending. Ideally, the most important/relevant concepts rank near the top. It definitely gets PageRank right as the top concept.
Beyond that, the ordering is a bit mediocre, yet it's interesting to observe that the wiki_url field is well mapped. I noticed that "search engine", which could have been linked to some very generic Wikipedia page, was left unlinked. I see this as a win. Of the four wiki links provided, this anecdote has a 100% match rate. It gets me thinking about what I might do in the future taking these matches and linking them to the Wikipedia Knowledge Graph.
key = list(results.keys())[0]
print(key)
pd.DataFrame(extract_entities_as_rows(results[key])).sort_values('salience', ascending=False)
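As a first step in that direction, one could pull out just the entities that came back with a Wikipedia link and check what fraction of an episode's entities are grounded this way. A quick sketch reusing the helper above:
# Fraction of extracted entities that carry a Wikipedia link (sketch).
ent_df = pd.DataFrame(extract_entities_as_rows(results[key]))
linked = ent_df[ent_df['wiki_url'] != '']
print("{} of {} entities have a Wikipedia URL".format(len(linked), len(ent_df)))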
rows = []
for key in results.keys():
    r = results[key]
    rows.extend(extract_entities_as_rows(r))
df = pd.DataFrame(rows)
df.sort_values('salience', ascending=False, inplace=True)
df.groupby(['type'])['name'].count().sort_values().plot(kind='barh')
plt.show()
Below are the top and bottom 10 matches ranked by salience. Ideally, the salience score is highly correlated with my notion of relevance. If so, I should be able to identify a threshold above which I trust this service, and a threshold below which I discard results. I'll leave you to judge for yourself based on the edge cases below.
df.head(10)
df.tail(10)
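If a usable cutoff does emerge from eyeballing these edge cases, applying it is a one-liner; the 0.05 value below is purely illustrative.
# Keep only entities above an illustrative salience cutoff.
threshold = 0.05
trusted = df[df['salience'] >= threshold]
print("{} of {} entities pass the threshold".format(len(trusted), len(df)))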