After our recent episode with Peter Backus, I wanted to learn more about Elo ratings in chess. His research found an effect size equivalent to a difference of 30 Elo points. How significant is that? To answer the question, I needed some data to explore. I found chessgames.com to be a good resource, and below is how I crawled the data.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import requests
import bs4 as soup  # BeautifulSoup 4; provides soup.BeautifulSoup as used below
import os
import numpy as np
import time
os.makedirs('./dl/', exist_ok=True)  # do not fail if the cache directory already exists
url = 'http://www.chessgames.com/directory/'
The site organizes players into alphabetical pages, so I first crawl each of those in the code block below.
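start = 65  # ord('A')
raw_pages = []
for i in range(26):
    letter = chr(start + i)
    page = url + letter + '.html'
    fname = './dl/' + letter + '.html'
    # Download each page only once; later runs read the cached copy.
    if not os.path.isfile(fname):
        r = requests.get(page)
        time.sleep(1)  # pause between requests to go easy on the server
        # Write as ASCII and drop any characters that cannot be encoded.
        with open(fname, 'w', encoding='ascii', errors='ignore') as f:
            f.write(r.text)
        r.close()
    with open(fname, 'r') as f:
        lines = f.readlines()
    raw_pages.append(' '.join(lines))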
The next two code blocks parse those pages and extract the data available.
dfs = []
for body in raw_pages:
    b = soup.BeautifulSoup(body)
    rows = b.findAll('tr')
    # Player rows are the table rows with exactly five cells.
    prows = []
    for row in rows:
        cells = len(row.findAll('td'))
        if cells == 5:
            prows.append(row)
    ratings = []
    names = []
    yearss = []
    gamess = []
    for prow in prows:
        cells = prow.findAll('td')
        # Players without a numeric rating get -1 and are filtered out later.
        try:
            rating = int(cells[0].text.replace(' ', ''))
        except ValueError:
            rating = -1
        name = cells[2].text.replace(' ', '')
        years = cells[3].text.replace(' ', '')
        games = cells[4].text.replace(' ', '')
        ratings.append(rating)
        names.append(name)
        yearss.append(years)
        gamess.append(games)
    # One DataFrame per alphabetical page.
    df = pd.DataFrame({'rating': ratings, 'name': names, 'years': yearss, 'games': gamess})
    dfs.append(df)
# Stack the per-letter frames into a single DataFrame.
df = pd.concat(dfs)
df.shape
df.sort_values('rating', inplace=True)
df = df[df.rating>0]
df.index = np.arange(df.shape[0])
df.head()
df['rating'].hist()
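One way to read the 30-point effect from the introduction is through the expected score that the Elo model assigns to a rating gap. The block below is a minimal sketch, assuming the standard logistic Elo formula with a 400-point scale; the function name `expected_score` is my own and is not part of the crawl code above.

```python
def expected_score(rating_diff, scale=400.0):
    """Expected score (win = 1, draw = 0.5, loss = 0) for a player
    rated `rating_diff` points above the opponent.

    Assumes the standard logistic Elo formula with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


# Compare a few rating gaps, including the 30-point effect size.
for diff in (0, 30, 100, 200):
    print(f"{diff:>3} Elo points -> expected score {expected_score(diff):.3f}")
```

Under that assumption, a 30-point edge moves the expected score from 0.500 to about 0.543, or roughly four extra points per hundred games.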
The database records the number of games played. Naturally, more games mean more opportunities for one's rating to rise. Surely some players with greater skill or point-acquiring strategies will ascend faster. But by how much? The plot below seems to indicate it's a slow climb.
df['games'] = df['games'].apply(lambda x: int(x.replace(',', '')))
plt.figure(figsize=(10,10))
plt.scatter(df['games'], df['rating'], alpha=0.2)
plt.xlabel('Number of games')
plt.ylabel('Rating')
plt.ylim([1500, 3000])
plt.xlim([0, 500])
plt.show()
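To put a rough number on the "slow climb", one option is the least-squares slope of rating against games played. This is a sketch rather than part of the original analysis: the helper name `ols_slope` is hypothetical, and it assumes a straight line is a reasonable summary of the scatter, which the plot alone does not establish.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (change in y per unit of x)."""
    xs = [float(x) for x in xs]
    ys = [float(y) for y in ys]
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of squared deviations in x; zero means there is no spread to fit.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical usage with the frame built above:
#   ols_slope(df['games'], df['rating'])  # rating points per additional game
```

Multiplying the result by 100 gives rating points per hundred games, which is easier to compare with the 30-point effect size. The slope describes an association across players, not how any one player's rating evolves.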