
US Chess Analytics - Part 1 - Data Collection
Introduction
A few years ago, I started a blog series on web scraping with Python and applying it to USCF data to do some neat analysis. That was shortly after my twins were born, and I severely overestimated the amount of free time that I would have. Two and a half years later, I have decided to pick this project back up.
In this post, I will give a quick overview of the data collection effort that is underway and a roadmap of the project.
Data Collection
I'm sure that I don't need to explain to American players that the USCF website contains a record of all rated games played from 1991 to the present. Most players spend a significant amount of time on this website waiting to see how many points they won or lost at the latest tournament and comparing themselves to all their friends! The website is archaic and clunky, but its pages are consistently structured, so it lends itself to web scraping.
Please note that I am NOT encouraging anyone to build the distributed web scraper that I will describe here. Once this scraper is finished, I will make the database freely available to anyone who wants to help with the project.
Here is my approach:
Every USCF rated player has a page like this one that gives their rating information, ranks, etc. At the bottom of this image, you see a link for Game Statistics, where you can search for results along many different dimensions. To build our database of the results of all USCF rated games, we will focus on the "Record by Year" feature.
This is all great stuff, but to analyze this at scale, we need to be able to extract the data. To do that, we can look at the underlying HTML:
You can see that hidden within the text soup that creates the webpage is a very structured way of referring to the elements in the page. If you look at the URL for the pages, you will also see a set structure. We can simply replace the player id and the year to get the information that we desire for any (ALL!!) players.
http://www.uschess.org/datapage/gamestats.php?memid=12681247&ptype=Y&rs=R&dkey=2019&drill=Y
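The URL pattern tells us which page to fetch; the structured markup tells us what to pull out of it. As a rough illustration of the latter, here is a minimal sketch using only Python's standard library. Please note that the HTML snippet, the column names, and the member ids in it are all made up for illustration; I am assuming a simple table layout here, not reproducing the actual markup of the USCF pages.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows with no <td> cells (e.g. header rows)
                self.rows.append(self._row)
            self._row = None


# HYPOTHETICAL markup: invented for this example, not copied from the site.
SAMPLE_HTML = """
<table>
  <tr><th>Event</th><th>Opponent</th><th>Result</th></tr>
  <tr><td>Example Open</td><td>11111111</td><td>W</td></tr>
  <tr><td>Example Open</td><td>22222222</td><td>D</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

The point is only that once the cells sit in predictable tags, turning a page into rows of data is mechanical. A real page would need its own inspection before any of this applies.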
The strategy is therefore clear: write a program that will iterate through all players for each year between 1991 and 2020 and gather all of the data for analysis.
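To make the "replace the player id and the year" idea concrete, here is a small sketch that only builds the URLs and does not fetch anything. The function names build_url and iter_urls are my own for this example. The ptype, rs, and drill values are copied verbatim from the example URL above, and I am assuming they stay fixed for every player and year.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.uschess.org/datapage/gamestats.php"


def build_url(member_id, year):
    """Build the 'Record by Year' URL for one player and one year."""
    params = {
        "memid": member_id,  # the player id
        "ptype": "Y",        # copied from the example URL (assumed constant)
        "rs": "R",           # copied from the example URL (assumed constant)
        "dkey": year,        # the year
        "drill": "Y",        # copied from the example URL (assumed constant)
    }
    return f"{BASE_URL}?{urlencode(params)}"


def iter_urls(member_ids, first_year=1991, last_year=2020):
    """Yield one URL per (player, year) pair, both year bounds inclusive."""
    for member_id in member_ids:
        for year in range(first_year, last_year + 1):
            yield build_url(member_id, year)


print(build_url(12681247, 2019))
```

Using a generator here means the full list of links never has to sit in memory at once, which matters when the list runs into the millions.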
Volume
How much data are we talking about? The latest version of the USCF Golden Database, containing all USCF players, had 1,010,463 rows! If we are clever in how we filter this list, we can get down to roughly 600-700k players that we need to look up.
Doing some quick math, we need to look at 30 years for each of 700k players, which is 21 million links!
If our program could process 5 players per minute (all 30 years for each player), it would take a single machine about 97 days to complete, since 700k players at 5 per minute is 140,000 minutes! Therefore, we need to get more machines working on it. Currently, I have 20 workers on the job, and it should take approximately 5 days!
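For anyone who wants to check the arithmetic, here it is spelled out. The one assumption I am adding is that throughput scales linearly with the number of workers, which is optimistic in practice.

```python
PLAYERS = 700_000          # upper end of the filtered player list
YEARS = 30                 # 1991 through 2020, inclusive
PLAYERS_PER_MINUTE = 5     # one worker, all 30 years for each player
WORKERS = 20

MINUTES_PER_DAY = 60 * 24

total_pages = PLAYERS * YEARS
single_machine_days = PLAYERS / PLAYERS_PER_MINUTE / MINUTES_PER_DAY
# Assumes perfectly linear scaling across workers.
cluster_days = single_machine_days / WORKERS

print(f"{total_pages:,} pages to visit")
print(f"{single_machine_days:.1f} days on one machine")
print(f"{cluster_days:.1f} days across {WORKERS} workers")
```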
Again, I DO NOT ENCOURAGE DOING THIS YOURSELVES! This is a one-time activity for me, after which I will make my data available for free. I do not want to see people taking down the USCF website needlessly. I am balancing the number of workers and website hits and monitoring their performance to ensure that I am not doing damage to the site. Please be careful when doing this type of web scraping at scale. For this reason, I am not going to share the code repository that I am using.
Analyzing
I am only a fraction of the way done with this job, and my database is already 10M rows and several gigabytes in size. When it is done, it will be truly massive and would be difficult to analyze on any single machine. Therefore, to analyze this data, I will be using a Spark cluster running PySpark. This will be particularly important once I start to do Graph Analytics and will need to run GraphX.
Initial Cool Results
It is a well-known "fact" that IM Jay Bonin is the most prolific chess player in the US. Since 1991, Jay has played 15,552 Regular Rated games! This is a mind-blowing stat. In 2003, Jay played 716 regular rated games. This is an average of about 2 games per day for an entire year.
When thinking about this project a few years ago with IM Denys Shmelov, we defined what we call the Bonin Number. Your Bonin Number (B#) is the largest N such that you have played N distinct opponents at least N times each.
It is clear from the definition that the minimum number of games required for a B# of N is N² (N opponents times N games each).
What is truly remarkable is that Jay's B# is 48! He has played 48 distinct opponents at least 48 times each. Reflect on this stat and think about what your own B# might be!
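If you want to work out your own B#, here is a minimal sketch of the calculation. It assumes you have a list with one entry per game, each entry being the opponent's id. The function name bonin_number and the toy data are mine and purely for illustration.

```python
from collections import Counter


def bonin_number(opponent_ids):
    """Return the largest N such that N distinct opponents were each
    played at least N times. opponent_ids has one entry per game."""
    # Games played against each opponent, most frequent opponent first.
    counts = sorted(Counter(opponent_ids).values(), reverse=True)
    n = 0
    for rank, games in enumerate(counts, start=1):
        if games >= rank:
            n = rank
        else:
            break  # counts only get smaller from here on
    return n


# Toy data: three opponents played 3 times each, one opponent played once.
games = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"]
print(bonin_number(games))
```

In the toy data, opponents A, B, and C were each played 3 times, so the B# is 3. Opponent D does not help, because a B# of 4 would need four opponents played at least 4 times each.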
What's next?
When I write again, we will start to examine the most prolific players in the US since 1991, look at where chess is and is not being played in the US, and various other neat nuggets. We will also start to look more at the Bonin Number for all players throughout the US.
If you have any ideas, please let me know and I will see if they can be worked into the analysis.
Thanks for reading!