Rate Limiting - Chess Forums

bcurtis

Jan 21, 2019

0

#1

I just now needed to shut off access to the API for an IP address that was attempting to pull all PGNs, for all games, apparently for all players. This is decidedly against the goals of the API, which is to help developers build tools that help players enjoy and learn more from their games on chess.com. I hate blocking people, but I need to protect the site.

Downloading all games is foolish — even at the max rate limit of 3 archives per second, this would take 3 years of continuous operation... during which time enough games would have been played to require another 18 months to download. During which time enough games would be played... It might never end.

The person who created this Python script did not create a user-agent string that allowed us to contact him. I want to call attention to this text in the documentation:

In some cases, if we detect abnormal or suspicious activity, we may block your application entirely. If you supply a recognizable user-agent that contains contact information, then if we must block you application we will attempt to contact you to correct the problem.

Maybe it wasn't clear that downloading the whole library is a bad idea, but we couldn't contact this developer to explain that and work out a solution. What if we made it more clear?

Developers with working code, let's discuss some options. Would this work for you?

We establish a total limit per day. The limit should be about 3–5x higher than anyone currently uses for their programs.
This could be a limit on the total number of requests (which might penalize applications that consume a lot of small endpoints frequently), or the total bandwidth (which might penalize downloading large archives). Which is better?
We would place the information about your daily allowance in the response headers. When you get to 90% of the limit, a 429 "Too Many Requests" response will be delivered 50% of the time; at 100%, a 429 response is delivered to every request until the next day

In this way, you will know what's allowed and when you are getting close. If this happens often and you are building tools for the Chess.com players, then we can work with you to get special, higher rate limits.

What do you think? Comments or ideas welcome!

Tricky_Dicky

Jan 21, 2019

0

#2

Like others here I am am a non professional programmer. I am unfamiliar with the 'user agent string' concept. I can see the principle having looked it up now but as I am pulling the data using VBA into MS Excel I am not sure how that would work for me.

stephen_33

Jan 21, 2019

0

#3

We establish a total limit per day. The limit should be about 3–5x higher than anyone currently uses for their programs.

Not sure if that wouldn't cause me a problem. I use the site's endpoint data solely for running the TMCL & Knockout Leagues. I try to keep my endpoint requests to an absolute minimum but the work I do in TMCL sometimes requires downloading sizeable data-sets.

For example, when we come to collect & collate the results of all games in all the matches of some 87 teams, it might involve up to 400 separate match endpoints & some of those comprise of 300 boards or more! And that's something we need to complete in the shortest possible timescale to ensure fairness to all teams.

Because we adjust final scores in the cases of closed accounts (for fair play) by imposing our own two-point penalty, the web match result by itself isn't of much use.

bcurtis

Jan 21, 2019

0

#4

In most cases, amateur programmers won't need to worry about the user-agent. Dabbling in the API calls is not going to cause problems. It helps us find you if your system goes bonkers, though, or if you have a rich application that is doing so much it impacts other users. As an indication, we get about 600 hits to the PGN game-archive endpoint per week, normally, but the application I needed to block was doing 2000 per hour for a few days.

A user-agent string is just an HTTP header. The programming tool you are using should allow you to add headers to your request. You would want to add something like:

User-Agent: my-profile-tool/1.2 (username: Tricky_Dicky; contact: me@example.com)

Of course, you would put your tool name and your contact info in there.

bcurtis

Jan 21, 2019

0

#5

@stephen_33 , can you give me more stats on your usage? Are you saying that to score a match you need to hit 400x300 match-board endpoints?

If you have a unique user-agent for your program, I can scan our logs and obtain this info.

Tricky_Dicky

Jan 21, 2019

0

#6

Thanks Ben. I'll look at it although I suspect my usage won't impact too much on the server. My biggest request is flag logon's each day with stats for each profile.

stephen_33

Jan 21, 2019

0

#7

bcurtis wrote:

A user-agent string is just an HTTP header. The programming tool you are using should allow you to add headers to your request. You would want to add something like:

User-Agent: my-profile-tool/1.2 (username: Tricky_Dicky; contact: me@example.com)

Of course, you would put your tool name and your contact info in there.

Thanks, that's most useful & I know the site prefers to contact us via e-mail so I'll definitely include my e-mail address. What should I use as my 'tool name' - is that a script name/reference of some kind?

stephen_33

Jan 21, 2019

0

#8

bcurtis wrote:

@stephen_33 , can you give me more stats on your usage? Are you saying that to score a match you need to hit 400x300 match-board endpoints?

If you have a unique user-agent for your program, I can scan our logs and obtain this info.

No, nothing like that. I've managed to calculate team match scores from scratch, adjusting for closed accounts, using only the match endpoints themselves, nothing more. By the sounds of the problem you've described, my demands are way below anything that would be a problem.

The total number of sequential requests I'd be making would typically be (circa) 400 team match endpoints varying in size from 5 boards up to a few hundred. I can't see that putting much of a strain on the servers?

Tricky_Dicky

Jan 22, 2019

0

#9

I think I have got this. It seems that 'agent-user string' is just a literal which is logged by the data site for future reference if required.

I have included this in my call.

xmlhttp.setRequestHeader "User-Agent", "MS Office 365, username: Tricky_Dicky; contact: me@example.com" (but with my email)

This will send many hundreds of time for each procedure. Typically I think I would do 10k requests when I check flag logins each day.

I presume that that many header log entries is no issue for the site.

I don't as a rule check for return codes, just data, but should I be error trapping, in particular for return code 429.

stephen_33

Jan 22, 2019

0

#10

"I don't as a rule check for return codes, just data, but should I be error trapping, in particular for return code 429" - 429, which one's that?

But for users of the Python requests module, the method 'response.status_code' will return the status value: 200 is a successful response (I think), 404 is page not found & 410 an invalid url(?)

That's a big improvement on the somewhat inelegant error-trap I was using in vers. 3.3

Tricky_Dicky

Jan 22, 2019

0

#11

bcurtis wrote:

We would place the information about your daily allowance in the response headers. When you get to 90% of the limit, a 429 "Too Many Requests" response will be delivered 50% of the time; at 100%, a 429 response is delivered to every request until the next day

In this way, you will know what's allowed and when you are getting close. If this happens often and you are building tools for the Chess.com players, then we can work with you to get special, higher rate limits.

skelos

Jan 22, 2019

0

#12

In perl, I set my user agent string:

my $ua = LWP::UserAgent->new;

$ua->agent('@skelos/0.10');

In Python, with the Requests module:

headers = { 'User-Agent' : '@skelos/0.10 (Python 3.x)' }

...

s = requests.Session() # I think this is the way to get persistent connections

...

r = s.get(url, headers=headers)

...

Note that I'm a very part time Python programmer and wrote that code as an experiment. I do like the Requests module and it seems quite well documented.

Otherwise, I've hit four concurrent connections at least once by accident, but the first one to see a 429 stopped as I've coded all my scripts to do. 429 means I'm doing something wrong.

skelos

Jan 22, 2019

0

#13

bcurtis wrote:

In most cases, amateur programmers won't need to worry about the user-agent. Dabbling in the API calls is not going to cause problems. It helps us find you if your system goes bonkers, though, or if you have a rich application that is doing so much it impacts other users. As an indication, we get about 600 hits to the PGN game-archive endpoint per week, normally, but the application I needed to block was doing 2000 per hour for a few days.

...

2000 requests/hour is slightly more than one every 2s.

Unless there is a substantial volume of data (which there may have been in this case) that's very little.

I've noted before that I can't get to 10 requests/second single threaded even for small amounts of data and think that's inordinately slow.

skelos

Jan 22, 2019

0

#14

Tricky_Dicky wrote:

Thanks Ben. I'll look at it although I suspect my usage won't impact too much on the server. My biggest request is flag logon's each day with stats for each profile.

Be interested in a few countries and as well as stats (for ratings) want last_login (from the profile ... maybe I don't; maybe the time my program ran will suffice for that), number of groups (the clubs endpoint and get the names of the clubs when I only want a number), and the number of daily games (which means downloading current games when again only a count is wanted) is pretty inefficient.

Edit: profile contains the date someone joined chess.com. I need that. Perhaps having got it, and the user_id, I could fake last_login and only re-fetch the profile when someone changes their account name. This is getting more complicated than I would prefer it to be.

For a small group of players (XE for England is small) it's OK. For GB it's a lot more. For USA it's very big.

I hit the "I can't get 24 hours of data in 24 hours" problem a while back. More aggressive caching, give up on collecting some data at all and I'm (barely) surviving but have wondered if I'd be contacted about that one. Haven't been yet.

Edit: I notice my current update has been running 18 hours. It's not a bandwidth limit here I'm pretty sure; chess.com and other websites are very responsive. It's not storage speed here; SSD to eMMC (or whatever micro-SD cards use these days) makes no difference. It's not local CPU (older Intel i7 to current ARM 1.4Ghz CPU (Raspberry Pi) makes no difference).

skelos

Jan 22, 2019

0

#15

Re limits ... is the problem real or perceived? The example given sounds like someone wanted all games ever payed (such as lichess make available for download) ... perhaps chess.com should too.

Bandwidth ... the current (implementation defined? Intentional?) limit I see is on requests/second. A large enough bandwidth limit shouldn't bother me.

Requests total ... as noted, this would need to be very generous as if for a player you want profile, stats, club memberships and current games that's four endpoints. Plus whatever endpoint the player name came from.

My suggestion is to redefine the question: what problem was the heavy user causing you?

Too much bandwidth usage for api.chess.com (cost/perceived value? Impact on web services? Cost of CloudFlare? Other?)

Overloading api.chess.com or some component? (Bandwidth? Frequency of requests?)

I find api.chess.com very useful. I don't find it as useful as it could be due to the (observed) very slow performance. I'd rather discuss improvements to api.chess.com performance than additional limits.

Note I do respect 429 responses. I've hit four concurrent connections by accident and the first script which got a 429 response stopped at once and I said "oops!"

I note that the original single allowable connection would be nearly unworkable for me: if I've a report that runs for several minutes ("How balanced is this match? i.e. match lookup and then per-player stats lookup) and I want to download a player's archive to check some games ... I'd need to implement a queue.

I'm going to implement a queue if I make some of those reports available to other club admins via a website ... and then does the website get up to three concurrent accesses (one or more people, probably typically not me) and I get to continue with up to three concurrent accesses for my personal use, or have I overstepped the limit?

I'm not, by the way, keen to see an intermittent 429 response. Coding to recognise that would be a PITA. Block for 5, 10, 15 minutes or something like that. Or as I'm not sure what help it is to know I'm at 90% but not 100% of my daily usage just let me hit 100% and 429 every request I make.

skelos

Jan 22, 2019

0

#16

Re performance again: I've not done it (not in the spirit of things) but I am confident that for a number of the reports I use api.chess.com for web scraping would be faster.

Isn't the idea of api.chess.com to avoid web scraping?

skelos

Jan 22, 2019

0

#17

Addendum re saying who is making the request: so far I've not bothered (as @Tricky_Dicky apparently has*) of including script name as well as some ID in my requests, just my chess.com account.

*Edit: No, that was @bcurtis making a suggestion. Sorry for the confusion.

It's easy enough to add a script's name in there I suppose, if it's wanted.

WhiteDrake

Jan 22, 2019

0

#18

Tricky_Dicky wrote:

I think I have got this. It seems that 'agent-user string' is just a literal which is logged by the data site for future reference if required.

Yes, the 'User-Agent' string is an HTTP header sent by web clients (e.g. browsers) to web servers; it can be used to serve different content to different clients or for referring purposes. Some years ago, Internet Explorer was the most used browser, followed by Mozilla Firefox (Apple used Safari, a few people used Opera or the dreadful NN). Firefox complied pretty well with HTML and CSS specifications, Internet Explorer was pretty bad, NN was horrendous. I believe that originally the browsers were honest in the User-Agent header, i.e. Internet explorer stated there that it is Internet Explorer of this and this version and that and that rendering engine etc. So people started to use crappy Javascript workarounds so that their web pages would look the best they could in all browsers (say if User-Agent starts with Mozilla, do something fancy, else if it's Opera or Safari, do something else, otherwise at least display a boring version of the page, or tell the user to get a better browser - do you remember those stupid captions "This site is optimised for Mozilla Firefox"? ).

But all browser developers wanted for the websites to look great in their browser, so they started to say in the 'User-Agent' header that their browser is Mozilla of some type. And this nonsense remained until today. For example, this is what my Chrome sends to Chess.com in the 'User-Agent' string:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

That is, Chrome claims something like "I'm Mozilla of version 5.0, with rendering engine AppleWebKit, of type Chrome, subtype Safari. And other browsers send similar nonsense. I think it's pretty funny. BTW you can inspect this in your browser in the developer's mode (ctrl+shift+c should work in almost any browser on windows, then click on the 'Network' tab and look for the 'Request Headers' subsection of 'Headers', User-Agent should be somewhere at the bottom).

So, if you can put something useful inside the User-Agent string, it's more than one expects. I think it's a nice idea to put some identification or contact information if you want the web owner to contact you in case of some problems. But the header should definitely start with Mozilla/5.0.

skelos

Jan 23, 2019

0

#19

stephen_33 wrote:

"I don't as a rule check for return codes, just data, but should I be error trapping, in particular for return code 429" - 429, which one's that?

...

429 is the response code for "too many requests". You'll find many references around:

https://httpstatuses.com/429

There is a structure to the result codes (2xx, 4xx, 5xx) but 429 from api.chess.com means "slow down, back off". The only time I've seen it is when in error I had four connections and the permitted limit is three.

Pretty sure I wrote it up in my thread on return codes and what their likely causes are. If I didn't, feel free to add it.

(Other sites might use 429 for other reasons ... e.g. if they are suddenly inundated with more requests than they can handle. The gist is "come back later", as it's a temporary error indicating a resource problem on the server side. Exceeding three concurrent connections fits that more or less, because you're not being told your request is invalid, but that the site's too busy to handle it. In the instance of four connections "too busy" is because you're over your limit.)

bcurtis

Jan 24, 2019

0

#20

@Tricky_Dicky, consider naming your script (instead of "MS Office 365"), so that if you are running more than one we can tell you which one we are noticing. For example:

xmlhttp.setRequestHeader "User-Agent", "league-O-Matic/1.0 (username: Tricky_Dicky; contact: me@example.com)"

The logs always contain the user-agent; usually it is the default for the software you are using. So making this custom has no impact on logs.

@skelos, you are correct that 2,000 requests per hour is not a lot. Our servers generally handle about 2,000 per second, over all. The problem with this script was two-fold:

The endpoint it was hitting was heavy: monthly game archives can have thousands of games in them.
The script was not acting on behalf of the players, it was just harvesting (it started at the first username, grabbed all those archives, then the next username, and grabbed all those). The API is expressly and exclusively for building tools to help people enjoy chess. It is not for harvesting all game data in an infinite cycle. If there is a benefit to players in making such large amounts of data available, they would be distributed in a different way.

Skelos, you also wrote:

I am confident that for a number of the reports I use api.chess.com for web scraping would be faster.

Can you explain what? In my view, the only time scraping would be faster would be if the data were packaged for your specific use in a more appropriate way. The API responses from the server are typically less than 1ms if the data are cached, and 20ms if the data need to be pulled from source. The average www.chess.com server response time is 65ms. In both cases, the bulk of the time you see is network time, which can be 100–200ms each way. You would likely get a 5–10x speed increase if you ran your program from a server in Los Angeles.

I think that we can increase the number of simultaneous connections to 5. Hope that helps.

@WhiteDrake wrote:

But the header should definitely start with Mozilla/5.0

User-agent strings probably should start with that if they are web browsers rendering HTML.For example, many search engine spiders do not include the "Mozilla": https://developers.whatismybrowser.com/useragents/explore/software_name/googlebot/

For our purposes of custom clients consuming the API, something more descriptive is helpful. We use this string to group the requests in our logs, so we can see if any trouble we are tracking is coming from one source, and if so then it allows us to contact the owner.