Experiencing tons of server errors.

theegreentree

Hi,

I have been having a lot of server errors when I get on the website, definitely more than normal. It happens about one in four times, when it used to happen very rarely. Are there any steps I can take to reduce the server errors, or are you able to fix it?

Thank you,

theegreentree

theegreentree

Hi,

I have been experiencing an abnormal amount of server errors. They happen about a quarter of the time I get on Chess.com. They used to happen very rarely. Is there anything I can do, or can you fix it?

Thank you,

theegreentree

Martin_Stahl
theegreentree wrote:

Hi,

I have been having a lot of server errors when I get on the website, definitely more than normal. It happens about one in four times, when it used to happen very rarely. Are there any steps I can take to reduce the server errors, or are you able to fix it?

Thank you,

theegreentree

 

The site is experiencing really high loads and it's a site-wide thing. Staff are actively working on optimizations and capacity increases to alleviate the issues. 

theegreentree

Ah, I see. Thanks for letting me know. Having so many people on a site is usually a good sign, as it means people love your website. Thank you for making Chess.com a better place!

Elroch

I don't accept the excuse - it is not plausible that there has suddenly been an increase of a magnitude that could not be anticipated using common sense. It doesn't fit the facts either - since usage is very bursty, there should be occasional glitches due to the occurrence of peaks in activity unknown in the past. Instead, there are persistent very common errors all the time: the system is not fit for purpose.

It's all part of a general degradation in the quality of the site over time. For example, for a long time, when you posted a new post in a forum, you would then see the forum with the new post. Now, due to some combination of poor performance and bad code, the forum is displayed with the post missing. One refresh sometimes makes the new post appear. Sometimes it takes two refreshes or more. This sort of poor performance is essentially unknown on the modern Internet except on chess.com. It would be really good if someone with influence considered why this is and did something about it (seems a desperate wish).

I feel being over-polite about this has been inappropriate for some time now.

trimalo

Please be patient. The chess.com IT people are on deck and working hard. New servers need to be put in service, and it will take a few weeks.

Elroch

I think the way the site is managed is uniquely irredeemable now.

Martin_Stahl
Elroch wrote:

I don't accept the excuse - it is not plausible that there has suddenly been an increase of a magnitude that could not be anticipated using common sense. It doesn't fit the facts either - since usage is very bursty, there should be occasional glitches due to the occurrence of peaks in activity unknown in the past. Instead, there are persistent very common errors all the time: the system is not fit for purpose.

It's all part of a general degradation in the quality of the site over time. For example, for a long time, when you posted a new post in a forum, you would then see the forum with the new post. Now, due to some combination of poor performance and bad code, the forum is displayed with the post missing. One refresh sometimes makes the new post appear. Sometimes it takes two refreshes or more. ...

 

The site certainly has been seeing increases in traffic for a few weeks and most of the issues are occurring during normal peak times. It's that the peaks are a lot higher.

 

My understanding is the reason for the higher peaks isn't really known and is likely a combination of a lot of different things. Similar to the pandemic and Queen's Gambit spikes, it's an unexpected increase, higher and more sustained than the loads and growth that were planned for.

 

During higher loads, submissions, such as forum posts, are queued and asynchronously written to the DB. So when things are overloaded, you can see a delay between posting and being able to view the post.

 

It's also possible reads are load shared across replicated database instances, and there are multiple webservers that could be reading from a DB that doesn't yet have the replicated data. I don't have much insight into the site's architecture, just how it's possible to handle loads and spread connections, so I could be off some in the explanation.
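
As a rough illustration of the two delays described above, here is a toy Python model. It is not Chess.com's actual code, and every name in it (ForumStore, submit, drain_queue, replicate, read_thread) is made up for the example. It only assumes what was described: posts go into a queue, a background worker writes them to a primary DB later, and page loads read from a replica that may lag behind.

```python
from collections import deque


class ForumStore:
    """Toy model (hypothetical): a write queue, one primary DB, one read replica."""

    def __init__(self):
        self.write_queue = deque()  # submissions waiting to be written
        self.primary = []           # posts committed to the primary DB
        self.replica = []           # posts copied over to the read replica

    def submit(self, post):
        # The web server returns right away; the actual write happens later.
        self.write_queue.append(post)

    def drain_queue(self, max_writes):
        # A background worker commits at most max_writes queued posts.
        for _ in range(min(max_writes, len(self.write_queue))):
            self.primary.append(self.write_queue.popleft())

    def replicate(self):
        # The replica catches up with whatever the primary has committed.
        self.replica = list(self.primary)

    def read_thread(self):
        # Page loads are served from the replica, not from the primary.
        return list(self.replica)


store = ForumStore()
store.submit("my new post")
first_view = store.read_thread()   # still queued, not written yet
store.drain_queue(max_writes=1)
second_view = store.read_thread()  # written, but not replicated yet
store.replicate()
third_view = store.read_thread()   # now visible
print(first_view, second_view, third_view)
```

In this model the post is missing on the first two page loads and only shows up on the third, which would match needing one or two refreshes. Under heavy load both the queue and the replication step would take longer, so the gap gets more noticeable.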

Elroch

For several months (but probably less than a year) there has been a delay in posting long enough to mean that the forum is consistently displayed without the new post. It worked for many years before that.

If anything, there should generally be economies of scale - providing a similar service to a larger membership is more economical. Where is another company that has not successfully avoided such problems?

Martin_Stahl
Elroch wrote:

For several months (but probably less than a year) there has been a delay in posting long enough to mean that the forum is consistently displayed without the new post. It worked for many years before that.

If anything, there should generally be economies of scale - providing a similar service to a larger membership is more economical. Where is another company that has not successfully avoided such problems?

 

I know that asynchronous writing to the DB has been in place for a while. It's a mechanism to spread out load in the databases. Unfortunately, during high loads, it certainly becomes more noticeable.

 

As mentioned, I don't know the current architecture, but splitting content areas into their own databases and servers is likely planned, if it's not currently being done. Scaling ends up needing to be done by splitting out services. 

 

@erik has posted before about how early design decisions can cause unexpected growth pains later, such as what's happening now. My understanding is that effort has been going towards resiliency since the end of the previous spikes too. This recent traffic was unexpected.

Elroch

The last comment does suggest some recognition of the source of the problem and what needs to be replaced/migrated. I don't claim expertise, just some common sense and limited experience, but there are people who could be usefully consulted.

oilplank

"I feel being over-polite about this has been inappropriate for some time now."

I'm coming to agree with this. About 10% of all my games are being aborted on a regular basis since late December or so, or worse, sometimes the server hiccup resolves fast enough that the game isn't aborted and I've lost meaningful clock time. A single tweet 10 days ago merely acknowledging that there's a problem with no follow-up information isn't satisfactory to make me want to stick around and pay them for a subscription indefinitely, simply hoping that someday the core game-playing functionality of the site starts working again. I could speculate that these things take time to fix, but it's not my job to speculate here.

More information through official channels please, ideally with an expected timetable for a resolution.

n00bdrm

I like playing puzzles on my iPhone. I frequently have to stop because I get either “our servers made a dubious move” or a similar error (which causes the puzzle to repeat over and over). Very annoying, especially as a subscriber.

oilplank

...or still-worse: in my last 3 games this morning:

1. winning position, abort due to server error

2. totally legitimate loss with no server errors ;p

3. overwhelmingly winning position, server error and no access to the game to make a move, 15 minutes later the server comes up and marks it as a loss on time because I had the move when the server went down.


Yeah I'm done here.

theegreentree

Yeah, I lost because of a server error that chess.com said was their fault. I do understand that server errors might happen if lots of people visit their site. I just hope it is fixed soon though.

Martin_Stahl
oilplank wrote:

"I feel being over-polite about this has been inappropriate for some time now."

I'm coming to agree with this. About 10% of all my games are being aborted on a regular basis since late December or so, or worse, sometimes the server hiccup resolves fast enough that the game isn't aborted and I've lost meaningful clock time. A single tweet 10 days ago merely acknowledging that there's a problem with no follow-up information isn't satisfactory to make me want to stick around and pay them for a subscription indefinitely, simply hoping that someday the core game-playing functionality of the site starts working again. I could speculate that these things take time to fix, but it's not my job to speculate here.

More information through official channels please, ideally with an expected timetable for a resolution.

 

The site is actively working on optimizations and capacity increases to alleviate issues. It takes time to make major changes but my understanding is that work is their primary priority.

oilplank

How much time (roughly), and where should I look for ongoing official announcements on the issue?

Martin_Stahl
oilplank wrote:

How much time (roughly), and where should I look for ongoing official announcements on the issue?

 

I don't think there is an official ETA and there are multiple efforts in progress to assist in reducing issues. Each likely has different potential times to complete.

 

I would assume that once things are more stable, due to some of those efforts, there might be an article/news item released, but I don't know that for sure.

willbonness

So they will be issuing refunds to those who paid for premium memberships and can't access the site due to these server errors, right? 

....right?

Martin_Stahl
willbonness wrote:

So they will be issuing refunds to those who paid for premium memberships and can't access the site due to these server errors, right? 

....right?

 

I don't think anything has been stated, but the site's #1 priority is getting things stable and working on keeping things stable even with additional growth.

 

Once things are stable, they should be able to decide what they're going to do and will have a better idea what the member impact was; for example, how long it lasted.

 

https://www.chess.com/blog/CHESScom/chess-is-booming-and-our-servers-are-struggling
