Game Review Common Confusion


Introduction


Reviewing your games is one of the best ways to improve your chess. The chess.com Game Analysis is one of the most useful tools to assist you in the reviewing process. It helps you find flaws in your play and shows you better alternatives.

But the feedback can sometimes also be confusing. In this article I'll discuss the mechanics of engine analysis and the Game Review to clear up some common misconceptions. Any confusion that remains after you've read this article is probably caused by a lack of understanding of the position itself. That's not for me to solve; it's above my pay grade.

What do I mean by confusion? Every time the feedback makes you think "how can move X be the best move?" or "how can move Y be an inaccuracy, not a blunder?" or "what crack is our beloved bot on?"

This will not be a how-to guide teaching you how to use the engine to improve your chess. The focus of this article is on the feedback a player receives in post-game analysis and how the output can lead to confusion.


1. Engine Exaggeration


We have all had this happen from time to time: during the game review you are trying to find out why your move wasn't the best move, so you click on the top engine move, and as soon as it's played the engine tells you "Excellent Move!" What? Why isn't the best move the best move?

Stockfish recommends multiple moves

It’s a very common source of confusion, as can be seen from forum topics like inconsistent analysis, stockfish changing its mind, how can the best move be inaccurate, why is best move a mistake and can someone explain this move. That last topic is a great example.


Example from the forum

The OP asks: "Why's the computer saying Bishop to a1 is the best move for white?"

Position in the forum topic

The way the information is presented is misleading, because it creates the sense that you failed to find the best move. 

In reality the engine evaluation is much more nuanced. Here is the plain Stockfish output:

Stockfish recommendations

From a human perspective, it’s nonsense to declare one of these moves better than the others. No chess player can explain the difference between a +6.1 and a +6.0 evaluation. All five candidate moves are about equally strong. When I did the analysis, the Game Review told me Bd4 is best. A different answer, but stated just as firmly.

The full picture of these five lines and their evaluations wouldn’t give a player the impression that Ba1 is a move that needs further exploration.

Takeaway: Instead of presenting multiple moves as equally strong, the Game Review will pick the top engine move and call it best, exaggerating the relative evaluation.

Example from my own games

Here is a very common position when playing the London at lower ratings. It can be reached through multiple move orders, for example 1. d4 d5 2. Bf4 Nf6 3. e3 Nc6.

In this position the 10 best moves are all within a 0.30 evaluation range of each other, meaning that the 10th best move is less than 0.30 worse than the top engine move. The best 5 engine moves are even closer. Practically, this means that sometimes Nf3 is considered the best move. Sometimes c4. Sometimes Nd2. Very close behind are a3, Bb5, Nc3, Bd3, c3 and Be2.
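
If you want to see this fuller picture for yourself, you can ask the engine for several principal variations (MultiPV) instead of a single line. Below is a minimal sketch using the python-chess library; the Stockfish path and the search depth are assumptions you'd adapt to your own setup.

```python
# A minimal sketch, assuming the python-chess library and a local Stockfish
# binary (the path and the depth are assumptions, adapt them to your setup).
import chess
import chess.engine

# The London position from the example above: 1. d4 d5 2. Bf4 Nf6 3. e3 Nc6
board = chess.Board()
for san in ["d4", "d5", "Bf4", "Nf6", "e3", "Nc6"]:
    board.push_san(san)

with chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish") as engine:
    # Ask for the 10 best lines (MultiPV) instead of only a single "best" move.
    infos = engine.analyse(board, chess.engine.Limit(depth=20), multipv=10)
    for info in infos:
        first_move = info["pv"][0]
        score = info["score"].white()  # evaluation from White's point of view
        print(board.san(first_move), score)
```

Seeing all ten evaluations side by side makes it much harder to be misled by a single "best" label.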


2. Stockfish Scapegoat


At the root of the miscommunication lies the distance between Stockfish’s abstract calculations on the one hand and chess players who are still developing their understanding of the game on the other. In between these two extremes sits chess.com as an intermediary, building functionality to translate the engine's data into easily digestible feedback for players.

It’s important to realize that when players are confused by the engine analysis or by Stockfish, most of the time they are talking about the chess.com Game Review.

Let me be clear that both the Game Review feature and Stockfish are great tools to help you improve. But the Stockfish engine and the Game Review aren't the same thing! They are neatly integrated by chess.com, which makes using them super easy, barely an inconvenience. But this integration also leads to confusion, as demonstrated in the previous chapter.

  • Stockfish is simply the chess engine that calculates values for positions.
  • Game Review is a chess.com feature that translates the engine output into feedback for the player.

Computer Engine

When I said Stockfish is simply the chess engine, what I meant to say was: Stockfish is absolutely amazing; with unrivaled performance it remains the best chess engine in the world. When I was growing up, it was unfathomable that a tool this brilliant and powerful would be available for anyone to use for free.

However, there are a couple of things to keep in mind when using Stockfish:

  • Stockfish evaluation inevitably comes with variance, tiny differences in outcome for each calculation. The engine lines used as an example in the previous chapter might be in a different order if I let the engine run for 2 minutes longer.
  • Stockfish evaluates positions, not moves. Instead of evaluating candidate moves and comparing them, Stockfish evaluates the positions each candidate move leads to and uses those to compare the moves (see the sketch below). Logically this makes perfect sense, but the results can sometimes be counterintuitive.
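
To make that second point concrete, here is a minimal sketch (again python-chess plus a local Stockfish binary, both assumptions on my part, not anything the Game Review exposes). It scores the position after each candidate move and then ranks the moves by those position scores, which is conceptually what the engine is doing.

```python
# A minimal sketch, assuming python-chess and a local Stockfish binary.
# It mimics the idea conceptually: score the position AFTER each candidate
# move, then rank the moves by those position scores.
import chess
import chess.engine

board = chess.Board()   # any position will do; the starting position here
mover = board.turn      # remember whose move it is

with chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish") as engine:
    scores = {}
    for move in board.legal_moves:
        san = board.san(move)
        board.push(move)  # play the candidate move
        info = engine.analyse(board, chess.engine.Limit(depth=12))
        # Score of the resulting position, converted back to the mover's view.
        scores[san] = info["score"].pov(mover).score(mate_score=100000)
        board.pop()

# Rank the candidate moves by the evaluation of the positions they lead to.
for san, cp in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{san}: {cp / 100:+.2f}")
```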

Game Review

The Game Review then adds an additional layer of interpretation. Notable features are the highlighting of key moments, the suggestion of best moves, and the pointing out of mistakes and blunders. The purpose is to clarify the engine output and make it easier to understand. However, this feature should come with a warning, because of these issues:

  • Oversimplification. When multiple positions have the same Stockfish evaluation, or practically the same, it’s hard to tell which one is best. Presenting one move as the best move can be misleading.
  • Move Classification is subjective. Feedback has more meaning when a player like Magnus is held to a different standard than a 300-rated player, but it can also lead to confusion.
  • Finally, there is the Coach feature that tries to explain in plain English why a move is good or bad. I find it entirely unreliable, so I've never used that part of the review.

3. Evaluation Equalization


There’s also the opposite problem of the one discussed in chapter 1. Instead of the Game Review confidently telling players that one move is better than the others even though they are basically equal, this chapter is about positions where the engine doesn't distinguish between candidate moves at all, because they eventually lead to the same position.

When all roads lead to Rome, does it really matter which road you take?

Stockfish compares positions, not moves. This makes perfect sense logically, as a move is only as good as the position it leads to. We humans, however, tend to think more in terms of moves than positions, at least at lower rating levels.

Players should be aware of the way Stockfish calculates because it can lead to useless or dangerous conclusions. 


Example from my own games

This position comes from a variation that could have been played in a daily game of mine from February '22.

In this simple example the top 4 Stockfish moves get exactly the same evaluation. Promoting the pawn to a Queen is just as good as promoting to a Rook, a Bishop or a Knight, because Stockfish assumes that after bxa8 black will recapture with Qxa8. Whatever the pawn promotes to, all 4 moves lead to the exact same position.

Top engine moves

The value of all 4 candidate moves fluctuates constantly while the engine is running, I think because not all lines are updated simultaneously. As a result, there is a different top engine move every couple of seconds or minutes while the analysis is still ongoing.

Fundamentally, it feels wrong to me to evaluate these moves as equal to each other. They are only equal if black takes with Qxa8. In all other scenarios promoting to a Queen is better. Thus bxa8=Q is objectively best, maybe even mathematically best.

Takeaway: the engine will consider candidate moves as equal when they lead to the same position, even though one move is clearly better in most scenarios.
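
To see why the engine treats these promotions as interchangeable, here is a minimal sketch with python-chess. The position is a constructed toy example of my own, not the actual position from the game, but the mechanism is the same: every promotion gets recaptured, and the resulting positions are literally identical.

```python
# A minimal sketch, assuming python-chess. The position is a constructed toy
# example (not the position from my game): White pawn on b7, Black rook on a8
# and Black queen on d8, kings on g1 and g8.
import chess

START_FEN = "r2q2k1/1P6/8/8/8/8/8/6K1 w - - 0 1"

resulting_positions = set()
for piece in ["Q", "R", "B", "N"]:
    board = chess.Board(START_FEN)
    board.push_san(f"bxa8={piece}")  # promote (capturing the rook on a8)
    board.push_san("Qxa8")           # Black recaptures the promoted piece
    resulting_positions.add(board.board_fen())

# All four promotions lead to the exact same position once the piece is
# recaptured, which is why the engine gives them identical evaluations.
print(len(resulting_positions))  # prints 1
```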

Example from my own games

Here is an example from one of my own games that I played several years ago. My opponent played an Englund Gambit and I refuted it successfully, gaining a winning position. However, I was not careful enough and I was too greedy. My opponent launched a counterattack and I ended up in this position. White to move.

For Stockfish there are two moves that are equally good. But it's not hard to see that there's a significant difference between the two lines. In one line (Qxd2) white loses the Rook immediately after Qxa8+; in the other line (Kxd2) black can also take the Rook, but only after Qd4+ and Qc3+. That second line is much better against human players, because most players don't play Qd4+.

Always keep in mind that humans don't play like engines and can be counted on to make mistakes.

Takeaway: the engine will consider candidate moves as equal when they lead to the same outcome, even when one move is obviously better against human players.

4. Counterintuitive Commendations


In positions that are either completely winning or completely losing there's another common type of confusion. This time it's not the Game Review causing the confusion; the engine output itself is confusing. This mostly affects lower-rated players who aren't aware of what the evaluations for the candidate moves mean.

Stockfish calculates positions, not moves. The top engine move is the move that leads to the best position at the end of the line, but the benefit of said move might not be evident until much further down the line, depending on the depth.


Example from the forum

In this topic there was confusion about how losing the Queen could be the best move, according to Stockfish.

Position in forum topic

The first issue is that this position is extremely winning for white, and many candidate moves lead to positions that are also completely winning. It's hard to determine which totally winning position is the most totally winning one. Or, vice versa, in this case from black's perspective: when all candidate moves lead to positions that are totally losing, which one is the least totally losing?

The second issue has to do with the difference between what the Stockfish output means and the way we perceive what a best move is. In the engine output shown above, Stockfish gives the highest evaluation for black after 20. Nxe5 (taking a Bishop). Therefore, Nxe5 is the best move, even though it loses the Queen immediately.

However, this is calculated (here) at depth 22. What the engine suggests isn't that Nxe5 is the best move, but that 11 moves from now you'll have the best position if you play Nxe5 now. Sounds different, doesn't it? It doesn't put the confusing label of best move on Nxe5.

There is a third issue here: the issue of variance again. The newest Stockfish version finds both f6 and Qd8 as better moves than Nxe5. That's why you should always be aware that Stockfish, even though it's reliable, is not as precise as we tend to think.

Takeaway: When many candidate moves lead to overwhelmingly winning positions, the best move is not always intuitive.
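
One way to see this depth-dependence for yourself is to analyse the same position at increasing depths and watch the reported best move and evaluation shift. A minimal sketch, with the usual assumptions (python-chess, a local Stockfish binary, and a placeholder position you'd swap for the one you're interested in):

```python
# A minimal sketch, assuming python-chess, a local Stockfish binary and a
# placeholder position (swap in the FEN of the position you are analysing).
import chess
import chess.engine

board = chess.Board()  # placeholder: replace with chess.Board("your FEN here")

with chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish") as engine:
    for depth in (10, 16, 22, 30):
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        best = info["pv"][0]
        print(f"depth {depth}: best move {board.san(best)}, eval {info['score'].white()}")
```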


Example from my own games

In one of my own games I ran into an example that highlights the difference between the intuitively best move and the engine's best move. This was a Hillbilly Attack played against me in a Daily game last year.

It was the last move of the game; my opponent resigned after playing Kxd1. The game was over. I just happened to notice that Kxd1 wasn't the best move, so I checked to make sure.

Top engine moves

Indeed, Kxd1 isn't the top engine move. The top engine move for white is a3, not taking the Knight. For black then the top engine move isn’t to save the Knight or defend it, instead it's Kg7 ignoring the Knight. The second move for white, Kd2, walking the King straight past the Knight, illustrates once more how irrelevant the Knight is.

Stockfish will win a position with a Rook and four pawns against only three pawns with its eyes closed. Having an extra Knight on top of that doesn't improve the evaluation. It may even slow down the process.

Btw, at higher depths (45+) Stockfish will choose the natural looking Kxd1 as the top engine move. But still the evaluation for the various candidate moves is remarkably similar. One could argue that they are practically equal. Of course, as a human player you should take the Knight. And resign immediately after, just like my opponent did.


5. Review Relativity


Another recurring theme on the forum is the way the post-game analysis becomes very mild when classifying moves in a position that is completely winning. When all moves are winning, a move that appears to be an obvious blunder is only considered a mistake or even just an inaccuracy.

What is a blunder depends on the players and the position.


Example from the forum

The forum post that triggered me to investigate move classification was (this topic), where blundering the Queen was only considered a mistake instead of the blunder the player expected.

How can hanging a queen be a mistake but not a blunder?

In this position the evaluation dropped from +8.1 to +5.8 when white played 23. Qg6 instead of 23. Nd2. It's marked as an inaccuracy, not as a blunder, because white's winning chances, as predicted by the chess.com algorithm, decreased by more than 5% but less than 10%.

The reason this happens has to do with the way chess.com classifies moves. Whether a move is a blunder depends not only on the change in evaluation, but also on who the players are. Once I learned more about move classification on chess.com, I began to experiment with it to understand it better. This gave me some useful insights.


About Move Classification


In order to understand Move Classification it’s important to begin with the mechanics, before diving deeper into the implications. This is the information that chess.com provides here:

"Classifying moves is a mix of art and science. Where is the line between a good move and an inaccurate one? How is a blunder defined for a chess master compared with a new player? What matters more, going from +2 to +1 or from +0.7 to +0? What engine evaluation is needed for a position to be considered “winning”?

Expected points uses data science to determine a player’s winning chances based on their rating and the engine evaluation, where 1.00 is always winning, 0.00 is always losing, and 0.50 is even.

Basically, at 1.00 you have a 100% chance of winning, and at 0.00 you have a 0% chance of winning. After you make a move, we look at how your expected points (likely game outcome) have changed and classify the move accordingly. The table below shows the expected points cutoffs for various move classifications".

Image from Chess.com support

Summarized, the classification of a move is based on the change in winning chances for that player. Any move that decreases a player's winning chances by 20% or more is a blunder, but if the winning chances drop by only 10% to 20% it's just a mistake.

This wasn’t new information to me, but I’ve never really let the implications sink in.

Takeaway: Moves are classified based on how they impact the predicted winning chances of a player. 
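
To illustrate how such a classification could work, here is a minimal sketch. The 20% blunder and 10% mistake thresholds come from the summary above; the 5% inaccuracy threshold follows the earlier example; the expected-points curve itself is a rough placeholder of my own and not chess.com's actual rating-dependent model.

```python
# A minimal sketch of the idea, not chess.com's actual algorithm. The 20%
# (blunder) and 10% (mistake) cutoffs come from the text above; the 5%
# (inaccuracy) cutoff and the expected-points curve are my own placeholders.
import math

def expected_points(eval_pawns: float, rating: int) -> float:
    """Rough placeholder: turn an engine evaluation (in pawns, from the
    player's point of view) into a winning probability between 0.0 and 1.0.
    Assumption: stronger players convert small advantages more reliably,
    modeled here as a steeper curve."""
    steepness = 0.4 + rating / 4000
    return 1 / (1 + math.exp(-steepness * eval_pawns))

def classify(eval_before: float, eval_after: float, rating: int) -> str:
    """Classify a move by how much it lowered the player's expected points."""
    drop = expected_points(eval_before, rating) - expected_points(eval_after, rating)
    if drop >= 0.20:
        return "Blunder"
    if drop >= 0.10:
        return "Mistake"
    if drop >= 0.05:
        return "Inaccuracy"
    return "Good or better"

# The chess.com quote's own example: how bad is going from +0.7 to 0.0?
print(classify(0.7, 0.0, 700))    # "Inaccuracy" with these placeholder numbers
print(classify(0.7, 0.0, 2700))   # "Mistake" with these placeholder numbers
```

With this placeholder curve, the same evaluation drop is an inaccuracy for a ~700-rated player and a mistake for a ~2700-rated player, which is exactly the rating-dependence the next examples demonstrate.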


Example from the forum 2

What I did is take the PGN of the game I used in the previous example and copy-paste it into the analysis tool with one adjustment: I increased the Elo rating of each player by 1000. This means the analysis is performed on the assumption that the game was played between a 1665-rated player and a 1675-rated player. Then I repeated the process, adding another 1000 rating points for each player.

The first interesting difference appeared when I compared the classifications of the first 7 moves of the game.

Evaluation changes matter more at higher ratings. Here is how the first 7 moves played in the game are classified at each rating level:

  • ~700 (the original ratings): 3 book, 2 best, 3 excellent, 2 good, 2 inaccuracies, 1 mistake
  • ~1700: 3 book, 2 best, 1 excellent, 3 good, 2 inaccuracies, 2 mistakes, 1 miss
  • ~2700: 3 book, 2 best, 1 excellent, 3 good, 3 mistakes, 2 misses

See below for a clearer comparison.

Move Classification for moves 1 to 7.

For lower-rated players a change in evaluation of less than 1.0 doesn't change the predicted outcome much. For them, most moves are excellent or good, with a few inaccuracies and only one mistake.

For GMs these changes in evaluation do matter. That's because an advantage of +1.5 can be enough for a GM to win a game, so going from +0.6 to +1.3 is a significant change. That's why the same moves, played at GM level, would be classified as multiple mistakes and misses.

Takeaway: Game Review is much harsher towards higher rated players. 

In the same game I also found evidence of the opposite happening, which I will show you next.


Example from the forum 3

This one was a real eye-opener for me. This time the GMs get the better review and the lower rated players are considered to be blundering.

Move 32 from the same game

In this position the evaluation dropped from -4.25 to -M5 (mate in 5). For the 700-rated players of this game this is considered a blunder: white's chances of winning the game went down by more than 20 percentage points.
Ironically, the same move played between two ~1700-rated players is only a mistake. According to the classification table, this means the winning chances went down by only 10 to 20 percentage points.

For 2700-rated players it's even more distorted, as the move 32. g3 is only considered an inaccuracy. Why only an inaccuracy? White's winning chances decreased by fewer than 10 percentage points; they were likely already below 20% before g3 was played.

Takeaway: Game Review is much harsher towards lower rated players when a position is completely winning, or completely losing. 


Conclusions


These are the main findings of my research into the Game Review, engine evaluation and move classification.

  • Stockfish's top engine moves come with some variance each time you run the engine. That means that when multiple moves are almost equal in evaluation, it's not always the same move that comes out as the best move.
  • That’s why you shouldn’t blindly focus on what the Game Review states as the best move. The underlying engine evaluation for each candidate move might be more nuanced.
  • Stockfish evaluates positions, not moves. This can mean that multiple candidate moves are rated as equal because they eventually lead to the same position. Be aware that those lines will only converge when playing against Stockfish; against human opponents they might play out entirely differently.
  • When positions are overwhelmingly winning, the engine-recommended moves don't always make sense. A move that intuitively feels like a blunder might actually be the fastest way to convert a winning position.
  • The Move Classification (like blunder or excellent move) is based on winning chances. A move that intuitively feels like a blunder may not be classified as a blunder if the move has little impact on the predicted outcome of the game.

Thank you all for reading. Please let me know if you have anything to add or any questions.

My favorite articles

  • Game Review Common Confusion -  A guide on common misconceptions and confusing feedback of the Game Review and Engine Analysis (article)
  • Duckfest recommends Harry Mack - a short article on his Pogchamps performance but more importantly my recommendation on his best videos. (article)
  • Resign or Hand Over to Hikaru - How Hikaru helps to never resign (article)
  • Decisionmaking for Dummies - a guide for complete beginners on the fundamental process of decision making in chess (article)

More information about me, like my best games and some background can be found on my profile.