I am not sure you should pick one engine. Humans don't think based on one set of preplanned criteria. Does a microwave make hot dogs more "like a human" than an oven, a stove, or a campfire?
It depends on the person. If you think like Stockfish in the opening and then come up with an idea rated 1500 points lower, the idea could still be good: Stockfish might play the first moves of that line too, but by say the 3rd or 4th move you are showing a 2000-rated move. That doesn't mean you are 2000; it means you are starting to "crack". Finegold says "finally they are proving their rating" on stream when a 1400 player appears to be playing like a GM in the opening. Eventually that 2000-level decision making collapses and the player shows their real, true rating.
So the point I am making here is that we should use higher-rated engine criteria to assess human play in the opening. Once the middlegame is reached, there should be an evaluation of the "humanness" of the position, or, more to the point, its lack of computerness. How many rungs down the ladder should you go when evaluating a player? Let's look at this example.
After move 13, I wouldn't think like a GM; I would never have considered rerouting the knight like that with "so many" moves. If you look at the computer evaluation, the line I would have played is not far off the position in the GM game after Qb7, assuming Qb7 is also played after Nd2 in my line. From what I see we are talking about a difference of roughly -0.3. That's hardly a losing evaluation, but I know I am nowhere near GM level. So it must be the beginning of something.
Nc4 is giving me some handicap, but I think I could have seen that. Now look at the Stockfish evaluation: it is starting to show 0.00 versus a worse position in the GM line after 21...Rac8.
I think something has to change to capture this difference. It's not the number of games a computer plays that should produce a human rating evaluation; it should be a positional shift. I worked toward getting a rook to the c-file differently than Naiditsch did. Would an 1800-level player have done that? If no, then you know the computer settings need a shift. If yes, then compare it to 2000, then 2200, and so on until you get a no. Adjust accordingly.
But it's not based on multiple games; it's based on progress within a single game.
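For what it's worth, here is a rough sketch of how that ladder could be probed in code, using python-chess. It assumes a UCI engine that supports the standard UCI_LimitStrength / UCI_Elo options; the rating steps and time control are placeholder values I made up, and since strength-limited engines randomize their play you would really want several probes per level, not one:

```
# Probe each rating level: would an engine limited to that Elo play the
# same move the human played? Assumes a UCI engine supporting the standard
# UCI_LimitStrength / UCI_Elo options; levels and time are placeholders.
import chess
import chess.engine

def probe_ladder(fen, played_uci, engine_path, levels=range(1800, 2900, 200)):
    board = chess.Board(fen)
    played = chess.Move.from_uci(played_uci)
    answers = {}
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        for elo in levels:  # 1800, 2000, 2200, ...
            engine.configure({"UCI_LimitStrength": True, "UCI_Elo": elo})
            answers[elo] = engine.play(board, chess.engine.Limit(time=0.5)).move == played
    return answers  # e.g. {1800: True, 2000: True, 2200: False, ...}
```

The first level that answers "no" is where you would adjust, per the idea above.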
I did some research and testing to find a chess engine whose rating configuration is comparable to human ratings:
1) Chessmaster 11 (by Ubisoft):
In several forums I read that the Chessmaster personalities got their ratings after being tested against USCF-rated human players: they played several tournaments against real rated humans to establish each personality's rating. That looks great! But....
At the same time, though, one of the Chessmaster creators says on the forum that they didn't play enough games to establish accurate results, and that's why the CM personalities can be a few hundred points away from their real value (too low or too high??).
2) CCRL list:
One might think this rating list is accurate because it is calculated with the Bayeselo software. But I read that Bayeselo is not reliable for this purpose: it diverges from human ratings, especially at higher rating values. There is a paper about CCRL list ratings vs. human ratings:
http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=27485
3) Rodent IV, Shredder, Wasp, Arasan, etc:
In Arena I played several tournaments to compare several engines, including the ones mentioned. As a result, these engines' rating configurations match each other in strength; for example, in the same tournament Rodent IV configured at a 2000 rating beats Shredder at 1900 and loses to Arasan at 2100, etc.
However, their rating strengths differ from the Chessmaster 11 personalities once they are configured above 1800 points.
Concluding that those engines have the same rating strength among themselves, I took Rodent IV to compare against TheKing: I played several tournaments in Arena using TheKing v3.50 (the Chessmaster 11 engine) vs. Rodent IV at different ratings. I set up each CM personality as a new engine in Arena and ran "gauntlet" tournaments of Rodent IV against those personalities, one new tournament for each rating configuration. For example, this is the result with Rodent IV set at a 1600 rating (1700 USCF):
It can be seen clearly that Rodent IV at 1700 USCF starts losing against CM personalities rated 1700 and above and wins against those rated 1600 and below. That is a coherent, good result.
But the problem comes when the ratings are above 1700. For example, Rodent IV set at 2400 (2500 USCF):
As you can see, it beats Queenie (2873), Logan (2820)... an almost 400-point difference.
Another example, with Rodent IV at a 2600 rating (2700 USCF):
Rodent IV at 2600 beats all the CM personalities like butterflies thrown against a train.
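As an aside, for anyone who wants to reproduce these pairings outside Arena, here is a minimal sketch of one game scripted with python-chess. It assumes both programs speak UCI and that Rodent IV exposes the standard UCI_LimitStrength / UCI_Elo options (check what each engine actually reports with the "uci" command); the file paths are placeholders, and TheKing would need an adapter since it is not a native UCI engine:

```
# One scripted game between two engines, with Rodent IV limited to a given
# Elo. Paths are placeholders; option names assume the standard UCI ones.
import chess
import chess.engine

def play_game(white, black, limit=chess.engine.Limit(time=1.0)):
    board = chess.Board()
    engines = {chess.WHITE: white, chess.BLACK: black}
    while not board.is_game_over():
        board.push(engines[board.turn].play(board, limit).move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

rodent = chess.engine.SimpleEngine.popen_uci("./rodent-iv")  # placeholder path
king = chess.engine.SimpleEngine.popen_uci("./theking-uci")  # placeholder path
rodent.configure({"UCI_LimitStrength": True, "UCI_Elo": 2400})
print(play_game(rodent, king))
rodent.quit()
king.quit()
```

Loop that over colors and personalities and you get the same gauntlet Arena runs, with as many games per pairing as you have patience for.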
After playing several tournaments (one for each 50-point increment), I made this comparison table:
Rodent IV rating vs. CM rating:
Rodent IV 1600 --> matches CM 1600
Rodent IV 1700 --> matches CM 1700
Rodent IV 1800 --> from here on, no longer matches CM
Rodent IV 1900 --> CM 2100
Rodent IV 1950 --> CM 2150
Rodent IV 2000 --> CM 2200
Rodent IV 2050 --> CM 2250
Rodent IV 2100 --> CM 2300
Rodent IV 2150 --> CM 2350
Rodent IV 2200 --> CM 2400
Rodent IV 2250 --> CM 2600
Rodent IV 2300 --> CM 2700
Rodent IV 2350 --> CM 2600
Rodent IV 2400 --> CM 2780
Rodent IV 2450 --> CM 2850
Rodent IV 2500 --> wins 80-90% vs. all CM personalities --> CM 2900??
Rodent IV 2600 --> wins 90-100% vs. all CM personalities --> CM 3000+???
It looks like the gap grows exponentially.
That comparison table is not very accurate; I think you need more than 5 games per pairing to get consistent results. I made it just to get an approximate idea.
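To put rough numbers on those last two rows: the standard Elo logistic formula converts an expected score into a rating gap, and it also shows how noisy 5-game samples are. This is just the textbook Elo model, nothing engine-specific:

```
# Rating gap implied by an expected score, per the standard Elo model,
# plus a quick look at how noisy a 5-game sample is.
import math

def elo_diff(score):
    # Invert E = 1 / (1 + 10**(-D/400)) to get the gap D from the score E.
    return 400 * math.log10(score / (1 - score))

for s in (0.80, 0.90, 0.95):
    print(f"{s:.0%} score -> about {elo_diff(s):+.0f} Elo")
# 80% -> +241, 90% -> +382, 95% -> +512

# Standard error of a score measured over n games at true score p:
n, p = 5, 0.8
print("std error of the score:", round(math.sqrt(p * (1 - p) / n), 2))  # ~0.18
```

So an 80-90% score only implies a gap of roughly +240 to +380 points, and with 5 games per pairing even that is a coarse estimate, which fits the jumpy values in the table.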
So the questions are:
Which engine is best for comparing with human ratings??
Chessmaster had the best calibration method, since the personalities were played against human players, but unfortunately, according to one of the creators, they didn't play enough games to make it accurate.
Are Rodent IV, Shredder, etc. good for comparing with human ratings?? How did their developers calibrate their rating strength? Did they use Bayeselo or the CCRL list (which are inaccurate according to that paper)?
Stockfish uses 20 skill levels to set its strength, so it's not a good candidate for comparing with an estimated human rating.
Rodent IV can now run on Android, so it would be nice if someone could produce an accurate comparison table of human vs. Rodent IV ratings. But in my table I'm comparing CM ratings vs. Rodent IV, and I don't know how accurate the CM ratings are (that is, how close the personalities are to real human ratings).
Thank you in advance for your posts.