Exact Elo of low-quality engines, of SF at lower levels, or of Leela's old versions...?

Cylvouplay

I'm trying to select a few engines to serve as a scale, so that I can gauge a player's level or another engine's level by comparison.

It's pretty hard to find the right basis for comparison.

Does anyone know the Elo of Stockfish 9 when it plays at lower levels like 0/10, 1/20, 2/20, 3/20, etc.?

On the other hand, I use the older versions of Leela, which are more interesting because Leela simply doesn't play well. It doesn't play perfectly and then suddenly blunder a queen. It can blunder a queen, but that is within its normal range of blunders, nothing shocking, and in general its blunders are not obvious, so it really plays more like a beginner. But we have no real knowledge of the Elo of these old Leela engines. So if you know a formula that has really proved correct... I have one, but I think it is exaggerated (around 200 higher than the real Elo).

Or any low-level engine that really plays like a beginner and whose Elo is known for sure...
The lowest on the CCRL list is Zoe @ 1652. That would be fine, but the developer says it's Elo 1400-1500...

As a reward, I'll calculate the Elo of other engines and work out exactly which old Leela versions play at which levels. It would be interesting to know for sure that this Leela version plays at Elo 1400 and this other one at Elo 1600, wouldn't it? I'll share my results, I just need a starting point. Any help may... help!

EscherehcsE

I've never found an easy method of determining accurate ratings for these types of engines. The only accurate way would be for a number of federation-rated players to play the engines at various levels under tournament conditions. It would require a monumental investment in man-hours.

I just use a few of my favorite engines at certain levels as benchmarks to gauge my relative improvement over time.

Cylvouplay

Tx. I get very different results depending on which side I calculate from. If only I had just ONE SINGLE engine with a really precise rating, the rest would follow.

I guess I'll start from Pierre 1.7. CCRL says 1667, and the programmer regularly tests it against humans in rated games and says it's around 1650. That's two different evaluations with almost exactly the same result, so it may well be a very accurate rating. Now I just have to make it play many, many games against all the others and calculate!

Still, any other help would be precious. Better to have two solid comparison points than one.
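
For reference, the "play many games against an anchor and calculate" step can be sketched roughly as below. This is only a minimal sketch, assuming the standard logistic Elo model (expected score = 1 / (1 + 10^(-D/400))); the function names and the sample match result are made up for illustration, and the 1650 anchor is just the Pierre 1.7 figure mentioned above.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Elo difference implied by an average score, under the logistic Elo model.

    score is the fraction of points scored (win = 1, draw = 0.5, loss = 0)
    and must be strictly between 0 and 1.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def estimate_rating(anchor_rating: float, wins: int, draws: int, losses: int) -> float:
    """Estimate an engine's rating from its results against one anchor engine."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    return anchor_rating + elo_diff_from_score(score)


if __name__ == "__main__":
    # Hypothetical result of a 100-game match against the anchor (assumed 1650).
    print(round(estimate_rating(1650, wins=30, draws=20, losses=50)))
```

With that made-up 30/20/50 result the tested engine scores 40%, which comes out around 70 Elo below the anchor, so roughly 1580. A 100% or 0% score gives no finite estimate, which is why the function rejects it.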

EscherehcsE
Cylvouplay wrote:

I guess I'll start from Pierre 1.7. CCRL says 1667, and the programmer regularly tests it against humans in rated games and says it's around 1650. That's two different evaluations with almost exactly the same result, so it may well be a very accurate rating.

I'm not sure I'd necessarily trust the CCRL rating for Pierre 1.7. It only played 20 games at 40/40, and it lost every single game.

EscherehcsE

Also, Pierre 1.7 JA seems to have a critical bug, at least when used in Arena. When it plays as White, it resigns for its opponent on move 1, which is an obvious illegal move. :)

Cylvouplay

Wow, Tx for opening my eyes.

If Pierre makes an illegal move by resigning on move one, I'll punish him by making him lose the game. That way he won't do it again. We need some discipline in the engine ranks, otherwise...

Anyway, I started a big tournament. Although I haven't seen Pierre resign at all, I find it hard to believe it's 1650. But it's always easier to catch others' blunders than our own, so I may be biased. Plus, these are computer-style blunders that are very easy for us to spot, while I may make more blunders that I don't see but that Pierre would never make.

But the biggest problem I've run into is the very high variance. All these programs, whether SF lvl 0 / 1 / 2 / 3, Leela at (allegedly) 500, 700, 1000, 1100 Elo, or crap engines, blunder very badly, and how often they blunder is of course completely... chaotic. You could have SF9 lvl 0 play very well 5 games in a row and beat Pierre, and then your Elo estimate will be way off for a long time. Only after thousands and thousands of games will the Elo estimates start to be clean and fully relevant. Maybe I could cheat a bit and adjust using what is known about the relative strength of the different Leela versions. These versions have been tested thousands of times, and the relative strength between them is pretty solid information that could be used in a smart way to adjust the final Elo ratings without ruining the very point of the experiment, which is to give an anchor to the whole set of relative ratings.
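
To put a rough number on how many games it takes for the noise to settle, here is a small sketch of the error margin on an Elo difference. It assumes independent games, the logistic Elo model, and a simple normal approximation for the 95% interval on the score; the function names are made up, and this is not how any particular rating tool does it.

```python
import math


def _clamp(p: float, eps: float = 1e-6) -> float:
    """Keep a score strictly inside (0, 1) so the Elo formula stays finite."""
    return min(max(p, eps), 1.0 - eps)


def elo_diff(score: float) -> float:
    """Elo difference implied by an average score (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / _clamp(score) - 1.0)


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, estimate, high) Elo difference for a match result.

    Uses the sample variance of the per-game results (1, 0.5, 0) and a
    normal approximation; z = 1.96 corresponds to roughly 95% confidence.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    stderr = math.sqrt(variance / n)
    return (
        elo_diff(score - z * stderr),
        elo_diff(score),
        elo_diff(score + z * stderr),
    )


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        low, mid, high = elo_interval(n // 2, 0, n // 2)
        print(f"{n:6d} games: {mid:+.0f} Elo, range {low:+.0f} .. {high:+.0f}")
```

For an even match with no draws, this approximation gives a margin on the order of ±250 Elo after 10 games and about ±20 after 1000, which fits the impression that only thousands of games make the numbers clean.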

Later I'll run these programs (at least the best of them) against stronger opponents whose Elo is known for sure, or at least a little more precisely. Then it won't be too hard to come back to my low-level engine results and adjust. I just need to keep one very clean PGN with all the results, which is not that easy with Arena: it's nice and I like it, but it's full of bugs. And I'd say the PGN way of building a database is far from the smartest or the safest, as one small bug can easily destroy a lot of information. But I'll stick to my plan. Will let you know the results, if any ;)

EscherehcsE

Interesting. I tried Pierre 1.7 JA in the ChessGUI interface, and the "resign for opponent" bug is not present. Pierre works OK in ChessGUI, but not in Arena.

EDIT - Now it's strange - A little while later, after messing with the Pierre engine configuration in Arena, the bug is gone, and I didn't make any significant changes. I have no explanation for the bug initially being there, then mysteriously disappearing.

EDIT 2 - The bug reappeared. I played with some more settings, but I wasn't able to consistently get rid of the bug. I'd say this engine is simply problematic when used in Arena. Still, it works well in ChessGUI.

Cylvouplay

Tx for your contribution Esche!

BTW you posted something interesting here last year, I wanted to make use of it :
https://www.chess.com/forum/view/general/chess-engine-that-plays-in-the-range-1500-2000

But it turns out that there is no way to make use of the Elo limit in Arena (as far as I know).

Cylvouplay

UPDATE : I may have found something very promising here !

https://lh3.googleusercontent.com/-3EpNNlBuw_Y/WMGeUJYom8I/AAAAAAAAAck/j813NTmI_cIT8KH3QXOTqhfvkfFM6ynygCLcB/s1600/Texel%2BELO.JPG

EscherehcsE

Very interesting. Where did you find that link?

Cylvouplay

If only I knew! Firefox froze just after I found it, and I only had time to save the image's address to the clipboard, thinking it would be easy to find it again afterwards with a Google reverse image search or a Google advanced search... But it's impossible to find. Anyway, it was only a reply in a forum, and the discussion didn't seem very interesting to me at the time. What I remember: this image was the only real information there. Nothing about where the image came from, no further explanation.

I'm now testing this. Texel at level 0/1000 is supposed to play at Elo 800, but honestly it plays so badly that I can hardly believe it. Elo 800 is a rating that a real beginner could have, isn't it? A beginner is able to understand that it's better to take the opponent's exposed queen than to move a knight at random for no reason. Texel lvl 0/1000 is playing almost randomly. It seems much lower than Elo 800 to me. Even the very first Leela ID, which is supposed to be Elo 500 (or around that), crushes Texel lvl 0/1000 100% of the time. Texel 0/1000 never wins, even against the weakest possible Leela (which already blunders her queen half of the time, you know, preferring to save a pawn because she doesn't know which of the two is worth more).

So... I'm perplexed. More testing to come.

EscherehcsE
Cylvouplay wrote:

If only I knew! Firefox froze just after I found it, and I only had time to save the image's address to the clipboard, thinking it would be easy to find it again afterwards with a Google reverse image search or a Google advanced search... But it's impossible to find. Anyway, it was only a reply in a forum, and the discussion didn't seem very interesting to me at the time. What I remember: this image was the only real information there. Nothing about where the image came from, no further explanation.

I'm now testing this. Texel at level 0/1000 is supposed to play at Elo 800, but honestly it plays so badly that I can hardly believe it. Elo 800 is a rating that a real beginner could have, isn't it? A beginner is able to understand that it's better to take the opponent's exposed queen than to move a knight at random for no reason. Texel lvl 0/1000 is playing almost randomly. It seems much lower than Elo 800 to me. Even the very first Leela ID, which is supposed to be Elo 500 (or around that), crushes Texel lvl 0/1000 100% of the time. Texel 0/1000 never wins, even against the weakest possible Leela (which already blunders her queen half of the time, you know, preferring to save a pawn because she doesn't know which of the two is worth more).

So... I'm perplexed. More testing to come.

Well, the Texel 1.07 Readme file states that strength 0 *IS* a random mover. I might try testing it against BrutusRND to see if it really does make random moves.

(Just watching the first game, it does look like Texel strength 0 is either a random mover, or not far from it.)

madratter7

There is no real good way to calculate an ELO for a computer when it is playing people. Typically, playing against a computer starts out hard and gets easier. Another way to say this is we adapt to the way the computer plays, but the computer rarely adapts to the way we play at a given level. (There are computers that adapt their level as you play them - Fritz 16 is an example).

Example: Let's say that we are playing a computer that usually plays strong moves but weakens itself by occasionally playing a complete blunder. When you first play the computer, you may not know this, so you play in your ordinary way (and get beaten a lot). But once you realize it is eventually going to do something completely stupid, you adjust by playing a safe, tactically controlled game. Eventually, it throws a piece away (without you really needing to work for it), and you go on to win.

If you think that is an unrealistic example, there are many computers that weaken themselves in essentially this way.

EscherehcsE
EscherehcsE wrote:

Well, the Texel 1.07 Readme file states that strength 0 *IS* a random mover. I might try testing it against BrutusRND to see if it really does make random moves.

(Just watching the first game, it does look like Texel strength 0 is either a random mover, or not far from it.)

I had Texel 1.07 Strength 0 play BrutusRND. Out of ten games, each had one win, and there were eight agonizingly long draws, ending in either insufficient material or the 50-move rule. Yep, Texel 1.07 Strength 0 is a pure random mover. I'm not sure what rating you'd give it, but since it can't be negative, I guess 0 wouldn't be far off. :)

Cylvouplay

I was completely wrong about ELO 100 and ZERO. I just happened to think about it again and suddenly wondered where I got lost in the realm of illusions to come up with such miserably mistaken interpretations :) Anyway, the point still stands that below around 800, Elo is probably not a meaningful way to express ability with any clarity. Still, I didn't mean to erase my mistakes; hiding them was not the plan. I just edited the post instead of writing a new one, which was another mistake on top of the previous one :) Mistakes never come alone.

EscherehcsE
madratter7 wrote:

There is no real good way to calculate an ELO for a computer when it is playing people.

<snip>

Agreed, although I think the main reason is that computer rating lists seldom correlate to human rating lists. (Computer ratings tend to be higher.)

I continue to like HIARCS and Delfi Trainer for reasonably natural moves. (Both are commercial, and I'm not sure if you can still get Delfi Trainer.) As free engines go, MadChess might be worth looking at. It uses four parameters in its LimitStrength algorithm:

https://www.madchess.net/the-madchess-uci_limitstrength-algorithm/

Cylvouplay

Yes, it seems they put a huge amount of work into that.

Cylvouplay

Houdini 5 gives an exact Elo correspondence when you configure the engine. For example, if you set H5.1 to lvl 23/100, it displays "correspond to ELO 1690".

That's pretty much exactly what I was looking for :)

With this plus the work already done on the matter, I'll have a nice scaling team, I guess.

As you may know, I don't have H6, but I'm considering some options. Still, 5 is already very interesting, and it gives me an alternative while I take the time to make up my mind about 6.

I fight the Loopwatch bug by running more tournaments of smaller size. And for some reason, it got better after I changed the graphics card. Makes no sense, but voilà.

So I'm back on the matter, Arena running 24/7 ;)

EscherehcsE

One thing I like to do when running very long tournaments in Arena is to make periodic manual backups of the six tournament files for that tournament. That way, if anything disastrous happens, I can just reload the last good backup of tournament files and resume the tournament from that point.
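
If you want to take the manual step out of it, a timestamped copy can be scripted. The following is only a sketch: it assumes the tournament files sit in one folder and share a base name with different extensions, which is my assumption for illustration and not something taken from Arena's documentation. The folder paths, the base name, and the function name are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_tournament(src_dir, base_name, backup_root):
    """Copy every file named <base_name>.* from src_dir into a new
    timestamped folder under backup_root.

    Returns the backup folder and the list of copied file names.
    """
    src = Path(src_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{base_name}-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for f in sorted(src.glob(f"{base_name}.*")):
        if f.is_file():
            shutil.copy2(f, dest / f.name)  # copy2 also keeps timestamps
            copied.append(f.name)
    return dest, copied


if __name__ == "__main__":
    # Hypothetical paths and tournament name; adjust to your own setup.
    folder, files = backup_tournament(
        "C:/Arena/Tournaments", "LowEloScale", "D:/ArenaBackups"
    )
    print(f"Backed up {len(files)} files to {folder}")
```

Checking that the number of copied files matches what you expect (six, in this case) is a cheap way to notice when the assumed naming pattern does not fit.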

Cylvouplay

Tx for the advice :)