Convert pdf into a chess format - Chess Forums - Page 2

frojasg1

Aug 26, 2021

0

#21

The application does not work with figurine notation at the moment, but I am planning to improve the application to manage that kind of notation too.

Maybe it will be ready next year.

NM nbi1

Sep 4, 2021

0

#22

What we need is a trainable OCR program which creates a map of input symbols to output PGN. I've done a fair amount of research on this, but have not found anything suitable. Everything I've found is badly flawed in capabilities and/or implementation. What I would like to see is an OCR engine with a GUI that lets you train the OCR by opening a PDF document and selecting source symbols (either figurine or text). Whenever a symbol is selected, a popup dialog prompts the user for the corresponding output PGN (for example, a figurine queen would get mapped to 'Q'). You get the idea. The complete training may be a bit tedious, but it isn't as bad as it seems. The complete set of mapped input would include text, figurines, and ECO symbols. That's really not a large mapping set. I don't care about diagrams because they are trivial to construct from the PGN. When you consider the time saved over manually processing an entire document, it's obvious that such a trainable OCR would be a huge boon to productivity. Another huge plus is that the OCR would not need to be re-trained for other documents from the same publisher, as they are likely to use the same font and symbol sets. I find it astonishing that in 2021 no such application exists. Yes, it's a fairly intense development effort, but this is one of those apps that people would gladly pay for, especially since the OCR portion doesn't need to be chess specific. I've been accumulating a lot of chess PDF material and am getting rather exasperated over not being able to convert it to PGN, so maybe I'll take a stab at this project.
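
To make the mapping idea concrete, here is a minimal sketch in Python. It assumes the OCR stage has already turned each glyph into some identifier; Unicode chess figurines stand in for those identifiers here, and the names `FIGURINE_TO_PGN` and `figurine_to_pgn` are made up for illustration. Pawns map to an empty string because PGN writes pawn moves without a piece letter.

```python
# Hypothetical symbol map. In a trained system the keys would be whatever
# glyph identifiers the OCR stage emits; Unicode figurines stand in for them.
FIGURINE_TO_PGN = {
    "\u2654": "K", "\u2655": "Q", "\u2656": "R",   # white king, queen, rook
    "\u2657": "B", "\u2658": "N", "\u2659": "",    # white bishop, knight, pawn
    "\u265A": "K", "\u265B": "Q", "\u265C": "R",   # black king, queen, rook
    "\u265D": "B", "\u265E": "N", "\u265F": "",    # black bishop, knight, pawn
}


def figurine_to_pgn(text: str) -> str:
    """Replace every mapped figurine with its PGN piece letter."""
    return "".join(FIGURINE_TO_PGN.get(ch, ch) for ch in text)


print(figurine_to_pgn("1. e4 e5 2. \u2658f3 \u265Ec6 3. \u2657b5 a6"))
```

Characters that are not in the map pass through unchanged, so ordinary text and move numbers survive the conversion.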

frojasg1

Sep 4, 2021

0

#23

Hello nbi1,

Yes, what you have explained in the previous message was nearly my idea when I analyzed how to program it...

But I do not have enough hours in the day (days would be better if they had 36 hours :-D)

I have just started the development of a new application (a Java music player), so at the moment I will not have time to start that new version of ChessPdfBrowser.

If you are familiar with Java development, some months ago I started to work on a PDF library that extracts the images of characters. The library is finished and ready to be used, but there is still a lot of work to do on the OCR and on the modified text extractor that takes those images into account.

I have analyzed it and found that it is feasible, but it is not a trivial task, so I will have to work a lot of hours in order to finish the new version of ChessPdfBrowser.

That will probably be next year.

At the moment, if you want to take a look at the PDF library, you can download it from:

https://www.frojasg1.com:8443/downloads_web/downloadServletv3?file=pdfInspector.v1.0&origin=chess.com&language=English

NM nbi1

Sep 4, 2021

0

#24

Thanks. I think we're on the same wavelength with regard to the effort required to implement this. Recently I've been reading the most up-to-date Tesseract documents to get a sense of any improvement since I last looked at Tesseract some 10 years ago. Sadly it's as painful and tedious now as it was back then. I naively hoped someone might have created a language file for chess figurine algebraic, but no such luck. I also considered a design without an OCR base, such as an implementation using ImageMagick, but that is a herculean task which would likely yield a fragile one-off solution. I congratulate you for recognizing the need and working on a solution. In all the years that PGN has been around you seem to be the first developer to tackle this project. I completely understand the time demands made by such an undertaking. The bad state of affairs with Tesseract and other OCR software pretty much guaranteed that I couldn't get involved. I do have more time now, so if I'm feeling ambitious or just struggling with insomnia I might take another stab at Tesseract.

frojasg1

Sep 5, 2021

0

#25

Sleep is a basic need, so do not give it up!

NM nbi1

Sep 6, 2021

0

#26

Tesseract detects text characters just fine, but its inability to cope with figurine algebraic is a showstopper. It does detect individual figurine objects, but it does not interpret them in a consistent way. Three instances of a bishop occurring in the same paragraph could be interpreted in 3 different ways due to slight differences in the pixel layout. That makes the whole exercise useless, as going through an entire document to correct misinterpretations is a non-starter. This problem could be solved by a filter function having the logic "interpret as x any object reasonably close to the specified image xyz". In other words, be able to specify the attributes of what constitutes a match, but Tesseract does not have such a capability. As I'm not inclined to rewrite and/or enhance Tesseract, I guess I'll need to look elsewhere.
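
As a rough illustration of that filter logic, independent of Tesseract, the Python sketch below labels a glyph with the closest reference image only when the two differ in at most a chosen fraction of pixels. Everything in it is an assumption made for the example: the function names, the 0.15 tolerance, and the 5x5 toy bitmaps, which are not real figurines.

```python
def pixel_difference(a, b):
    """Fraction of pixels that differ between two equal-sized bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in a)
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def interpret(glyph, references, tolerance=0.15):
    """Label of the closest reference within `tolerance`, or None if none is close."""
    best_label, best_diff = None, tolerance
    for label, ref in references.items():
        diff = pixel_difference(glyph, ref)
        if diff <= best_diff:
            best_label, best_diff = label, diff
    return best_label


# Toy 5x5 bitmaps ('#' = ink); stand-ins for trained reference images.
REFERENCES = {
    "B": ["..#..", ".###.", "..#..", ".###.", "#####"],
    "R": ["#.#.#", "#####", ".###.", ".###.", "#####"],
}
noisy_bishop = ["..#..", ".###.", "..#..", ".##..", "#####"]  # one pixel off

print(interpret(noisy_bishop, REFERENCES))
print(interpret(["....."] * 5, REFERENCES))
```

Returning None for anything outside the tolerance matters as much as the match itself: an unrecognized glyph can then be flagged for manual review instead of being silently misread.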

NM nbi1

Sep 6, 2021

0

#27

PieDay314 wrote:

chessroboto wrote:

A poster recommended https://ebook.chessvision.ai. We discussed it here.

https://www.chess.com/forum/view/general/forward-chess-chess-studio-e-chess-books-chess-king-chessable?page=2

We have yet to hear back from more chess.com users with feedback. It won’t convert your legally owned PDFs to PGN files that you can upload to your own database or engine, but you can use their interface for analysis and to run Stockfish.

Thanks for your help.

This just picks off the diagrams from a PDF. While that is a very helpful feature it's not what some of us are after which is converting figurine algebraic notation to PGN. I really don't care about the diagrams because if you have the PGN it's easy enough to insert/generate diagrams at any point in the game. It's the generation of the PGN that's the primary issue.

frojasg1

Sep 10, 2021

0

#28

nbi1 wrote:

Tesseract detects text characters just fine, but its inability to cope with figurine algebraic is a showstopper. It does detect individual figurine objects, but it does not interpret them in a consistent way. Three instances of a bishop occurring in the same paragraph could be interpreted in 3 different ways due to slight differences in the pixel layout. That makes the whole exercise useless, as going through an entire document to correct misinterpretations is a non-starter. This problem could be solved by a filter function having the logic "interpret as x any object reasonably close to the specified image xyz". In other words, be able to specify the attributes of what constitutes a match, but Tesseract does not have such a capability. As I'm not inclined to rewrite and/or enhance Tesseract, I guess I'll need to look elsewhere.

I think I have thought of a way to match many slightly different instances of the same piece figurine against a single model.

I still have to confirm that it works, but it seems a good way to do the job:

First of all, we process each figurine and discard the blank regions at the top, bottom, left and right.

If the results are high-definition images (above 25 x 25 pixels), we can then resize them to a low-definition summary of, let's say, 10 x 10 pixels.

We will suppose that the model figurines against which we compare have been processed in the same way.

Then, between the 10 x 10 summaries of an input figurine and of the model for the same piece, the mean square error should ideally be low enough to identify the right model figurine image.
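
The steps so far (trim the blank borders, shrink to a 10 x 10 summary, compare by mean square error) can be sketched in plain Python as below. This is only a sketch under stated assumptions: images are lists of rows with 0 for blank and 1 for ink, the resize is a simple block average, and the function names and the toy "cross" and "frame" shapes are invented for the example rather than taken from ChessPdfBrowser.

```python
def crop_blank(img, blank=0):
    """Discard blank rows and columns at the top, bottom, left and right."""
    rows = [y for y, row in enumerate(img) if any(p != blank for p in row)]
    cols = [x for x in range(len(img[0])) if any(row[x] != blank for row in img)]
    if not rows:
        raise ValueError("image contains no ink")
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def summarize(img, size=10):
    """Shrink to size x size by averaging the pixels that fall in each cell."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        y0 = r * h // size
        y1 = max((r + 1) * h // size, y0 + 1)
        row = []
        for c in range(size):
            x0 = c * w // size
            x1 = max((c + 1) * w // size, x0 + 1)
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def mse(a, b):
    """Mean square error between two equal-sized grids."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def best_model(glyph, models, size=10):
    """Return (label of the lowest-error model, all scores)."""
    summary = summarize(crop_blank(glyph), size)
    scores = {label: mse(summary, summarize(crop_blank(m), size))
              for label, m in models.items()}
    return min(scores, key=scores.get), scores


def toy_glyph(kind, side, margin):
    """Toy 'cross' or 'frame' shape with blank margins (not a real figurine)."""
    n = side + 2 * margin
    img = [[0] * n for _ in range(n)]
    for y in range(side):
        for x in range(side):
            if kind == "cross":
                ink = side // 3 <= x < 2 * side // 3 or side // 3 <= y < 2 * side // 3
            else:
                ink = min(x, y, side - 1 - x, side - 1 - y) < side // 5
            img[y + margin][x + margin] = int(ink)
    return img


models = {"cross": toy_glyph("cross", 30, 3), "frame": toy_glyph("frame", 30, 3)}
label, scores = best_model(toy_glyph("cross", 40, 5), models)
print(label, scores)
```

The input cross is drawn at a different size and with different margins than the model, yet after cropping and summarizing, its error against the cross model stays far below its error against the frame model.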

If we do not have high-definition images of the figurines, we can still discard the blank regions in the same way as before. If the sizes of the input figurine and the model figurine are almost the same, we could do a kind of correlation between the images: move the smaller image inside the bigger one pixel by pixel, calculate the mean square error at each position, and keep the lowest value.
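
A minimal Python sketch of that sliding comparison follows. It assumes the smaller image fits inside the bigger one in both dimensions (an image that is taller but narrower would need padding first, which is not handled here), and the helper names and toy bitmaps are made up for the example.

```python
def to_pixels(rows):
    """Convert rows of '#'/'.' characters to lists of 1/0 pixels."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def min_sliding_mse(small, big):
    """Slide `small` over every position inside `big`.

    Returns (lowest mean square error, row offset, column offset).
    """
    sh, sw = len(small), len(small[0])
    bh, bw = len(big), len(big[0])
    if sh > bh or sw > bw:
        raise ValueError("`small` must fit inside `big`")
    best = None
    for dy in range(bh - sh + 1):
        for dx in range(bw - sw + 1):
            err = sum((small[y][x] - big[y + dy][x + dx]) ** 2
                      for y in range(sh) for x in range(sw)) / (sh * sw)
            if best is None or err < best[0]:
                best = (err, dy, dx)
    return best


big = to_pixels(["......",
                 "..##..",
                 ".####.",
                 ".#..#.",
                 "..##..",
                 "......"])
small = to_pixels([".##.",
                   "####",
                   "#..#",
                   ".##."])

print(min_sliding_mse(small, big))
```

To pick a piece, the same call would be repeated against each model figurine and the model with the lowest returned error kept.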

I will program it next year, with the new version of ChessPdfBrowser, but maybe somebody will want to check whether it works.

frojasg1

Sep 10, 2021

0

#29

I have to say that I am not an OCR expert, and with the described procedure, differences in luminance would not be detected ... But it can be a starting point.

NM nbi1

Sep 10, 2021

0

#30