What we need is a trainable OCR program which creates a map of input symbols to output PGN. I've done a fair amount of research on this, but have not found anything suitable. Everything I've found is badly flawed in capabilities and/or implementation. What I would like to see is an OCR engine with a GUI interface that lets you train the OCR by opening a PDF document and selecting source symbols (either figurine or text). Whenever a symbol is selected a popup dialog prompts the user for the corresponding output PGN (for example a figurine queen would get mapped to 'Q'). You get the idea. The complete training may be a bit tedious, but it isn't as bad as it seems. The complete set of mapped input would include text, figurines, and ECO symbols. That's really not a large mapping set. I don't care about diagrams because they are trivial to construct from the PGN. When you consider the time savings achieved over manually processing an entire document it's obvious that such a trainable OCR would be a huge boon to productivity. Another huge plus is that the OCR would not need to be re-trained for other documents from the same publisher as they are likely to use the same font and symbol sets. I find it astonishing that in 2021 no such application exists. Yes, it's a fairly intense development effort, but this is one of those apps that people would gladly pay for especially since the OCR portion doesn't need to be chess specific. I've been accumulating much chess PDF material and am getting rather exasperated over not being able to convert it to PGN so maybe I'll take a stab at this project.
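A minimal sketch of the kind of mapping described above, under the assumption that the OCR stage already emits one symbol per glyph. Unicode chess figurines stand in here for the glyphs a user would select in the PDF; the table `SYMBOL_TO_PGN` and the function `to_pgn` are hypothetical names, not part of any existing tool.

```python
# Hypothetical training result: each source symbol mapped to its PGN text.
# Unicode figurines stand in for glyphs the user would pick in the PDF.
SYMBOL_TO_PGN = {
    "♔": "K", "♕": "Q", "♖": "R", "♗": "B", "♘": "N",
    "♚": "K", "♛": "Q", "♜": "R", "♝": "B", "♞": "N",
    "×": "x",  # multiplication sign sometimes printed for captures
}


def to_pgn(text: str, mapping: dict[str, str] = SYMBOL_TO_PGN) -> str:
    """Replace every mapped symbol; leave unmapped characters untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


print(to_pgn("1.e4 e5 2.♘f3 ♞c6 3.♗b5 a6 4.♗×c6"))
```

Because the table is plain data, a mapping trained on one book could in principle be saved and reused for other books from the same publisher.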
Convert pdf into a chess format

Hello nbi1,
Yes, what you explained in the previous message was nearly my idea when I analyzed how to program it...
But I do not have enough hours in the day (days would be better if they had 36 hours :-D)
I have just started developing a new application (a Java music player), so at the moment I will not have time to start that new version of ChessPdfBrowser.
If you are familiar with Java development, some months ago I started to work on a PDF library that extracts the images of characters. The library is finished and ready to be used, but there is still a lot of work to do on the OCR and on the modified text extractor that takes those images into account.
I have analyzed it and found that it is feasible, but it is not a trivial task, so I will have to work many hours to finish the new version of ChessPdfBrowser.
That will probably be next year.
At the moment, if you want to take a look at the PDF library, you can download it from:
https://www.frojasg1.com:8443/downloads_web/downloadServletv3?file=pdfInspector.v1.0&origin=chess.com&language=English

Thanks. I think we're on the same wavelength with regards to the effort required to implement this. Recently I've been viewing the most up to date Tesseract documents to get a sense of any improvement since I last looked at Tesseract some 10 years ago. Sadly it's as painful and tedious now as it was back then. I naively hoped someone might have created a language file for chess figurine algebraic, but no such luck. I also considered a design without an OCR base such as an implementation using ImageMagick, but that is a herculean task which would likely yield a fragile one-off solution. I congratulate you for recognizing the need and working on a solution. In all the years that PGN has been around you seem to be the first developer to tackle this project. I completely understand the time demands made by such an undertaking. The bad state of affairs with Tesseract and other OCR software pretty much guaranteed that I couldn't get involved. I do have more time now so if I'm feeling ambitious or just struggling with insomnia I might take another stab at Tesseract.

Tesseract detects text characters just fine, but its inability to cope with figurine algebraic is a showstopper. It does detect individual figurine objects, but it does not interpret them in a consistent way. Three instances of a bishop occurring in the same paragraph could be interpreted in 3 different ways due to slight differences in the pixel layout. That makes the whole exercise useless as going through an entire document to correct misinterpretations is a non-starter. This problem could be solved by a filter function having the logic "interpret as x any object reasonably close to the specified image xyz". In other words be able to specify the attributes of what constitutes a match, but Tesseract does not have such a capability. As I'm not inclined to rewrite and/or enhance Tesseract I guess I'll need to look elsewhere.
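To make the "reasonably close" filter concrete, here is a rough sketch. It assumes each detected object has already been cut out as a binary bitmap of the same size as the reference images; the names `mismatch_ratio`, `interpret` and the `tolerance` parameter are invented for illustration and have nothing to do with Tesseract's own API.

```python
from typing import Optional

Glyph = list[list[int]]  # binary bitmap: 1 = ink, 0 = background


def mismatch_ratio(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that differ between two same-sized bitmaps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("glyphs must have identical dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("glyphs must not be empty")
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def interpret(glyph: Glyph, references: dict[str, Glyph],
              tolerance: float = 0.15) -> Optional[str]:
    """Return the label of the closest reference within tolerance, else None."""
    best_label, best_ratio = None, tolerance
    for label, reference in references.items():
        ratio = mismatch_ratio(glyph, reference)
        if ratio <= best_ratio:
            best_label, best_ratio = label, ratio
    return best_label


# Toy 3x3 bitmaps, purely illustrative shapes.
BISHOP = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
ROOK = [[1, 0, 1], [1, 1, 1], [1, 1, 1]]
noisy_bishop = [[0, 1, 0], [1, 1, 1], [0, 1, 1]]  # one pixel off
print(interpret(noisy_bishop, {"B": BISHOP, "R": ROOK}))
```

The point of the `tolerance` knob is that three slightly different bishops all map to the same label, while anything too far from every reference is reported as unknown instead of being guessed.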

A poster recommended https://ebook.chessvision.ai. We discussed it here.
https://www.chess.com/forum/view/general/forward-chess-chess-studio-e-chess-books-chess-king-chessable?page=2
We have yet to hear back from more chess.com users with feedback. It won’t convert your legally-owned PDFs to PGN files that you can upload to your own database or engine, but you can use their interface for analysis and to run Stockfish.
Thanks for your help.
This just picks off the diagrams from a PDF. While that is a very helpful feature it's not what some of us are after which is converting figurine algebraic notation to PGN. I really don't care about the diagrams because if you have the PGN it's easy enough to insert/generate diagrams at any point in the game. It's the generation of the PGN that's the primary issue.

I think I have thought of a way to detect many slightly different instances of the same piece figurine using a single model.
I still have to confirm that it works, but it seems a good way to do the job:
First of all, we have to process each figurine and discard the blank regions at the top, bottom, left and right.
If the results are high-definition images (above 25 x 25 pixels), we can then resize them to a low-definition summary, say 10x10 pixels.
We will suppose that the same processing has been applied to the model figurines against which we compare.
Then, for summaries (10x10 pixels) of the same piece figurine, the mean square error should ideally be low enough to identify the right model figurine image.
If we do not have high-definition images of the figurines, we can still discard the blank regions in the same way as before. If the sizes of the input figurine and the model figurine are almost the same, we can do a kind of correlation between the images by moving the smaller image inside the larger one pixel by pixel, calculating the mean square error at each position, and keeping the lowest value.
I will program it next year, with the new version of ChessPdfBrowser, but maybe somebody will want to check whether it works.
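For anybody who wants to check the idea, the steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: images are plain nested lists of grayscale values (0.0 = white, 1.0 = black ink), the resize is simple box averaging, and every function name is made up for the example. It is not the ChessPdfBrowser implementation.

```python
Image = list[list[float]]  # grayscale, 0.0 = white background, 1.0 = black ink


def crop_blank(img: Image, threshold: float = 0.1) -> Image:
    """Discard blank rows and columns at the top, bottom, left and right."""
    rows = [r for r, row in enumerate(img) if any(p > threshold for p in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] > threshold for row in img)]
    if not rows:
        raise ValueError("image is blank")
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def downscale(img: Image, size: int = 10) -> Image:
    """Resize to size x size by averaging the source pixels under each cell."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(size):
        r0 = i * h // size
        r1 = max((i + 1) * h // size, r0 + 1)
        line = []
        for j in range(size):
            c0 = j * w // size
            c1 = max((j + 1) * w // size, c0 + 1)
            block = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            line.append(sum(block) / len(block))
        out.append(line)
    return out


def summary(img: Image, size: int = 10) -> Image:
    """Cropped, low-definition summary of a figurine image."""
    return downscale(crop_blank(img), size)


def mse(a: Image, b: Image) -> float:
    """Mean square error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


def best_model(figurine: Image, models: dict[str, Image]) -> tuple[str, float]:
    """Return (label, error) of the model whose summary is closest."""
    target = summary(figurine)
    errors = {label: mse(target, summary(model)) for label, model in models.items()}
    label = min(errors, key=errors.get)
    return label, errors[label]


def min_sliding_mse(small: Image, big: Image) -> float:
    """Low-definition case: slide `small` inside `big`, keep the lowest error.

    Assumes `small` fits inside `big` in both dimensions.
    """
    sh, sw = len(small), len(small[0])
    return min(
        mse(small, [row[dx:dx + sw] for row in big[dy:dy + sh]])
        for dy in range(len(big) - sh + 1)
        for dx in range(len(big[0]) - sw + 1)
    )
```

In practice a maximum acceptable error would still be needed, so that a symbol matching none of the models is rejected rather than assigned to the least bad one.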

I have to say that I am not an OCR expert, and with the described procedure, differences in luminance would not be detected ... But it can be a starting point.

I was thinking along similar lines. Tesseract is very good with standard text, but struggles with symbols. So I was wondering if there is a way to mitigate this via image processing. ImageMagick to the rescue! The strategy is to find all the instances of figurines/symbols and replace them with their equivalent text. The resulting all text image(s) can then be parsed with the default Tesseract language support. This approach is very declarative in that the sub images to be looked for and their replacement text need to be specified manually (I'm using Gimp to do this) and that is rather tedious, but I'm just in the proof-of-concept phase. Once I have the various pieces stitched together into a working solution I'll focus on identifying or creating a tool to make the figurine/symbol declaration easier so this isn't an ordeal whenever a new source (different fonts, different symbols) needs to be parsed. Unfortunately there are other things of a higher priority making demands on my time so my progress on this may be slow.
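The "find and replace" step of this strategy might look roughly like the sketch below. It does not call ImageMagick or Tesseract at all; it only shows the declarative idea on toy binary bitmaps, and the names `find_occurrences` and `plan_replacements` are hypothetical. Painting the replacement text into the image and running the OCR pass are left out.

```python
Bitmap = list[list[int]]  # 1 = ink, 0 = background


def find_occurrences(page: Bitmap, symbol: Bitmap,
                     max_mismatch: int = 0) -> list[tuple[int, int]]:
    """Return (row, col) of every window of `page` that matches `symbol`."""
    sh, sw = len(symbol), len(symbol[0])
    hits = []
    for r in range(len(page) - sh + 1):
        for c in range(len(page[0]) - sw + 1):
            diff = sum(page[r + i][c + j] != symbol[i][j]
                       for i in range(sh) for j in range(sw))
            if diff <= max_mismatch:
                hits.append((r, c))
    return hits


def plan_replacements(page: Bitmap,
                      declarations: list[tuple[Bitmap, str]]) -> list[tuple[int, int, str]]:
    """For each declared (sub-image, text) pair, list where to put the text."""
    plan = [(r, c, text)
            for symbol, text in declarations
            for r, c in find_occurrences(page, symbol)]
    return sorted(plan)


# Toy example: a 2x2 "knight" shape appearing twice on a tiny page.
KNIGHT = [[1, 1], [1, 0]]
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0],
]
print(plan_replacements(PAGE, [(KNIGHT, "N")]))
```

Each entry in `declarations` corresponds to one manually declared figurine or symbol, which is the tedious part a dedicated tool would be meant to ease.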

Hello,
Maybe this arrives a little late, but I have just released a new version of an application that does the job (ChessPdfBrowser v1.20).
If you are curious about it, you can visit the official site:
https://chesspdfbrowser.com?origin=chess.com
I hope it is useful to somebody.
I love the concept but would like to have something like this as an app for iPad!!

Hi,
A new version of ChessPdfBrowser (v1.26) has just been released.
With this new version:
* You will be able to extract games from PDFs in figurine algebraic notation.
* It also includes some improvements and some bug fixes.
You can read more and download the application at:
https://chesspdfbrowser.com/?origin=chess.com

Sorry, the first binary had a horrible bug due to an inconsistency in its dependencies.
Some bugs have now been fixed.
You can download the new binary at:
https://chesspdfbrowser.com/?origin=chess.com

Sorry, but it doesn't seem to do anything. As a test I tried to scan the first game of "My 60 Memorable Games". The scan completes without any error messages, but no output is generated. So I thought it must be processing, but when I checked my task monitor there was nothing going on. So it just seems to fail silently. Could it be blowing up on the figurine algebraic in the source pdf? I noticed there's a "configure figurine" button on the scan page which opens 2 windows: pdf view of the source and "select figurine detector....". The latter already has a detector selected which I can't change. So my guess was that this part of the software is for manually training the OCR on the figurine symbols, but the pdf viewer has no capability for doing that as far as I can tell. So I don't know what's going on - cockpit error, bugs, or both?

Hi nbi1,
The latest ChessPdfBrowser version has a number of bugs, and I am now working on a new release that will hopefully fix most of those errors.
I do not know what is happening with your PDF, but I think the application should have returned some kind of result from the chess games scan.
If you are interested in extracting the games from your PDF, and if that PDF is not copyrighted, you can share it with me, and then I will be able to see what is happening.
And if I find any bugs, I will try to fix them before the new release is delivered.
The fastest way would be to share the PDF with me.
You can send it to this e-mail address: frojasg1@hotmail.com
I hope that helps

After some weeks of hard work, a new release of the application is available.
Some good improvements have been made to game extraction, achieving an extraction success rate of about 80 - 85% (measured as the share of games without illegal moves after extraction).
You can download the new release at the following link:
Download the new version of ChessPdfBrowser
I am planning to program some new improvements, which will be available in future releases:
* A new function for extracting all diagrams in a range of pages, associating them with their game metadata, if any.
* A new function for showing the position of the selected move in the PDF navigator window, with the possibility of saving the extracted games with that extra information in a new file format.
That will make it easy to navigate through the game moves and identify them in the PDF.
Please feel free to send the programmer any comments you think are interesting.
You can e-mail me at: frojasg1@hotmail.com

Hi,
I wanted to let you know that I have just made available a new release of the application, which adds the function of showing the current move in the PDF window and offers the possibility of selecting a move in the PDF window to be shown on the chess board.
There is a demo video showing how the new functionality works:
Demo of the latest new functionality
The new version already included the function for extracting games in figurine algebraic notation
There is another demo video that shows how to make it work:
Demo of the figurine algebraic notation games extraction
You can download the new version of the application here:
Download the new version of ChessPdfBrowser

Hello, chess players!
A new version of the ChessPdfBrowser application is available.
It adds some new functions, including:
- Extraction of chess diagrams (demo)
- Possibility of opening more than one file of chess games.
- Some improvements to the game list table (such as a new option for sorting by a column, and another for filtering games)
If you want to download it, you can visit this website:

Hi again,
Yesterday a new delivery of ChessPdfBrowser.v1.26 was released.
It now works better, as the PDF library it uses has been updated to its latest release.
Besides that, a new page layout detector has been programmed, which might work better with more complex layouts.
The position recognizer has been improved to include a basic check of whether the training data is wrong, so that trainings with detected wrong data can be discarded. That is definitely good news for avoiding wrong detections by the FEN position recognizer.
The diagram extractor has also been improved: it is now possible to limit the number of text lines processed before and after the diagram to determine whose turn it is.
In addition, the command line version of the application has been updated, making it easy to automate chess game extraction.
For instance, I have tested it with a set of more than 90 PDFs, and all results were produced in less than two hours, with no other user actions.
It is true that after that automatic extraction, a manual check might be needed in order to train the FEN position recognizer more accurately on all PDFs, and then to repeat the automatic extraction if needed, which will hopefully benefit from the manual FEN position recognizer training and produce better results.
I now consider that the application works acceptably well, and although not all the extractions are perfect, I think they are quite good as a first approximation that can be refined manually.
If you are interested, you can download the application from this web site:

Hi chess lovers,
A new version of ChessPdfBrowser.v1.27 has been released
Now the user interface is translated into:
- English (native)
- Spanish (native)
- Catalan (native)
And automatically translated into ten new languages:
- French
- German
- Portuguese
- Italian
- Greek
- Russian
- Japanese
- Chinese
- Hindi
- Arabic
Feedback about the automatic translations is welcome
You can download it at:
The application does not work with figurine notation at the moment, but I am planning to improve it to handle that kind of notation too.
Maybe that will be next year.