My attempt at scraping chess.com

BronsteinPawn

I'm homeless and need @erik to hire me. To prove myself worthy I will try to scrape chess.com.

My goal is to get the PGN of a game. I know you can just download it, but no, I'll do it the hard way.

 

This is the chess.com game that I will try to scrape. The first game ever on chess.com servers!
https://www.chess.com/daily/game/1

 

I will be using Java and a library called JSOUP. I still have my failed attempt at scraping in NetBeans. It uses JSOUP to fetch the HTML document from the webpage and then tries to extract the PGN from it.

 

It seems I hit a depressive episode and deleted all the code for getting the PGN, though, so I will have to rewrite that. The very simple code that gets the HTML document still works. It looks like this:

[Image: screenshot of the Java code]

And what I get looks like this when I tell Java to print it out:

[Image: screenshot of the printed HTML]

You will have to zoom in on that one, but that is the HTML code!!

Wait for my next post, where I explain my understanding of chess.com's HTML structure and how I will try to scrape it!

BronsteinPawn

Post 2 - My understanding of Chess.com's HTML code.

If I understand correctly, the chess.com moves are displayed in a <div id=moveList_vertical> container. Inside the container there is one <span> per move.

The <span> has an id containing the number of the half-move it represents.

For example, 1. e4 would be <span id="movelist_1"> and 3. Bb5 would be <span id="movelist_5">.

And inside that span there is an <a> tag with a very similar id, which contains the move itself.

[Image: screenshot of the <a> tag containing the move]

"Sf3" is Nf3

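A note on the "Sf3" above: S is the German letter for the knight (Springer), so the page was presumably rendered with German piece letters. If the scraped moves are meant to end up in a standard English PGN, they would need converting. Here is a minimal sketch of that conversion, assuming German notation is what the site returns; the class and method names are made up for illustration.

```java
/** Hypothetical helper: converts German piece letters in one move to English ones. */
final class GermanSan {
    private GermanSan() {}

    static String toEnglish(String germanMove) {
        StringBuilder out = new StringBuilder(germanMove.length());
        for (char c : germanMove.toCharArray()) {
            switch (c) {
                case 'S': out.append('N'); break; // Springer -> Knight
                case 'L': out.append('B'); break; // Laeufer  -> Bishop
                case 'T': out.append('R'); break; // Turm     -> Rook
                case 'D': out.append('Q'); break; // Dame     -> Queen
                default:  out.append(c);          // K, files, ranks, x, +, #, O-O stay as they are
            }
        }
        return out.toString();
    }
}
```

Files are lowercase and the king is K in both languages, so a per-character swap of the four uppercase letters is enough for plain moves, captures, castling and promotions such as e8=D.
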
skelos

It might be just as productive to file an enhancement request for access to a single game by ID.

If thought necessary, it could be cached (and thus delayed a little) and/or not permitted for live games.

BronsteinPawn

Post 3 - Figuring out if there is any difference between the HTML that I am getting and the HTML on the web.

I am so dumb, this is shameful. Now that I think about it, the HTML that I get from JSOUP doesn't look like the HTML I see in the browser. I will have to go through the whole HTML document that JSOUP gives me and look for the <div> containing the moves.

 

And that is my first problem. The HTML that Google Chrome shows lets me look deeper into the <div> that contains the moves. The HTML that I get from JSOUP shows the <div> but not what's inside it.

 

The HTML JSoup gives me:

[Image: screenshot of the HTML from JSoup]

 

The HTML Google Chrome gives me:

[Image: screenshot of the HTML in Google Chrome]

Is this what they meant by "HTML scraping is not supported"? Is this the end of this short thread? I hope not. Anyways, now that I have found my first problem I will go silent for a while.

BronsteinPawn
skelos wrote:

It might be just as productive to file an enhancement request for access to a single game by ID.

If thought necessary, it could be cached (and thus delayed a little) and/or not permitted for live games.

That is a good idea. I like it! 

BronsteinPawn

So it seems the reason I am not seeing the rest with JSOUP is that it only fetches the HTML and does not run JavaScript, and I guess JavaScript is what adds those moves. Supposedly Selenium will work. I am going to try that now.

BronsteinPawn

Part 4 - Forget about the API, we have Selenium

 

Just installed Selenium in NetBeans and it's CRAZY. They were really serious when they said it was automatic!!!

With their Chrome driver and their navigate().to(yourUrl) method I can make it open a Chrome window and navigate directly to our game without me having to do anything!
Oh my god lol, at this point I can probably just make it click on the download PGN button for me lol!

BronsteinPawn

Part 5 - Yep, I can click on it

Beautiful. I can click on it. The method is the same: you create a WebElement and assign its value by selecting something from the HTML. You can select it by id, class, XPath, etc.

Google Chrome's developer tools let you click on an object inside the HTML and copy its XPath, so I went that way since it's supposedly the most exact approach.

 

I was making the rookie mistake of navigating to the page and then immediately trying to click the button. I was getting an "element not visible" exception and thought it was because the path was wrong; in reality it is because my internet sucks and the webpage can't load in a millisecond. I fixed this by using:
Thread.sleep(timeInMilliseconds)

to let the page load before asking it to click the download button.
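A fixed sleep either waits too long or not long enough, depending on the connection. An alternative is to retry the lookup in short intervals until it succeeds or a deadline passes. Selenium ships its own explicit waits (WebDriverWait) for this; the sketch below is a library-free version of the same idea, and the Waiter class, its method and its parameters are hypothetical names, not part of Selenium.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: retries an action until it yields a value or time runs out. */
final class Waiter {
    private Waiter() {}

    static <T> T waitFor(Supplier<T> attempt, long timeoutMillis, long pollMillis) {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                T result = attempt.get();
                if (result != null) {
                    return result; // the element (or whatever we wanted) is there
                }
            } catch (RuntimeException notReadyYet) {
                // Treated as "page not ready yet"; try again after a short pause.
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                throw new IllegalStateException("Timed out after " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }
}
```

With Selenium, the supplier passed in would be the element lookup itself, so the click happens as soon as the button exists instead of after a guessed delay.
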

BronsteinPawn

Part 6 - I am now able to scrape any chess.com game that has ended.

I just needed 3 methods: one that navigates to the game, one that clicks the download PGN button, and one that retrieves the text from the TextArea in the PGN popup and returns it as a String. I will create a simple GUI and then get on my way to scraping real-time games, which shouldn't be hard.

 

 

bcurtis

We put a lot of effort into creating the API so that you wouldn't have to spend hours figuring this out — just use a URL and done. So I'm very curious about why you are doing this the hard way, which uses a lot of resources on your computer and takes our server much longer to handle.

BronsteinPawn
bcurtis wrote:

We put a lot of effort into creating the API so that you wouldn't have to spend hours figuring this out — just use a URL and done. So I'm very curious about why you are doing this the hard way, which uses a lot of resources on your computer and takes our server much longer to handle.

I want to scrape in real time, but I first need to learn the basics with Selenium. I also like scraping, so I'm just playing around. As you mentioned, this is not a "good" way to do it, and since all of this is available through the API it should be done there, because if you modify the HTML this won't work anymore.

BronsteinPawn

Part 7 - Officially able to scrape live moves.

Since I don't know how much the developers like my scraping, instead of scraping a live game I played around with:

https://www.chess.com/play/computer

I used that page because the Chrome window opened by Selenium is not logged in. The HTML structure is pretty much the same, so technically I should be able to scrape in real time if I wanted to. The way I did it is by just scraping the <a> that contains the move. The id of that tag changes with every half-move: the number at the end goes up by one.

 

The id for 1.e4 would be:

vertical_moveListControl_gotomoveid_0_1

 

While the id for 1...e5 would be:
vertical_moveListControl_gotomoveid_0_2

 

Once we hit Black's 50th move it would be:

vertical_moveListControl_gotomoveid_0_100
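The pattern in those three ids can be written down as a small helper: White's move n is half-move 2n - 1 and Black's move n is half-move 2n. This is a minimal sketch; the id prefix is the one observed above, while the class name, the method names and the XPath shape (an <a> selected by its id) are assumptions for illustration.

```java
/** Hypothetical helper that builds the element id for a given half-move (ply). */
final class MoveIds {
    // Prefix as observed in the page; it will change if chess.com changes its HTML.
    static final String PREFIX = "vertical_moveListControl_gotomoveid_0_";

    private MoveIds() {}

    /** White's move n is ply 2n - 1, Black's move n is ply 2n. */
    static int ply(int moveNumber, boolean white) {
        return white ? 2 * moveNumber - 1 : 2 * moveNumber;
    }

    static String elementId(int ply) {
        return PREFIX + ply;
    }

    /** XPath for the <a> tag holding the move, assuming it is selected by id. */
    static String xpath(int ply) {
        return "//a[@id='" + elementId(ply) + "']";
    }
}
```

So MoveIds.elementId(MoveIds.ply(50, false)) gives the id ending in 100, and White's 50th move would end in 99.
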

 

So basically, to scrape it I just get the element by id or XPath (in this case I use XPath, but id should also work) and repeat the task over and over with a Java timer. If it can't find the element because the move hasn't been played yet, it just catches the resulting exception and waits to repeat the task again. I haven't tried it long term to see how hard it hits the resources.
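The polling loop described here could look roughly like the sketch below. It is not the code from this thread: the browser lookup is hidden behind a hypothetical readMove function (with Selenium it would be the element lookup plus reading its text), and a ScheduledExecutorService stands in for the Java timer. A missing element is assumed to surface as a RuntimeException, which is the case for Selenium's NoSuchElementException.

```java
import java.util.Optional;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.IntFunction;

/** Hypothetical poller: asks for the next half-move and only advances once it shows up. */
final class MovePoller {
    private final IntFunction<String> readMove; // returns the move text for a ply, or throws if absent
    private int counter = 1;                    // the half-move we are currently waiting for

    MovePoller(IntFunction<String> readMove) {
        this.readMove = readMove;
    }

    /** One attempt: returns the move if it has been played, otherwise empty. */
    Optional<String> pollOnce() {
        try {
            String move = readMove.apply(counter);
            if (move == null || move.isEmpty()) {
                return Optional.empty();
            }
            counter++; // move found, look for the next one on the following attempt
            return Optional.of(move);
        } catch (RuntimeException notPlayedYet) {
            return Optional.empty(); // element missing: the move has not been played yet
        }
    }

    int nextPly() {
        return counter;
    }

    /** Repeats pollOnce with a fixed pause between attempts and hands each new move to onMove. */
    ScheduledFuture<?> start(ScheduledExecutorService scheduler, long periodMillis,
                             Consumer<String> onMove) {
        return scheduler.scheduleWithFixedDelay(
                () -> pollOnce().ifPresent(onMove), 0, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

Keeping the single attempt in pollOnce separate from the scheduling makes it easy to try the logic against a fake move list before pointing it at a real page. How often to poll is a trade-off between latency and load on both the local machine and the server.
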

 

To make it a full PGN I need to add the move number in front of the move. So if I scrape e4 I need to add "1." before it. But if I scrape Black's reply I don't need to add anything.

 

I keep a variable called counter which keeps track of the move my program is trying to scrape. It starts at 1 since I am trying to scrape move 1, then it becomes 2 once I have scraped move 1 and need move 2, and so forth. Since I keep this variable, I just need to add a move number if counter is odd and nothing if it is even. I use a TextArea in JavaFX and just add the moves with the appendText method.
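The counter rule can be sketched like this. It is an illustration, not the thread's actual code: a StringBuilder stands in for the JavaFX TextArea, and the class and method names are made up. The move number for an odd counter is (counter + 1) / 2.

```java
/** Hypothetical builder for the PGN movetext, driven by a half-move counter. */
final class MoveTextBuilder {
    private final StringBuilder text = new StringBuilder();
    private int counter = 1; // 1 = White's first move, 2 = Black's first move, ...

    void add(String move) {
        if (text.length() > 0) {
            text.append(' ');
        }
        if (counter % 2 == 1) {
            // Odd counter: White is moving, so the move number goes in front.
            text.append((counter + 1) / 2).append(". ");
        }
        text.append(move);
        counter++;
    }

    String moveText() {
        return text.toString();
    }
}
```

Adding e4, e5 and Nf3 in that order produces the text 1. e4 e5 2. Nf3. A complete PGN file also carries tag pairs such as [Event "..."] and a result, which this sketch leaves out.
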

 

Now I just need to find a way to add speech dictation to it. I will probably never maintain the code once I reach my goal, so once I compile the .jar and post it on my beautiful empty GitHub page, whoever is interested in my creation will have to change what Selenium scrapes if the chess.com HTML structure is modified!

knightburgler

I started a scraping project on Chess.com several months ago. I was interested in member profile information. But during development, I ran into a robots.txt file asking me to go no further. I was respectful and dropped the project. Being respectful when web scraping is important.