How to store chess openings in a relational database?

Sqod

Follow-up...

In case I don't have time to post a more technical discussion today, here are the three stories I was going to tell about my experiences of people needing a graph analysis tool instead of a database:

 

(1) In 2003 I was on loan to a company that was working on a project to analyze communication patterns. They showed me a database of communications where each entity initiating or receiving a communication was identified only by a several-digit code number, along with the times the communication started and stopped. I was told that the actual data was classified, but that my general work and upcoming report were not. They told me they wanted to know what patterns I could find in the data. That was a fun project since it was so open-ended. It was as if they didn't even know how to approach the problem; they were basically telling me, "Tell us anything you can about this mess of data."

 

I began by asking myself obvious questions that related to the implied graph itself. Who was initiating the most communications? Graphically this would correspond to the node with the most *outgoing* links. This roughly corresponds to the concept of a "hub" in network science (https://en.wikipedia.org/wiki/Hub_(network_science_concept)), although the Wikipedia definition does not consider whether such links are outgoing or incoming. Who was receiving the most communications? Graphically this would correspond to the node with the most *incoming* links. Was everybody in a contact line with everybody else, meaning a message starting from one entity could have been broadcast to everyone, or were there disconnected groups that never communicated outside their circle? Graphically that would correspond to independent, non-intersecting subgraphs. All such SQL queries were relatively easy to do.
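Those degree-type questions map naturally onto GROUP BY. A minimal sketch of what I mean, in Python with sqlite3 and a made-up comms(src, dst, started, stopped) table (the real schema was classified, so this is purely illustrative):

```python
import sqlite3

# Hypothetical schema: one row per communication event (not the real tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comms (src INTEGER, dst INTEGER, started TEXT, stopped TEXT)")
conn.executemany(
    "INSERT INTO comms (src, dst, started, stopped) VALUES (?, ?, ?, ?)",
    [
        (101, 202, "2003-05-01 08:00", "2003-05-01 08:05"),
        (101, 303, "2003-05-01 09:00", "2003-05-01 09:02"),
        (202, 101, "2003-05-02 08:00", "2003-05-02 08:01"),
    ],
)

# Who initiates the most communications? (the node with the most outgoing links)
top_senders = conn.execute(
    "SELECT src, COUNT(*) AS outgoing FROM comms GROUP BY src ORDER BY outgoing DESC"
).fetchall()

# Who receives the most? (the node with the most incoming links)
top_receivers = conn.execute(
    "SELECT dst, COUNT(*) AS incoming FROM comms GROUP BY dst ORDER BY incoming DESC"
).fetchall()

print(top_senders)    # [(101, 2), (202, 1)]
print(top_receivers)  # each of 101, 202, 303 received one communication
```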

 

The company had a SQL expert on whom I relied for the more difficult queries, especially those that required subqueries. Toward the end of my project I began coming up with more ideas relating to network topology, such as cycles, and also the concept of periodic patterns, such as a daily broadcast from a single entity. At this point the SQL expert balked, and said those would be very difficult queries for SQL, which to me suggested he didn't know how to do them. That made an impression on me: that SQL doesn't work well with graph-type domains that require following nodes around to determine network topology. In other words, SQL was the wrong tool for that job.
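For what it's worth, the closest standard SQL comes to "following nodes around" is a recursive query. Even a basic reachability question already needs one, and enumerating cycles or daily broadcast patterns only gets uglier from there. A rough sketch, again with a made-up two-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comms (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO comms VALUES (?, ?)", [(1, 2), (2, 3), (3, 1), (3, 4)])

# Everyone reachable from entity 1 by following links.  Even this simple
# question needs a recursive query, and it still says nothing about which
# cycles exist or how long they are.
reachable = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 1
        UNION
        SELECT c.dst FROM comms AS c JOIN reach AS r ON c.src = r.node
    )
    SELECT node FROM reach
""").fetchall()
print(reachable)  # entities 1 through 4 are all reachable from entity 1
```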

 

(2) Around 2011 a business I passed in La Jolla, California caught my eye because its name suggested it did software work with semantics and textual data. That was odd, partly because that was the type of work I knew well, and partly because all the businesses in that area, only two blocks from the beach, catered to affluent tourists: Lamborghini and Porsche dealerships, restaurants, tuxedo stores, hotels, etc. Their business really seemed incongruous, more suitable for a tech center like Silicon Valley. I went in to inquire about their business and found out that--sure enough--they did software work related to A.I.: their software system used Hebbian learning, as biological neurons do, and they were applying it to analyze text, since Hebbian learning can cluster text.

 

I was familiar with Hebbian learning due to my A.I. background. In Hebbian learning a weight (w)--a single value--becomes associated with the link between every two nodes through repeated exposure (statistical learning). The man told me they might have work for me but that they were waiting on a "big check" that was expected soon from some big investors. He said their current work was mostly in reading their clients' data files and converting them to an XML format so that the company's software could operate on them, so that was the work for which they could initially use me. If their software could find the types of patterns a prospective client wanted, that would lead to a customized version of their software that they would sell to the client.

He told me that their company had once been approached by a military person to find out if the company's software could help analyze some complicated data the military had. It was another case of "We have this mess of data. Can you tell us anything about it?" Unfortunately, the man telling me this said their software was not well tailored to such a general task, so when run on the data the company was supplied, it did not satisfy the potential client's requirements.

Here is a rough visual comparison of the fundamental component of a Hebbian learning system (the top diagram) with the fundamental component of a directed graph system (the bottom diagram):

 

Hebbian learning looks much simpler, doesn't it? It is, and I knew that a system based on that simple structure (no directionality, limited data, and only simple numerical data) would be too simple to produce anything powerful enough to solve difficult problems like those in the bulletized list in my previous thread, all of which relied on directed graphs. I didn't say anything because I wanted the job, and besides, I figured I could talk their inventor into adding complexity if it were needed during customization.
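To make "simpler" concrete: the entire update rule fits in a few lines. A toy sketch of the general idea (my own illustration, not their software):

```python
import itertools

# Toy Hebbian learning: one weight per (undirected) pair of nodes,
# strengthened whenever both nodes are active in the same observation.
nodes = ["A", "B", "C"]
weights = {pair: 0.0 for pair in itertools.combinations(nodes, 2)}
learning_rate = 0.1

observations = [
    {"A": 1, "B": 1, "C": 0},  # A and B active together
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 1, "C": 1},  # B and C active together
]

for activity in observations:
    for (i, j) in weights:
        # "Cells that fire together wire together": co-activity raises the weight.
        weights[(i, j)] += learning_rate * activity[i] * activity[j]

print(weights)  # ('A', 'B') ends up strongest -- statistical clustering, nothing more
```

No directionality and no structure on the link beyond that single number--which is exactly the limitation I was worried about.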

From what I just told you, most of you can figure out how the rest of the story goes: their "big check" never arrived, they stopped responding to my e-mails, and soon their store disappeared, all of which suggests they went out of business--ultimately, in part, because their software could not tackle the general graph analysis problems that people needed to solve.

(3) In 2010 I worked on a commercial project for predicting heart attacks. Given the standard, non-invasive body measurements that doctors routinely gather--namely blood pressure, heart rate, and body temperature--our task was essentially to use historical data from known heart attack victims to predict whether a given patient was going to have a heart attack. As you can imagine, this was an extremely difficult task. After all, if it were that simple, somebody likely would have figured it out long ago and would already be using it.

One of the key problems for us was that there was no publicly available model of the human body, so nobody, not even doctors, could tell us what would happen if one value rose while another fell, for example. Without some mathematical model the data could not be mapped to a diagnosis, and statistical methods, even A.I.-based, seemed to be at a loss, probably because there were no evident patterns (plus we figured we needed more data from more invasive techniques). Eventually we found out that there did exist some makeshift models in the form of artificial human bodies on which doctors trained, but such state-of-the-art training aids and the programmers who programmed them were unknown and unavailable to us.

Our software people discussed the problem and I lamented that no general system analysis software seemed to exist whereby an engineer could just attach sensors to a given system--whether an aircraft component, human body, or 3-variable math equation--and have the software figure out how it worked based on empirical evidence, and maybe propose a model. This seemed yet another case of "We have this system that can provide us data, but how can we interpret patterns from it?" It didn't seem that hard to produce such a system, but maybe it was. Even if the human body had some component analogous to a capacitor in an electrical circuit that delayed the effect of a change, it seems that its presence could be detected or at least hypothesized by looking at enough sensor readings over time.
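The crudest version of what I was picturing: scan pairs of sensor channels for the time lag at which one best predicts the other, and treat a strong best-fit lag above zero as a hint that some delaying component sits between them. A toy sketch with made-up data:

```python
# Toy system-identification idea: find the time lag at which one sensor channel
# best correlates with another; a strong best-fit lag above zero hints that some
# delaying component sits between them (made-up data, purely illustrative).

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def best_lag(input_series, output_series, max_lag):
    scores = {}
    for lag in range(max_lag + 1):
        xs = input_series[: len(input_series) - lag]
        ys = output_series[lag:]
        scores[lag] = correlation(xs, ys)
    return max(scores, key=scores.get), scores

# Fake sensors: the "output" channel echoes the "input" three samples later.
inp = [0, 0, 1, 0, 0, 2, 0, 0, 3, 0, 0, 4, 0, 0, 5, 0, 0, 0, 0, 0]
out = [0, 0, 0] + inp[:-3]

lag, _ = best_lag(inp, out, max_lag=5)
print(lag)  # 3 -- evidence of a delayed, capacitor-like effect between the channels
```

Real physiology is obviously far messier than an echo with a fixed delay, but that is the flavor of "hypothesize a component from enough readings over time" I had in mind.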

Anyway, those are my stories. In just three events the picture that seems to be emerging is: (1) not even the military has a software system that can tackle the general problem of analyzing directed graph data, which suggests that no such software exists (although these experiences were admittedly at least several years ago); (2) such a software system is much needed in every field, whether chess, medicine, or warfare; (3) SQL databases are the wrong tool for the job, especially for chess move analysis.

Of course I could be wrong about all this, but this set of stories is partly why I'm getting pessimistic about a database doing what I want for chess. Anyway, enough of that conjecture. Later I'll post more specific examples of the types of information I want from chess game analysis, and we'll see if anybody can figure out the more concrete, down-to-earth problem of supplying a SQL statement that will accomplish it.

Sqod

Update...

Per private correspondence, one person here confirmed my suspicion was correct: relational databases are absolutely the wrong tool for what I want to do, so I gave up on this project... sort of. I'm still planning to build a very simple smaller database for experimentation (such as for holding structured data and positional attributes of interest) but I'm not going to even try to get a conventional database to do tree searches. The headaches and contortions are pointless. I'd like to start a blog here, if I can ever justify the time, and probably my first article will be about some insights I got from trying to force a conventional (= relational) database to do what I want with chess moves. Maybe someday I'll find some software tool that is focused on search of trees or directed graphs, which I might be able to modify for chess needs, but that or writing my own graph-searching software are about the only things that would motivate me to tackle such a project again.

 

 

skelos

Interesting thread. Thanks to all who posted. I will now be wary of trying anything similar to what the original poster intended with a relational database. (I'm no database guru, and know it; I do know more about them than some self-proclaimed experts I've worked with. Always ... challenging, that.)

As a marginally related point: I found a position where chess.com's opening database incorrectly "recognised" a transposition: it failed to take into account the option of an en passant capture. I wonder which other opening databases share this limitation, but not enough to investigate; I'll merely keep it in mind. Given a database which has somehow "lost" en passant information, I imagine fixing it would require rebuilding from the original games. Yuck!

mgx9600
Hang in there! I'm still on my chessboard project too. Currently at the HW/SW integration stage; hope to see PGNs soon.
mgx9600
WRT the opening database not accounting for en passant, I might have some insight based on my personal experience. I wrote a chess engine a while back (one of my first chess projects since getting into the game). My engine tried to steer towards known opening positions in the beginning so that it could play book moves (and traps). Part of the decision I had to make was whether to ignore castling and en passant in the positions. It's not hard to either ignore them or account for them with my program, since it's only a matter of how far to terminate the FEN string. So..... the opening DB authors may have faced that same decision.
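To make that concrete, here is roughly what I mean by terminating the FEN string (the idea only, not my actual engine code):

```python
# Rough idea of "how far to terminate the FEN string" when matching positions
# against an opening book (the concept only, not my actual engine code).
fen = "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2"

fields = fen.split()  # placement, side to move, castling, en passant, halfmove, fullmove

full_key  = " ".join(fields[:4])  # keeps castling rights and the en passant square
loose_key = " ".join(fields[:2])  # ignores both -- cheaper, but allows false "transpositions"

print(full_key)   # ...RNBQKBNR w KQkq e6
print(loose_key)  # ...RNBQKBNR w
```

Keying the book on the short form is cheaper and catches more transpositions, but it will happily "recognise" two positions that differ only in an available en passant capture, which sounds like exactly what skelos ran into.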
skelos

@mgx9600 very interested in your project. DGT need the competition, and someone to design some non-ugly pieces for them. (Ugliness, like beauty, being in the eye of the beholder, but the only DGT-capable set I've liked was a special from Noj. A set that matched FIDE's own specifications size-wise would be an interesting design challenge for DGT, too.)

Back to your hardware: have you published any information yet?

mgx9600

One difference is my chessboard will allow you to use your own chessboard and pieces. It started as an underlay that detects weighted pieces (ferrous weights with Hall effect sensors), the same size as 2.25" square boards. There were some issues like lining up the sensor pad under the board (and different board thicknesses), but the main one was that it could not properly detect sliding pieces. Also, I did not like that it must use ferrous material. I've changed it to an overlay on photocells, which seems to solve those problems.

 

Here's the good part. Total material cost is about $20 per board. Lots of intelligence in the software; for example, because it doesn't know the specific chessman, it uses guesses (e.g. based on move patterns, like knights moving in an L), and captures are also guessed (e.g. with a white pawn on d4 and black pawns on e5 and c5, when the white pawn is gone on white's turn and did not reappear on d5, it must have captured one of those black pawns). The software produces many branches of the game, but they converge (e.g. if the white pawn later makes a move from e5, then we know it didn't capture on c5, i.e. 2 possibilities converged into 1).
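Roughly, the branching and convergence bookkeeping looks like this (a much-simplified sketch, not the real code):

```python
# Much-simplified sketch of the branching/convergence bookkeeping.
# Each hypothesis is just a dict of occupied squares -> piece colour.

def branch_on_ambiguous_capture(hypotheses, vacated, possible_targets):
    """A piece left `vacated` and did not reappear on an empty square,
    so it must have captured on one of `possible_targets`."""
    branched = []
    for hyp in hypotheses:
        for target in possible_targets:
            new_hyp = dict(hyp)
            del new_hyp[vacated]
            new_hyp[target] = hyp[vacated]  # the capturer now sits on the target square
            branched.append(new_hyp)
    return branched

def prune_on_observation(hypotheses, square, colour):
    """Later the sensors see `colour` moving from `square`; drop branches that disagree."""
    return [h for h in hypotheses if h.get(square) == colour]

# White pawn on d4, black pawns on c5 and e5; the d4 pawn disappears.
hyps = branch_on_ambiguous_capture(
    [{"d4": "white", "c5": "black", "e5": "black"}], "d4", ["c5", "e5"]
)
print(len(hyps))  # 2 branches: it captured on c5, or it captured on e5

# Next, a white piece is seen moving from e5, so the "captured on c5" branch dies.
hyps = prune_on_observation(hyps, "e5", "white")
print(len(hyps))  # 1 -- the two possibilities converged
```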

 

Turn detection is pretty good, but it has problems with understanding illegal moves. Also, discontinuous sliding, where the wrong piece is slid and then the right piece is slid. I'm tempted to put in some "reset" type button input, but right now I'm hoping to auto-detect with SW changes.

 

We shall see.

mgx9600

Oops, I forgot to add the processing unit or power supply into the cost : )  I've not spec'ed them out yet (currently using a dedicated microcontroller module way too fancy for this thing).  Also, I've not worried about power requirements (which, if battery-powered, may add some cost to lower power consumption).  Also, I only have a 3x3 board, so I think the full board cost is likely $100.  : )  I knew I forgot something : )

 

betadmdragon

Sorry to necro this thread, but this sort of project interests me, would be useful, and honestly all the concerns brought up have very easy solutions. Getting something like stats on when White plays Nc3 after Black fianchettos, etc., is very easy. You have a unique FEN table with tags describing the position, like fianchettos, open files, etc. You would simply merge a query on the starting move order you care about against the unique FEN table, where the FENs meet your tag criteria.

If you need to restrict it further, so that you only get the games where a certain move happens before or after another, then you take that result and merge it back against the opening database, keeping only rows where the move number of the specific move is higher or lower than the move number from the first result.
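The shape of that query would be something like the following (hypothetical table and column names, sketched in Python with sqlite3; untested beyond this toy schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per position reached in each game,
    -- plus a table of unique FENs tagged with positional features.
    CREATE TABLE game_positions (game_id INTEGER, ply INTEGER, fen TEXT, move TEXT);
    CREATE TABLE fen_tags       (fen TEXT, tag TEXT);
""")

# Games in which White played Nc3 at some point *after* a Black kingside
# fianchetto was on the board: join the move list against the tagged FENs.
query = """
    SELECT DISTINCT gp.game_id
    FROM game_positions AS gp
    JOIN fen_tags AS ft ON ft.fen = gp.fen
    WHERE ft.tag = 'black_kingside_fianchetto'
      AND EXISTS (
            SELECT 1
            FROM game_positions AS later
            WHERE later.game_id = gp.game_id
              AND later.move = 'Nc3'
              AND later.ply > gp.ply   -- the "before or after" restriction
          )
"""
print(conn.execute(query).fetchall())  # empty here; the toy tables have no rows
```

Indexes on the fen and move columns would be the obvious first optimization.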

If I played with it for a while I could find a more efficient process, but that's off the top of my head, as someone who's worked with databases doing similar things in the past.
