[Exceptional C++ Style] Item 26 -- Data Formats and Efficiency

Lois Goldthwaite lois at loisgoldthwaite.com
Sun Jan 30 19:06:51 EST 2005


These are some additional comments that occurred to me when I read the 
article. The final remarks sort of anticipate the question in Item 27 ("Is it 
possible to store a chess game using fewer than 12 bits per half-move?") but 
do not duplicate the solutions listed for that item.

a) Solution a can be reduced to 64 bytes per half-move instead of 128 by using 
case to differentiate between white and black pieces. Thus, 'Q' is the white 
queen, 'q' is the black queen.

b) 

c) Actually, this format has a minimum of three bytes rather than four, 
because an unambiguous capture is often written as a simple "RxB" or "PxP". 
On the other hand, game scores often include remarks like 'e.p.' or 'ch' or 
'mate' or 'resigns' which would add to the space needed. 

A side note: I have *a lot* of chess books that use this notation throughout. 
I don't know when the "new-style" notation became the norm in the 
English-speaking chess world, but in writing up this item I was inspired to 
see if I could find out. An article on the History of Chess Notation 
(http://www.chessmuseum.org/history0799.html) relates that "new-style" 
algebraic notation was invented well before the abbreviated descriptive 
"old-style" notations evolved, but didn't catch on in England. And in the US 
(UK not mentioned in this article) the move to algebraic notation only began 
in the 1970s and took some years to become popular. So "old" and "new" are 
somewhat open to interpretation.

d) It is common, though perhaps not absolutely necessary, to add a few symbols 
like '+' for 'check' and '++' or '#' for checkmate. And what about the 
laconic '?' or '!!' annotations?

Apart from those frills, the notation for castling queenside is '0-0-0', which 
exceeds the four-byte limit. However, we could record castling as a move by 
the king, since it is only when castling that the king moves two squares at 
once, and take the corresponding rook move as implicit.

e) This scheme is officially known as "coordinate" notation.

We can squeeze this down to 11 bits per half-move by recording not the origin 
square, but which piece moved, plus its destination. White and Black each 
have 16 pieces, so all 32 can be unambiguously identified with a 
five-effective-cpp at accu.orgbit code. 

Since White and Black always move alternately, you could even reduce this to 
10 bits by considering only codes for the appropriate player's pieces. Here 
again, noting what piece a pawn promoted to would require a bit of extra 
space, but I'm assuming the new queen (or bishop, knight, or rook) could 
continue to keep the same identifier in the game. Come to think about it, 
it's rare for a pawn to promote to anything except a queen. If queen 
promotion were the assumed default, extra space in the game score would be 
needed only rarely.

Another idea for saving space would be to refer to a known database of 
previous chess games instead of recording all the moves of all the games 
individually. The Chessbase Opening Encyclopedia 2004 
(http://www.chesscentral.com/software/Opening_Encyclopedia.htm) contains a 
database of 1.8 million games, organised by opening. Using a few bytes to 
indicate "repeat the first n moves of standard game x" could save a lot of 
space in a hurry. Since the task involved designing a database to hold all 
the games in the world, even a large database of standard games would add no 
extra overhead to the system. 

My final suggestion is to use a bit of lateral thinking and find some other 
means to reduce the size of the chess database. An off-the-shelf compressor 
like bzip might result in more space-saving than all the bit-twiddling we can 
hack. Of course real tests on real files should be performed before making 
this decision.


Lois



More information about the Effective-cpp mailing list