[Exceptional C++ Style] Item 26 -- Data Formats and Efficiency
Lois Goldthwaite
lois at loisgoldthwaite.com
Sun Jan 30 19:06:51 EST 2005
These are some additional comments that occurred to me when I read the
article. The final remarks sort of anticipate the question in Item 27 ("Is it
possible to store a chess game using fewer than 12 bits per half-move?") but
do not duplicate the solutions listed for that item.
a) Solution a can be reduced to 64 bytes per half-move instead of 128 by using
case to differentiate between white and black pieces. Thus, 'Q' is the white
queen, 'q' is the black queen.
b)
c) Actually, this format has a minimum of three bytes rather than four,
because an unambiguous capture is often written as a simple "RxB" or "PxP".
On the other hand, game scores often include remarks like 'e.p.' or 'ch' or
'mate' or 'resigns' which would add to the space needed.
A side note: I have *a lot* of chess books that use this notation throughout.
I don't know when the "new-style" notation became the norm in the
English-speaking chess world, but in writing up this item I was inspired to
see if I could find out. An article on the History of Chess Notation
(http://www.chessmuseum.org/history0799.html) relates that "new-style"
algebraic notation was invented well before the abbreviated descriptive
"old-style" notations evolved, but didn't catch on in England. And in the US
(UK not mentioned in this article) the move to algebraic notation only began
in the 1970s and took some years to become popular. So "old" and "new" are
somewhat open to interpretation.
d) It is common, though perhaps not absolutely necessary, to add a few symbols
like '+' for 'check' and '++' or '#' for checkmate. And what about the
laconic '?' or '!!' annotations?
Apart from those frills, the notation for castling queenside is '0-0-0', which
exceeds the four-byte limit. However, we could record castling as a move by
the king, since it is only when castling that the king moves two squares at
once, and take the corresponding rook move as implicit.
e) This scheme is officially known as "coordinate" notation.
We can squeeze this down to 11 bits per half-move by recording not the origin
square, but which piece moved, plus its destination. White and Black each
have 16 pieces, so all 32 can be unambiguously identified with a
five-effective-cpp at accu.orgbit code.
Since White and Black always move alternately, you could even reduce this to
10 bits by considering only codes for the appropriate player's pieces. Here
again, noting what piece a pawn promoted to would require a bit of extra
space, but I'm assuming the new queen (or bishop, knight, or rook) could
continue to keep the same identifier in the game. Come to think about it,
it's rare for a pawn to promote to anything except a queen. If queen
promotion were the assumed default, extra space in the game score would be
needed only rarely.
Another idea for saving space would be to refer to a known database of
previous chess games instead of recording all the moves of all the games
individually. The Chessbase Opening Encyclopedia 2004
(http://www.chesscentral.com/software/Opening_Encyclopedia.htm) contains a
database of 1.8 million games, organised by opening. Using a few bytes to
indicate "repeat the first n moves of standard game x" could save a lot of
space in a hurry. Since the task involved designing a database to hold all
the games in the world, even a large database of standard games would add no
extra overhead to the system.
My final suggestion is to use a bit of lateral thinking and find some other
means to reduce the size of the chess database. An off-the-shelf compressor
like bzip might result in more space-saving than all the bit-twiddling we can
hack. Of course real tests on real files should be performed before making
this decision.
Lois
More information about the Effective-cpp
mailing list