[Exceptional C++ Style] Item 26 -- Data Formats and Efficiency

Sun Jan 30 19:05:10 EST 2005

Summary of Sutter Item 26:

Data Formats and Efficiency, Part 1:
When Compression Is the Name of the Game

A basic understanding of chess is assumed.

JG Question:
1. Which of these standard containers uses the least memory to store the same 
number of objects of the same type T: deque, list, set, or vector? Explain.

Guru Question
2. You are creating a worldwide chess server to store all games ever played. 
Demonstrate a format for representing a chess game that requires the 
indicated amount of data to store each half-move (one move played by one 
side). Assume each byte has 8 bits.

a) 128 bytes per half-move
b) 32 bytes per half-move
c) minimum 4 bytes and maximum 8 bytes per half-move
d) minimum 2 bytes and maximum 4 bytes per half-move
e) 12 bits per half-move

Solutions
1. Each kind of container in the standard C++ library chooses a different 
trade-off of space vs performance.

- A vector<T> stores T objects in a contiguous array and so has no per-element 
overhead at all.

- A deque<T> stores T objects in chunks or "pages", each chunk containing a 
portion of a (notionally) contiguous array. This typically requires one extra 
pointer of management information per chunk, but that works out to only a 
fraction of a bit in per-element overhead.

- A list<T> is a doubly-linked list of nodes that hold T elements. This means 
that each T object added to list also brings along two pointers in overhead.

- A set<T> (and also multiset<T>, map<T>, and multimap<T>) is usually 
implemented as a tree structure with three pointers for each element added. 

Part of choosing an efficient in-memory representation of data is choosing the 
most space- and time-efficient container which supports the functionality you 
need. Another part of the task is determining how to represent the data that 
will be put into those containers, hence the second part of this Item.

2. What data format will represent a half-move in:

a) 128 bytes: Store the whole board position after each half-move, listing the 
contents of each of the 64 squares. In this scheme, rank one at the beginning 
of the game would be rendered as "WRWNWBWQWKWBWNWR". Black pieces are 
prefixed with "B"; unoccupied squares are denoted by '.'.

b) 32 bytes: Keeping the basic strategy of storing the whole board position, 
we can represent each square in only four bits: three bits to indicate the 
piece (0=K, 1=Q, 2=R, 3=B, 4=N, 5=P, 6=none) plus one bit to store the 
colour.

c) 4 to 8 bytes: We can achieve this by using the old-style "descriptive" 
chess notation: P-K4, RKN1-KB1, P-KN8(Q). No extra trailing null or other 
delimiter is needed, because the move format can be decoded unambiguously.

d) 2 to 4 bytes: Each half-move can be recorded as text in modern "algebraic" 
chess notation, which requires a minimum of two characters (d4) and a maximum 
of 4 (Rgf1, gh=Q). Again, no special move delimiter is needed.

e) 12 bits: Since the starting position of each piece is fixed, we could 
encapsulate a half move by simply recording the origin and destination square 
every time a piece moves. Since there are 64 squares, it takes only six bits 
to represent one, and therefore 12 bits to represent both. However, recording 
a pawn promotion would require some additional space beyond the 12 bits.

Lois