970 - Binary Format Basics

Chapter 24 - Practical: Parsing Binary Files
Practical Common Lisp
by Peter Seibel
Apress © 2005

Binary Format Basics

The starting point for reading and writing binary files is to open the file for reading or writing individual bytes. As I discussed in Chapter 14, both OPEN and WITH-OPEN-FILE accept a keyword argument, :element-type, that controls the basic unit of transfer for the stream. When you’re dealing with binary files, you’ll specify (unsigned-byte 8). An input stream opened with such an :element-type will return an integer between 0 and 255 each time it’s passed to READ-BYTE. Conversely, you can write bytes to an (unsigned-byte 8) output stream by passing numbers between 0 and 255 to WRITE-BYTE.
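As a rough sketch of what that looks like in practice, the following pair of functions writes a list of bytes to a file and reads them back. The names write-octets and read-octets are made up for illustration and aren’t part of any library; only OPEN’s keyword arguments, READ-BYTE, and WRITE-BYTE are standard.

```lisp
;; Hypothetical helpers for illustration only.
(defun write-octets (path octets)
  "Write each integer in OCTETS (0-255) to the file at PATH as one byte."
  (with-open-file (out path
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    (dolist (octet octets)
      (write-byte octet out))))

(defun read-octets (path)
  "Return a list of all the bytes in the file at PATH as integers."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    ;; With NIL as the second argument, READ-BYTE returns the third
    ;; argument (here NIL) at end of file instead of signaling an error.
    (loop for octet = (read-byte in nil nil)
          while octet
          collect octet)))
```

Given a writable path of your choosing, (write-octets path '(0 127 255)) followed by (read-octets path) should give back the list (0 127 255).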

Above the level of individual bytes, most binary formats use a smallish number of primitive data types—numbers encoded in various ways, textual strings, bit fields, and so on—which are then composed into more complex structures. So your first task is to define a framework for writing code to read and write the primitive data types used by a given binary format.

To take a simple example, suppose you’re dealing with a binary format that uses an unsigned 16-bit integer as a primitive data type. To read such an integer, you need to read the two bytes and then combine them into a single number by multiplying one byte by 256, a.k.a. 2^8, and adding it to the other byte. For instance, assuming the binary format specifies that such 16-bit quantities are stored in big-endian[3] form, with the most significant byte first, you can read such a number with this function:
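A function along these lines might look like the following sketch. The name read-u2 is taken from the footnote later in this section; the parameter name in is an assumption and stands for any input stream opened with an :element-type of (unsigned-byte 8).

```lisp
(defun read-u2 (in)
  "Read a big-endian unsigned 16-bit integer from the byte stream IN."
  ;; Common Lisp evaluates arguments left to right, so the first
  ;; READ-BYTE call returns the most significant byte.
  (+ (* (read-byte in) 256) (read-byte in)))
```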

However, Common Lisp provides a more convenient way to perform this kind of bit twiddling. The function LDB, whose name stands for load byte, can be used to extract and set (with SETF) any number of contiguous bits from an integer.[4] The number of bits and their position within the integer is specified with a byte specifier created with the BYTE function. BYTE takes two arguments, the number of bits to extract (or set) and the position of the rightmost bit where the least significant bit is at position zero. LDB takes a byte specifier and the integer from which to extract the bits and returns the positive integer represented by the extracted bits. Thus, you can extract the least significant octet of an integer like this:
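The two expressions below illustrate the idea, using #xabcd as the sample integer (the same value the big-endian footnote uses). The first extracts the low octet; the second shifts the byte specifier up by eight bits to get the next octet.

```lisp
(ldb (byte 8 0) #xabcd) ; ==> 205, which is #xcd
(ldb (byte 8 8) #xabcd) ; ==> 171, which is #xab
```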

You can use LDB with SETF to set the specified bits of an integer stored in a SETFable place.
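Here’s a small sketch of that, followed by one way a 16-bit big-endian reader could be built on top of it. The variable *num* is just an illustrative name; read-u2 is the name the footnote in this section uses for such a reader, and in is assumed to be an (unsigned-byte 8) input stream.

```lisp
(defvar *num* 0)

(setf (ldb (byte 8 0) *num*) 128) ; *num* is now 128
(setf (ldb (byte 8 8) *num*) 255) ; *num* is now 65408, or #xff80

(defun read-u2 (in)
  "Read a big-endian unsigned 16-bit integer from the byte stream IN."
  (let ((u2 0))
    ;; The first byte read fills bits 8-15, the second fills bits 0-7.
    (setf (ldb (byte 8 8) u2) (read-byte in))
    (setf (ldb (byte 8 0) u2) (read-byte in))
    u2))
```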

To write a number out as a 16-bit integer, you need to extract the individual 8-bit bytes and write them one at a time. To extract the individual bytes, you just need to use LDB with the same byte specifiers.
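A writer built this way might look like the sketch below. The name write-u2 and its argument order are assumptions chosen to mirror the read-u2 name used in this section’s footnotes; out is assumed to be an output stream opened with an :element-type of (unsigned-byte 8).

```lisp
(defun write-u2 (out value)
  "Write VALUE to the byte stream OUT as a big-endian unsigned 16-bit integer."
  ;; Most significant octet first, then the least significant one.
  (write-byte (ldb (byte 8 8) value) out)
  (write-byte (ldb (byte 8 0) value) out))
```

Writing #xabcd this way should put the byte 171 (#xab) in the file first, followed by 205 (#xcd).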

Of course, you can also encode integers in many other ways: with different numbers of bytes, with different endianness, and in signed and unsigned format.
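To give a feel for two of those variations, here’s a hedged sketch of a little-endian 16-bit reader and a signed big-endian one. Both function names are hypothetical, and the signed version assumes the format stores negative numbers in twos-complement form, which is common but not universal.

```lisp
(defun read-u2-little-endian (in)
  "Read an unsigned 16-bit integer stored least significant byte first."
  (let ((u2 0))
    (setf (ldb (byte 8 0) u2) (read-byte in))
    (setf (ldb (byte 8 8) u2) (read-byte in))
    u2))

(defun read-s2 (in)
  "Read a big-endian signed 16-bit integer, assuming twos-complement."
  (let ((u2 0))
    (setf (ldb (byte 8 8) u2) (read-byte in))
    (setf (ldb (byte 8 0) u2) (read-byte in))
    ;; If the sign bit (bit 15) is set, the value is negative.
    (if (logbitp 15 u2)
        (- u2 65536)
        u2)))
```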

[3]The term big-endian and its opposite, little-endian, borrowed from Jonathan Swift’s Gulliver’s Travels, refer to the way a multibyte number is represented in an ordered sequence of bytes such as in memory or in a file. For instance, the number 43981, or abcd in hex, represented as a 16-bit quantity, consists of two bytes, ab and cd. It doesn’t matter to a computer in what order these two bytes are stored as long as everybody agrees. Of course, whenever there’s an arbitrary choice to be made between two equally good options, the one thing you can be sure of is that everybody is not going to agree. For more than you ever wanted to know about it, and to see where the terms big-endian and little-endian were first applied in this fashion, read “On Holy Wars and a Plea for Peace” by Danny Cohen, available at http://khavrinen.lcs.mit.edu/wollman/ien-137.txt.

[4]LDB and DPB, a related function, were named after the DEC PDP-10 assembly functions that did essentially the same thing. Both functions operate on integers as if they were represented using twos-complement format, regardless of the internal representation used by a particular Common Lisp implementation.

[5]Common Lisp also provides functions for shifting and masking the bits of integers in a way that may be more familiar to C and Java programmers. For instance, you could write read-u2 yet a third way, using those functions, like this:
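A sketch of that version, assuming in is an input stream opened with an :element-type of (unsigned-byte 8):

```lisp
(defun read-u2 (in)
  "Read a big-endian unsigned 16-bit integer from the byte stream IN."
  ;; Shift the first byte left by eight bits, then OR in the second byte.
  (logior (ash (read-byte in) 8) (read-byte in)))
```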

The names LOGIOR and ASH are short for LOGical Inclusive OR and Arithmetic SHift. ASH shifts an integer a given number of bits to the left when its second argument is positive or to the right if the second argument is negative. LOGIOR combines integers by logically oring each bit. Another function, LOGAND, performs a bitwise and, which can be used to mask off certain bits. However, for the kinds of bit twiddling you’ll need to do in this chapter and the next, LDB and BYTE will be both more convenient and more idiomatic Common Lisp style.
