
Chapter 25 - Practical—An ID3 Parser
Practical Common Lisp
by Peter Seibel
Apress © 2005

String Types

The other kinds of primitive types that are ubiquitous in the ID3 format are strings. In the previous chapter I discussed some of the issues you have to consider when dealing with strings in binary files, such as the difference between character codes and character encodings.

ID3 uses two different character codes, ISO 8859-1 and Unicode. ISO 8859-1, also known as Latin-1, is an eight-bit character code that extends ASCII with characters used by the languages of Western Europe. In other words, the code points from 0–127 map to the same characters in ASCII and ISO 8859-1, but ISO 8859-1 also provides mappings for code points up to 255. Unicode is a character code designed to provide a code point for virtually every character of all the world’s languages. Unicode is a superset of ISO 8859-1 in the same way that ISO 8859-1 is a superset of ASCII: the code points from 0–255 map to the same characters in both ISO 8859-1 and Unicode. (Thus, Unicode is also a superset of ASCII.)

Since ISO 8859-1 is an eight-bit character code, it’s encoded using one byte per character. For Unicode strings, ID3 uses the UCS-2 encoding with a leading byte order mark.[4] I’ll discuss what a byte order mark is in a moment.

Reading and writing these two encodings isn’t a problem; it’s just a question of reading and writing unsigned integers in various formats, and you just finished writing the code to do that. The trick is how you translate those numeric values to Lisp character objects.

The Lisp implementation you’re using probably uses either Unicode or ISO 8859-1 as its internal character code. And since all the values from 0–255 map to the same characters in both ISO 8859-1 and Unicode, you can use Lisp’s CODE-CHAR and CHAR-CODE functions to translate those values in both character codes. However, if your Lisp supports only ISO 8859-1, then you’ll be able to represent only the first 256 Unicode characters as Lisp characters. In other words, in such a Lisp implementation, if you try to process an ID3 tag that uses Unicode strings and if any of those strings contain characters with code points higher than 255, you’ll get an error when you try to translate the code point to a Lisp character. For now I’ll assume either you’re using a Unicode-based Lisp or you won’t process any files containing characters outside the ISO 8859-1 range.
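
The translation between code points and Lisp characters can be tried at the REPL without any of the binary-data machinery. The following is only a sketch: the helper names code-point->char and round-trips-p are hypothetical, and it assumes your implementation’s character code agrees with ISO 8859-1 for codes 0–255, which is the case in Unicode- and Latin-1-based Lisps.

```lisp
;; Sketch only: CODE-POINT->CHAR and ROUND-TRIPS-P are hypothetical helpers,
;; not part of the chapter's code.

(defun code-point->char (code)
  "Return the Lisp character for CODE, or signal an error if this
implementation can't represent it."
  ;; CODE-CHAR is only specified for codes below CHAR-CODE-LIMIT,
  ;; so check the range before calling it.
  (or (and (< code char-code-limit) (code-char code))
      (error "Character code ~d not supported" code)))

(defun round-trips-p (code)
  "True if CODE survives a trip through CODE-CHAR and back through CHAR-CODE."
  (= code (char-code (code-point->char code))))
```

Evaluating char-code-limit tells you the exclusive upper bound on character codes in your Lisp. In an implementation limited to ISO 8859-1, a call such as (code-point->char #x4e2d) would signal the error rather than return a character.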

The other issue with encoding strings is how to know how many bytes to interpret as character data. ID3 uses two strategies I mentioned in the previous chapter—some strings are terminated with a null character, while other strings occur in positions where you can determine the number of bytes to read, either because the string at that position is always the same length or because the string is at the end of a composite structure whose overall size you know. Note, however, that the number of bytes isn’t necessarily the same as the number of characters in the string.

Putting all these variations together, the ID3 format uses four ways to read and write strings: two character encodings crossed with two ways of delimiting the string data.

Obviously, much of the logic of reading and writing strings will be quite similar. So, you can start by defining two binary types, one for reading strings of a specific length (in characters) and another for reading terminated strings. Both types take advantage of the fact that the type argument to read-value and write-value is just another piece of data; you can make the type of character to read a parameter of these types. This is a technique you’ll use quite a few times in this chapter.

(define-binary-type generic-string (length character-type)
  (:reader (in)
    (let ((string (make-string length)))
      (dotimes (i length)
        (setf (char string i) (read-value character-type in)))
      string))
  (:writer (out string)
    (dotimes (i length)
      (write-value character-type out (char string i)))))

(define-binary-type generic-terminated-string (terminator character-type)
  (:reader (in)
    (with-output-to-string (s)
      (loop for char = (read-value character-type in)
            until (char= char terminator) do (write-char char s))))
  (:writer (out string)
    (loop for char across string
          do (write-value character-type out char)
          finally (write-value character-type out terminator))))
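
To see the two delimiting strategies in isolation, here is a minimal sketch that decodes one-byte-per-character data from a vector of octets instead of a stream. It doesn’t use define-binary-type at all; the function names read-fixed-latin-1 and read-terminated-latin-1 are hypothetical, and, unlike the types above, the terminator here is given as an octet rather than as a character.

```lisp
;; Sketch only: both functions are hypothetical stand-ins that work on an
;; in-memory vector of octets rather than on a binary stream.

(defun read-fixed-latin-1 (octets start length)
  "Decode LENGTH octets of OCTETS beginning at START, one character per octet.
Returns the string and the index just past the data."
  (let ((string (make-string length)))
    (dotimes (i length)
      (setf (char string i) (code-char (aref octets (+ start i)))))
    (values string (+ start length))))

(defun read-terminated-latin-1 (octets start &optional (terminator 0))
  "Decode octets from START up to, but not including, the first TERMINATOR.
Returns the string and the index just past the terminator."
  (let ((end (position terminator octets :start start)))
    (unless end
      (error "No terminator ~d found after index ~d" terminator start))
    (values (map 'string #'code-char (subseq octets start end))
            (1+ end))))
```

Given the octets 84 65 71 0 88 89, reading a terminated string from index 0 yields "TAG" and the index 4, from which a fixed-length read of two characters yields "XY".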

With these types available, there’s not much to reading ISO 8859-1 strings. Because the character-type argument you pass to read-value and write-value of a generic-string must be the name of a binary type, you need to define an iso-8859-1-char binary type. This also gives you a good place to put a bit of sanity checking on the code points of characters you read and write.

(define-binary-type iso-8859-1-char ()
  (:reader (in)
    (let ((code (read-byte in)))
      (or (code-char code)
          (error "Character code ~d not supported" code))))
  (:writer (out char)
    (let ((code (char-code char)))
      (if (<= 0 code #xff)
          (write-byte code out)
          (error
           "Illegal character for iso-8859-1 encoding: character: ~c with code: ~d"
           char code)))))

Now defining the ISO 8859-1 string types is trivial using the short form of define-binary-type as follows:

(define-binary-type iso-8859-1-string (length)
  (generic-string :length length :character-type 'iso-8859-1-char))

(define-binary-type iso-8859-1-terminated-string (terminator)
  (generic-terminated-string :terminator terminator
                             :character-type 'iso-8859-1-char))

Reading UCS-2 strings is only slightly more complex. The complexity arises because you can encode a UCS-2 code point in two ways: most significant byte first (big-endian) or least significant byte first (little-endian). UCS-2 strings therefore start with two extra bytes, called the byte order mark, made up of the numeric value #xfeff encoded in either big-endian form or little-endian form. When reading a UCS-2 string, you read the byte order mark and then, depending on its value, read either big-endian or little-endian characters. Thus, you’ll need two different UCS-2 character types. But you need only one version of the sanity-checking code, so you can define a parameterized binary type like this:

(define-binary-type ucs-2-char (swap)
  (:reader (in)
    (let ((code (read-value 'u2 in)))
      (when swap (setf code (swap-bytes code)))
      (or (code-char code) (error "Character code ~d not supported" code))))
  (:writer (out char)
    (let ((code (char-code char)))
      (unless (<= 0 code #xffff)
        (error "Illegal character for ucs-2 encoding: ~c with char-code: ~d"
               char code))
      (when swap (setf code (swap-bytes code)))
      (write-value 'u2 out code))))

where the swap-bytes function can be defined as follows, taking advantage of LDB being SETFable and thus ROTATEFable:

(defun swap-bytes (code)
  (assert (<= code #xffff))
  (rotatef (ldb (byte 8 0) code) (ldb (byte 8 8) code))
  code)
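
If the ROTATEF idiom seems opaque, it may help to compare it with a version that builds the swapped value explicitly. This is only an illustrative sketch under a hypothetical name, swap-bytes-sketch; it is meant to compute the same result for 16-bit values, not to replace the definition above.

```lisp
;; Sketch only: SWAP-BYTES-SKETCH is a hypothetical, non-destructive
;; equivalent written out with LDB, ASH, and LOGIOR.

(defun swap-bytes-sketch (code)
  "Return the 16-bit integer CODE with its low and high octets exchanged."
  (check-type code (integer 0 #xffff))
  (logior (ash (ldb (byte 8 0) code) 8)   ; low octet moves to the high position
          (ldb (byte 8 8) code)))         ; high octet moves to the low position
```

For example, (swap-bytes-sketch #xfeff) returns #xfffe (65534), which is the value you see when a byte order mark written in one byte order is read in the other.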

Using ucs-2-char, you can define two character types that will be used as the character-type arguments to the generic string functions.

(define-binary-type ucs-2-char-big-endian () (ucs-2-char :swap nil))

(define-binary-type ucs-2-char-little-endian () (ucs-2-char :swap t))

Then you need a function that returns the name of the character type to use based on the value of the byte order mark.

(defun ucs-2-char-type (byte-order-mark)
  (ecase byte-order-mark
    (#xfeff 'ucs-2-char-big-endian)
    (#xfffe 'ucs-2-char-little-endian)))

Now you can define length- and terminator-delimited string types for UCS-2–encoded strings that read the byte order mark and use it to determine which variant of UCS-2 character to pass as the character-type argument to read-value and write-value. The only other wrinkle is that you need to translate the length argument, which is a number of bytes, to the number of characters to read, accounting for the byte order mark.

(define-binary-type ucs-2-string (length)
  (:reader (in)
    (let ((byte-order-mark (read-value 'u2 in))
          (characters (1- (/ length 2))))
      (read-value
       'generic-string in
       :length characters
       :character-type (ucs-2-char-type byte-order-mark))))
  (:writer (out string)
    (write-value 'u2 out #xfeff)
    (write-value
     'generic-string out string
     :length (length string)
     :character-type (ucs-2-char-type #xfeff))))

(define-binary-type ucs-2-terminated-string (terminator)
  (:reader (in)
    (let ((byte-order-mark (read-value 'u2 in)))
      (read-value
       'generic-terminated-string in
       :terminator terminator
       :character-type (ucs-2-char-type byte-order-mark))))
  (:writer (out string)
    (write-value 'u2 out #xfeff)
    (write-value
     'generic-terminated-string out string
     :terminator terminator
     :character-type (ucs-2-char-type #xfeff))))
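
The byte order mark handling and the bytes-to-characters arithmetic can also be sketched on their own, decoding from a vector of octets rather than a stream. The names u2-at and decode-ucs-2 are hypothetical, and the sketch assumes a Lisp that can represent every code point it encounters; it reads each 16-bit value big-endian and swaps the bytes when the mark comes back as #xfffe, mirroring what the types above do.

```lisp
;; Sketch only: U2-AT and DECODE-UCS-2 are hypothetical helpers that work
;; on an in-memory vector of octets.

(defun u2-at (octets index)
  "Read a big-endian 16-bit unsigned integer from OCTETS at INDEX."
  (logior (ash (aref octets index) 8)
          (aref octets (1+ index))))

(defun decode-ucs-2 (octets)
  "Decode OCTETS, a byte order mark followed by UCS-2 code units, to a string."
  (assert (and (evenp (length octets)) (>= (length octets) 2)))
  (let* ((characters (1- (/ (length octets) 2))) ; bytes -> characters, less the mark
         (swap (ecase (u2-at octets 0)
                 (#xfeff nil)                    ; data is big-endian
                 (#xfffe t)))                    ; data is little-endian
         (string (make-string characters)))
    (dotimes (i characters string)
      (let ((code (u2-at octets (* 2 (1+ i)))))  ; skip the two-byte mark
        (when swap
          (setf code (logior (ash (ldb (byte 8 0) code) 8)
                             (ldb (byte 8 8) code))))
        (setf (char string i) (code-char code))))))
```

A six-byte field thus holds two characters: the octets #xfe #xff 0 72 0 105 and the octets #xff #xfe 72 0 105 0 both decode to "Hi".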

[4] In ID3v2.4, UCS-2 is replaced by the virtually identical UTF-16, and UTF-16BE and UTF-8 are added as additional encodings.
