
Chapter 23 - Practical—A Spam Filter
Practical Common Lisp
by Peter Seibel
Apress © 2005

Couple of Utility Functions

To finish the implementation of test-classifier, you need to write the two utility functions that don’t really have anything particularly to do with spam filtering, shuffle-vector and start-of-file.

An easy and efficient way to implement shuffle-vector is using the Fisher-Yates algorithm.[14] You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. This name follows the same naming convention of other destructive functions such as NCONC and NREVERSE. It looks like this:
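A minimal sketch of such an in-place shuffle, assuming the usual form of Fisher-Yates that walks from the last index down to 1 and swaps each element with one at a randomly chosen index at or below it:

```lisp
(defun nshuffle-vector (vector)
  "Shuffle VECTOR in place with Fisher-Yates and return the same vector."
  (loop for idx downfrom (1- (length vector)) to 1
        ;; Pick a random index in the range 0..idx, inclusive.
        for other = (random (1+ idx))
        do (unless (= idx other)
             ;; Swap the two elements in place.
             (rotatef (aref vector idx) (aref vector other))))
  vector)
```

Since each position is swapped with a uniformly chosen index at or below it, every permutation is equally likely as long as RANDOM is uniform.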

The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.
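As a sketch, that wrapper can be a one-liner around COPY-SEQ. The in-place helper is repeated here only so that the example can be loaded on its own:

```lisp
;; In-place Fisher-Yates shuffle, as described in the text above.
(defun nshuffle-vector (vector)
  (loop for idx downfrom (1- (length vector)) to 1
        for other = (random (1+ idx))
        do (unless (= idx other)
             (rotatef (aref vector idx) (aref vector other))))
  vector)

;; Nondestructive version: shuffle a fresh copy, leaving the argument alone.
(defun shuffle-vector (vector)
  (nshuffle-vector (copy-seq vector)))
```

Calling (shuffle-vector v) returns a new vector with the same elements in random order, while v itself keeps its original order.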

The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use READ-SEQUENCE to fill it in. So it might seem you could make a character array that’s either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function FILE-LENGTH isn’t entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it’s ambiguous what FILE-LENGTH should do when passed a character stream; in most implementations, FILE-LENGTH always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.

However, READ-SEQUENCE returns the number of characters actually read. So, you can attempt to read the number of characters reported by FILE-LENGTH and return a substring if the actual number of characters read was smaller.
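A sketch of start-of-file along those lines, assuming it takes a file name and a maximum character count (the parameter name max-chars and the local variable names are illustrative choices, not taken from the text):

```lisp
(defun start-of-file (file max-chars)
  "Return up to MAX-CHARS characters from the start of FILE as a string."
  (with-open-file (in file)
    (let* (;; FILE-LENGTH may report octets, so treat it as an upper bound.
           (limit (min (file-length in) max-chars))
           (text (make-string limit))
           ;; READ-SEQUENCE returns the index of the first element not
           ;; updated, which here is the number of characters read.
           (chars-read (read-sequence text in)))
      (if (< chars-read limit)
          (subseq text 0 chars-read)
          text))))
```

Because FILE-LENGTH can overestimate the character count, the buffer may be larger than needed; SUBSEQ trims it to the part that READ-SEQUENCE actually filled.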

[14] This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.
