A Couple of Utility Functions

To finish the implementation of test-classifier, you need to write the two utility functions that don't really have anything particularly to do with spam filtering, shuffle-vector and start-of-file. An easy and efficient way to implement shuffle-vector is using the Fisher-Yates algorithm.[14] You can start by implementing a function, nshuffle-vector, that shuffles a vector in place. This name follows the same naming convention as other destructive functions such as NCONC and NREVERSE. It looks like this:

  (defun nshuffle-vector (vector)
    (loop for idx downfrom (1- (length vector)) to 1
          for other = (random (1+ idx))
          do (unless (= idx other)
               (rotatef (aref vector idx) (aref vector other))))
    vector)

The nondestructive version simply makes a copy of the original vector and passes it to the destructive version.

  (defun shuffle-vector (vector)
    (nshuffle-vector (copy-seq vector)))

The other utility function, start-of-file, is almost as straightforward with just one wrinkle. The most efficient way to read the contents of a file into memory is to create an array of the appropriate size and use READ-SEQUENCE to fill it in. So it might seem you could make a character array that's either the size of the file or the maximum number of characters you want to read, whichever is smaller. Unfortunately, as I mentioned in Chapter 14, the function FILE-LENGTH isn't entirely well defined when dealing with character streams since the number of characters encoded in a file can depend on both the character encoding used and the particular text in the file. In the worst case, the only way to get an accurate measure of the number of characters in a file is to actually read the whole file. Thus, it's ambiguous what FILE-LENGTH should do when passed a character stream; in most implementations, FILE-LENGTH always returns the number of octets in the file, which may be greater than the number of characters that can be read from the file.
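To make the mismatch concrete, here is a minimal sketch that writes three characters needing five octets in UTF-8 and then compares what FILE-LENGTH reports with what READ-SEQUENCE delivers. The names *demo-file*, write-demo-file, and octets-and-characters are hypothetical, the scratch path is made up, and the :utf-8 external format is an assumption: many implementations accept it, but the standard doesn't require it.

```lisp
;; Hypothetical scratch file used only for this illustration.
(defparameter *demo-file* #p"/tmp/file-length-demo.txt")

(defun write-demo-file ()
  "Write three characters that occupy five octets when encoded as UTF-8."
  (with-open-file (out *demo-file* :direction :output
                                   :if-exists :supersede
                                   ;; Assumption: :utf-8 is a supported
                                   ;; external format in this implementation.
                                   :external-format :utf-8)
    ;; Code 233 is e-acute in Unicode-based Lisps; it needs two octets.
    (write-string (coerce (list (code-char 233) #\t (code-char 233)) 'string)
                  out)))

(defun octets-and-characters ()
  "Return FILE-LENGTH's answer and the number of characters actually read."
  (with-open-file (in *demo-file* :external-format :utf-8)
    (let* ((reported (file-length in))
           (buffer (make-string reported))
           (read (read-sequence buffer in)))
      (values reported read))))
```

After calling (write-demo-file), an implementation whose FILE-LENGTH counts octets should return a first value from (octets-and-characters) that is larger than the second, which is the situation start-of-file has to cope with.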
However, READ-SEQUENCE returns the number of characters actually read. So, you can attempt to read the number of characters reported by FILE-LENGTH and return a substring if the actual number of characters read was smaller.

  (defun start-of-file (file max-chars)
    (with-open-file (in file)
      (let* ((length (min (file-length in) max-chars))
             (text (make-string length))
             (read (read-sequence text in)))
        (if (< read length)
            (subseq text 0 read)
            text))))

[14] This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.