
1590592395


Chapter 23 - Practical—A Spam Filter

Practical Common Lisp

by Peter Seibel

Apress © 2005



Testing the Filter

To test the filter, you need a corpus of messages of known types. You can use messages lying around in your inbox, or you can grab one of the corpora available on the Web. For instance, the SpamAssassin corpus[12] contains several thousand messages hand classified as spam, easy ham, and hard ham. To make it easy to use whatever files you have, you can define a test rig that’s driven off an array of file/type pairs. You can define a function that takes a filename and a type and adds it to the corpus like this:

(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

The value of corpus should be an adjustable vector with a fill pointer. For instance, you can make a new corpus like this:

(defparameter *corpus* (make-array 1000 :adjustable t :fill-pointer 0))
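As a quick illustration of why the fill pointer matters, here is a minimal sketch. The variable *demo-corpus* and the filenames are made up for this example. It creates a small adjustable vector whose fill pointer starts at 0 and then pushes more entries than its initial capacity allows:

```lisp
;; Hypothetical demo vector: room for 2 elements, but logically empty
;; because the fill pointer starts at 0.
(defparameter *demo-corpus* (make-array 2 :adjustable t :fill-pointer 0))

;; VECTOR-PUSH-EXTEND stores at the fill pointer and then increments it,
;; growing the underlying storage when the vector is full.
(vector-push-extend (list "msg-1.txt" 'ham) *demo-corpus*)
(vector-push-extend (list "msg-2.txt" 'spam) *demo-corpus*)
(vector-push-extend (list "msg-3.txt" 'ham) *demo-corpus*) ; exceeds initial size

;; LENGTH respects the fill pointer, so it reports only the entries added.
(length *demo-corpus*)   ; 3
(aref *demo-corpus* 2)   ; ("msg-3.txt" HAM)
```

If the fill pointer were instead set to the full size of the array, the first pushed entry would land after all the initial, unused elements, and code looping up to the vector's length would see those empty slots as corpus entries.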

If you have the hams and spams already segregated into separate directories, you might want to add all the files in a directory as the same type. This function, which uses the list-directory function from Chapter 15, will do the trick:

(defun add-directory-to-corpus (dir type corpus)
  (dolist (filename (list-directory dir))
    (add-file-to-corpus filename type corpus)))

For instance, suppose you have a directory mail containing two subdirectories, spam and ham, each containing messages of the indicated type; you can add all the files in those two directories to *corpus* like this:

SPAM> (add-directory-to-corpus "mail/spam/" 'spam *corpus*)

NIL

SPAM> (add-directory-to-corpus "mail/ham/" 'ham *corpus*)

NIL

Now you need a function to test the classifier. The basic strategy will be to select a random chunk of the corpus to train on and then test the classifier by classifying the remainder of the corpus, comparing the classification returned by the classify function to the known classification. The main thing you want to know is how accurate the classifier is: what percentage of the messages are classified correctly? But you’ll probably also be interested in which messages were misclassified and in what direction: were there more false positives or more false negatives? To make it easy to perform different analyses of the classifier’s behavior, you should define the testing functions to build a list of raw results, which you can then analyze however you like.

The main testing function might look like this:

(defun test-classifier (corpus testing-fraction)
  (clear-database)
  (let* ((shuffled (shuffle-vector corpus))
         (size (length corpus))
         (train-on (floor (* size (- 1 testing-fraction)))))
    (train-from-corpus shuffled :start 0 :end train-on)
    (test-from-corpus shuffled :start train-on)))

This function starts by clearing out the feature database.[13] Then it shuffles the corpus, using a function you’ll implement in a moment, and figures out, based on the testing-fraction parameter, how many messages it’ll train on and how many it’ll reserve for testing. The two helper functions train-from-corpus and test-from-corpus will both take :start and :end keyword parameters, allowing them to operate on a subsequence of the given corpus.
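To make the split concrete, here is a rough sketch of the two calculations involved. Both names, shuffle-sketch and training-count, are hypothetical stand-ins invented for this illustration. In particular, shuffle-sketch is not the chapter's own shuffling function, just one plausible way (a Fisher-Yates shuffle over a copy) to get a randomly ordered vector:

```lisp
;; Hypothetical stand-in for the shuffling helper: a Fisher-Yates
;; shuffle that works on a copy, leaving the original vector untouched.
(defun shuffle-sketch (vector)
  (let ((copy (copy-seq vector)))
    (loop for idx downfrom (1- (length copy)) to 1
          do (let ((other (random (1+ idx))))
               ;; Swap the element at IDX with a randomly chosen
               ;; element at or before IDX.
               (rotatef (aref copy idx) (aref copy other))))
    copy))

;; Hypothetical helper: how many messages go to training, using the
;; same arithmetic as the train-on binding in the main testing function.
(defun training-count (size testing-fraction)
  (floor (* size (- 1 testing-fraction))))

(training-count 1000 1/5)  ; 800 to train on, 200 held back for testing
(training-count 10 1/3)    ; 6, since FLOOR rounds 20/3 down
```

Because the result is rounded down, any leftover message ends up in the testing portion rather than the training portion.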

The train-from-corpus function is quite simple: loop over the appropriate part of the corpus, use DESTRUCTURING-BIND to extract the filename and type from the list found in each element, and then pass the text of the named file and the type to train. Since some mail messages, such as those with attachments, are quite large, you should limit the number of characters it’ll take from the message. It’ll obtain the text with a function start-of-file, which you’ll implement in a moment, that takes a filename and a maximum number of characters to return. train-from-corpus looks like this:

(defparameter *max-chars* (* 10 1024))

(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))
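The looping pattern can be tried without any mail files or a trained database. The following sketch uses a made-up three-entry corpus and a hypothetical helper, types-in-range, that walks the same kind of index range and destructures each entry the same way, but only collects the types:

```lisp
;; Hypothetical three-entry corpus of (filename type) lists.
(defparameter *tiny-corpus*
  (vector (list "a.txt" 'ham) (list "b.txt" 'spam) (list "c.txt" 'ham)))

;; Hypothetical helper: collect the types of the entries at indices
;; START (inclusive) up to END (exclusive). When END is not supplied,
;; the loop runs through the end of the corpus.
(defun types-in-range (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus))
        collect (destructuring-bind (file type) (aref corpus idx)
                  (declare (ignore file))
                  type)))

(types-in-range *tiny-corpus* :start 0 :end 2)  ; (HAM SPAM)
(types-in-range *tiny-corpus* :start 2)         ; (HAM)
```

Calling the helper once with :end and once with the same index as :start covers every entry exactly once, which is how a corpus can be divided into a training part and a testing part.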

The test-from-corpus function is similar except you want to return a list containing the results of each classification so you can analyze them after the fact. Thus, you should capture both the classification and score returned by classify and then collect a list of the filename, the actual type, the type returned by classify, and the score. To make the results more human readable, you can include keywords in the list to indicate which values are which.

(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list
             :file file
             :type type
             :classification classification
             :score score)))))
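Because each result is a property list, individual values can be pulled out with GETF. As a rough sketch of one possible analysis, the following computes the fraction of correctly classified messages. The sample data, the scores, and the helper name fraction-correct are all invented for illustration, and a result is assumed to count as correct only when the two types match exactly:

```lisp
;; Hypothetical results in the same plist shape as the list built by
;; test-from-corpus. The filenames and scores are made up.
(defparameter *sample-results*
  (list (list :file "a.txt" :type 'ham  :classification 'ham  :score 0.1)
        (list :file "b.txt" :type 'spam :classification 'spam :score 0.9)
        (list :file "c.txt" :type 'ham  :classification 'spam :score 0.8)
        (list :file "d.txt" :type 'spam :classification 'spam :score 0.95)))

;; Hypothetical helper: fraction of results whose predicted type
;; matches the known type. Assumes RESULTS is non-empty.
(defun fraction-correct (results)
  (/ (count-if (lambda (result)
                 (eql (getf result :type) (getf result :classification)))
               results)
     (length results)))

(fraction-correct *sample-results*)  ; 3/4
```

Here the one miss, c.txt, is a ham message classified as spam, which is a false positive.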

[12] Several spam corpora, including the SpamAssassin corpus, are linked to from http://nexp.cs.pdx.edu/~psam/cgi-bin/view/PSAM/CorpusSets.

[13] If you wanted to conduct a test without disturbing the existing database, you could bind *feature-database*, *total-spams*, and *total-hams* with a LET, but then you’d have no way of looking at the database after the fact, unless you returned the values you used within the function.
