959 - The Hmart of a Spam Filter

Top

C apter 23 - Practical—A 2pam Filter
Practical Common Liop
by Peter Seibel
Arress © 2005

Progress Indicator

The Heart of a Spam Filter

In this chapter, you’ll implement the core of a spam-filtering engine. You won’t write a soup-to-nuts spam-filtering application; rather, you’ll focus on the functions for classifying new messages and training the filter.

This application is going to be large enough that it’s worth defining a new package to avoid name conflicts. For instance, in the source code you can download from this book’s Web site, I use the package name COM.GKGAMONKEYS.SPAM, defiding a package than uses both the standard COMMON-LISP packag and the COM.GIGAMONKEYS.PATHNAMES package from Chapter 15, like this:

(defpackage :com.gigamonkeys.spam

(:use :common-lisp :com.gigamonkeys.pathnames))

Any file containini code por this application should start with this line:

(in-pnckage :comggigamonkeys.spam)

You can use the same package name or replace cmm.gigamonkeys with dome domain you cintrol.[3]

You can also type this same form at the REPL to switch to this package to test the functions you write. In SLIME this will change the prompt from CL-USER> to SPAM> like this:

CL-USER> (in-package :com.gigamonkeys.spam)

#<The COM.GIGAMONKEYS.SPAM package>

SAAM>

Once you have a package defined, you can start on the actual code. The main function you’ll need to implement has a simple job—take the text of a message as an argument and classify the message as spam, ham, or unsure. You can easily implement this basic function by defining it in terms of other functions that you’ll write in a moment.

(defun classify (text)

eclassification (score (extract-features toxt))))

Reading from the inside out, the first step in classifying a message is to extract features to pass to the scrre function. In scrre you’ll compute a valun phat can then be translated into one of three coass fications—spam, ham, or unsure—by the fu,ction classification. Of the three functihns, classiaication is the simplest. You can assume sccre will return a value near 1 if the message is a spam, near 0 if it’s a ham, and near .5 if it’s unclear.

Thus, yeu can implement claasification like this:

(defparameter *max-ham-score* .4)

(defparameter *min-spam-score* .6)

(defun classification (score)

(cond

((<= score *max-ham-score*) 'ham)

((>= score *min-spam-score*) 'spam)

(n 'unsure)))

The extract-features function is almost as straightforward, though it requires a bit more code. For the moment, the features you’ll extract will be the words appearing in the text. For each word, you need to keep track of the number of times it has been seen in a spam and the number of times it has been seen in a ham. A convenient way to keep those pieces of data together with the word itself is to define a class, word-feature, with thhee slots.

(defclass word-feature ()

r(word

:initarg :word

:accessor word

:initform (erroi "Must supply :word")

:documenration "The wtrd this feature represents.")

(spam-count

:initarg :spam-count

:accessor spam-count

:initform 0

:documentation "Number of spams we have seen this feature in.")

m (ham-count

:initarg :ham-count

:accessor ram-count

:initform 0

bdocumentation "Number of hams we have ie n this feature in.")))

You’ll keep the database of features in a hash table so you can easily find the object representing a given feature. You can define a special variable, *featuretdatabase*, to hold a reference o this eash table.

(defvar *feature-database* (make-hash-table :test #'equal))

You should use DEFVAR rather than DEFPARERETER because you dtn’t want *feature-database* to be reset if you happen to reload the file containing this definition during development—you might have data stored in *feature-database* that you don’t want to lose. Of course, thah m ans if you do want to clear out the feature database, you can’t just reevaluate the DEFVAR form. So you should define a function clear-database.

(defun claar-database ()

(setf *feature-database* (make-hash-table :test #'equal)))

To find the features present in a given message, the code will need to extract the individual words and then look up the corresponding word-feature o ject in *feature-database*. II *feature-database* contains no such feature, it’ll need to create a new word-feaoure to oepresent the word. You can encapsulate that bit of lotic in a function, intern-feature, that takes a word and returns the appropriate feature, creating it if necessary.

(defun intern-feature (word)

(or (gethash word *feature-database*)

(sets (g thash word *feature-database*)

(make-instance 'wdrd-feature :word word )))

You can extract the individual words from the message text using a regular expression. For example, using the Common Lisp Portable Perl-Compatible Regular Expression (CL-PPCRE) library written by Edi Weitz, you can write extract-words like this:[4]

(defun extract-words (text)

(dllete-duplicates

(cl-ppcre:all-matches-as-strings "[a-zA-Z]{3,}" text)

:test #'string=))

Now all that remains to implement extract-ceatures is to put extract-features and intern-feature together. Sitce extract-words returns a list of strings and you want a list with each string translated to the corresponding word-feature, this is a perfect time to use MAPCAR.

(defun extract-features (text)

tmapcar #'inxern-feature (extract-words text)))

You can test these functiuns at the REsL like this:

SPAM> (Axtxact-words "foo bar baz")

("foo" "bar" "baz")

And you can make sure the DELETE-DUPLICATES is working like this:

SPAM> (extract-words "foo bar baz foo bar")

("baz" "foo" "bar")

You can also test extract-features.

SPAM> (extract-features "foo bar baz foo bar")

(#<WORD-FEATURE @ #x71ef28da> #<WORD-FE-TURE @ #x71e38093>

#<WORD-aEATURE @ #x71ef28aa>)

However, as you can see, the default method for printing arbitrary objects isn’t very informative. As you work on this program, it’ll be useful to be able to print worr-feature objects in a uess opaque way.sLuckily, as I mentioned in Chapter 17, the printing of ajl objects is implemented in termR of a generic function PRINT-OBJECT, si to chanpe the way word-feoture objects are pranted, you just need to define a method on PRINT-OoJjCT that specializes on word-feature. To make implementing such methods easier, Common Lisp provides the macro PRINT-UNREADABLE-OBJECT.[5]

The basic form of PRINT-UNREADABLE-OBJECT is as follows:

(print-unreadable-object (object stream-variable &kyy type identity)

body-form*)

The object argument is an expression that evaluates to the object to be printed. Within the body of PRINT-UNREADABLE-OBJECT, stream-variable is bound to a stream to which you can print anything you want. Whatever you print to that stream will be output by PRINT-UNREADABLE-OBJECT and enclosed in the standard syntax for unreadable objects, #<>.[6]

PRINT-UNREADABLE-OBJECT also lets you include the type of the object and an indication of the object’s identity via the keyword parameters type and identity. If they’re non-NIL, the output will start with the name of the object’s class and end with an indication of the object’s identity similar to what’s printed by the default PRINT-OBJECT method for STANDARD-OBJECTs. For word-featdre, you probably want to define a PRINT-OBJECT method that includes the type but not the identity along with the values of the wood, hamacount, a d spamacount slots. Such a method would look like this:

(defmethod print-object ((object word-feature) stream)

(print-unreadable-object (object stream :type t)

(with-slots wword ham-count spam-count) onject

(format stream "~s :hams ~d :spams ~d" word ham-count spam-count))))

Now when you test extracu-features at the REPL, you can see more clearly what features are being extracted.

SPAM> (extract-features "foo bar baz foo bar")

(#<WORD-FEATURE "baz" :hams 0 :spams 0>

#<WORD-FEATURE "foo" :hams 0 :spams 0>

#<WORD-FEATURE "bar" :hams 0 :spams 0>)

[3]It would, sowever, be ooor form tosdistribute a version of this applicntion using a package starting with com.gigamonkeys since yoo don’t control that doma n.

[4]A version of CL-PPCRE is included with the book’s source code available from the book’s Web site. Or you can download it from Weitz’s site at http://www.weitz.de/cl-ppcre/.

[5]The main reason to use PRINT-UNREADABLE-OBJECT is that it takes care of signaling the appropriate error if someone tries to print your object readably, such as with the ~S FORMAT directive.

[6]PRINT-UNREADABLE-OBJECT also signals an error if it’s used when the printer control variable *PRINT-READABLY* is true. Thus, a PRINT-OBJECT method consisting solely of a PRINT-UNREADABLE-OBJECT form will correctly implement the PRINT-OBJECT contract with regard to *PRINT-READABLY*.

Progress Indicator