Training the Filter

Now that you have a way to keep track of individual features, you’re almost ready to implement score. But first you need to write the code you’ll use to train the spam filter so score will have some data to use. You’ll define a function, train, that takes some text and a symbol indicating what kind of message it is (ham or spam) and that increments either the ham count or the spam count of all the features present in the text, as well as a global count of hams or spams processed. Again, you can take a top-down approach and implement it in terms of other functions that don’t yet exist.

    (defun train (text type)
      (dolist (feature (extract-features text))
        (increment-count feature type))
      (increment-total-count type))

You’ve already written extract-features, so next up is increment-count, which takes a word-feature and a message type and increments the appropriate slot of the feature. Since there’s no reason to think that the logic of incrementing these counts is going to change for different kinds of objects, you can write this as a regular function.[7] Because you defined both ham-count and spam-count with an :accessor option, you can use INCF and the accessor functions created by DEFCLASS to increment the appropriate slot.

    (defun increment-count (feature type)
      (ecase type
        (ham (incf (ham-count feature)))
        (spam (incf (spam-count feature)))))

The ECASE construct is a variant of CASE, both of which are similar to case statements in Algol-derived languages (renamed switch in C and its progeny). They both evaluate their first argument, the key form, and then find the clause whose first element, the key, is the same value according to EQL. In this case, that means the variable type is evaluated, yielding whatever value was passed as the second argument to increment-count.

The keys aren’t evaluated. In other words, the value of type will be compared to the literal objects read by the Lisp reader as part of the ECASE form.
In this function, that means the keys are the symbols ham and spam, not the values of any variables named ham and spam. So if increment-count is called like this:

    (increment-count some-feature 'ham)

the value of type will be the symbol ham, and the first branch of the ECASE will be evaluated and the feature’s ham count incremented. On the other hand, if it’s called like this:

    (increment-count some-feature 'spam)

then the second branch will run, incrementing the spam count. Note that the symbols ham and spam are quoted when calling increment-count since otherwise they’d be evaluated as the names of variables. But they’re not quoted when they appear in ECASE since ECASE doesn’t evaluate the keys.[8]

The E in ECASE stands for “exhaustive” or “error,” meaning ECASE should signal an error if the key value is anything other than one of the keys listed. The regular CASE is looser, returning NIL if no matching clause is found.

To implement increment-total-count, you need to decide where to store the counts; for the moment, two more special variables, *total-spams* and *total-hams*, will do fine.

    (defvar *total-spams* 0)
    (defvar *total-hams* 0)

    (defun increment-total-count (type)
      (ecase type
        (ham (incf *total-hams*))
        (spam (incf *total-spams*))))

You should use DEFVAR to define these two variables for the same reason you used it with *feature-database*: they’ll hold data built up while you run the program that you don’t necessarily want to throw away just because you happen to reload your code during development.
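The reload behavior that motivates DEFVAR can be checked directly. The following is a minimal sketch that simulates reloading a source file by evaluating the same definition twice; *scratch-count* is a hypothetical variable introduced here only to contrast DEFVAR with DEFPARAMETER and is not part of the spam filter.

```lisp
;; First load: *total-hams* is unbound, so DEFVAR assigns the initial value.
(defvar *total-hams* 0)

;; Pretend three ham messages were trained.
(incf *total-hams* 3)

;; "Reloading" the file evaluates the DEFVAR again. Because the variable
;; is already bound, DEFVAR leaves the accumulated value alone.
(defvar *total-hams* 0)
;; *total-hams* is still 3 here.

;; Hypothetical contrast: DEFPARAMETER always assigns its initial value,
;; so the same reload would discard the accumulated count.
(defparameter *scratch-count* 0)
(incf *scratch-count* 3)
(defparameter *scratch-count* 0)
;; *scratch-count* is back to 0 here.
```

This is why counters that represent training data are defined with DEFVAR: only an explicit reset should zero them.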
But you’ll want to reset those variables if you ever reset *feature-database*, so you should add a few lines to clear-database as shown here:

    (defun clear-database ()
      (setf
       *feature-database* (make-hash-table :test #'equal)
       *total-spams* 0
       *total-hams* 0))

[7] If you decide later that you do need to have different versions of increment-count for different classes, you can redefine increment-count as a generic function and this function as a method specialized on word-feature.

[8] Technically, the key in each clause of a CASE or ECASE is interpreted as a list designator, an object that designates a list of objects. A single nonlist object, treated as a list designator, designates a list containing just that one object, while a list designates itself. Thus, each clause can have multiple keys; CASE and ECASE will select the clause whose list of keys contains the value of the key form. For example, if you wanted to make good a synonym for ham and bad a synonym for spam, you could write increment-count like this:

    (defun increment-count (feature type)
      (ecase type
        ((ham good) (incf (ham-count feature)))
        ((spam bad) (incf (spam-count feature)))))
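To see the pieces of this section working together, here is a self-contained sketch that can be loaded into a fresh Lisp image. The definitions of train, increment-count, increment-total-count, and clear-database follow the text. The rest are assumptions made only so the example runs on its own: a simplified word-feature class, a simplified intern-feature, and extract-words, a hypothetical whitespace tokenizer standing in for whatever extract-features really uses.

```lisp
(defvar *feature-database* (make-hash-table :test #'equal))
(defvar *total-spams* 0)
(defvar *total-hams* 0)

;; Assumed, simplified version of the feature class.
(defclass word-feature ()
  ((word :initarg :word :accessor word)
   (spam-count :initform 0 :accessor spam-count)
   (ham-count :initform 0 :accessor ham-count)))

;; Assumed helper: find or create the feature object for WORD.
(defun intern-feature (word)
  (or (gethash word *feature-database*)
      (setf (gethash word *feature-database*)
            (make-instance 'word-feature :word word))))

;; Hypothetical stand-in tokenizer: split on spaces, drop duplicates.
(defun extract-words (text)
  (let ((words '())
        (start 0))
    (loop for end = (position #\Space text :start start)
          do (let ((word (subseq text start end)))
               (when (plusp (length word))
                 (pushnew word words :test #'string=)))
             (if end
                 (setf start (1+ end))
                 (return)))
    (nreverse words)))

(defun extract-features (text)
  (mapcar #'intern-feature (extract-words text)))

(defun increment-count (feature type)
  (ecase type
    (ham (incf (ham-count feature)))
    (spam (incf (spam-count feature)))))

(defun increment-total-count (type)
  (ecase type
    (ham (incf *total-hams*))
    (spam (incf *total-spams*))))

(defun train (text type)
  (dolist (feature (extract-features text))
    (increment-count feature type))
  (increment-total-count type))

(defun clear-database ()
  (setf *feature-database* (make-hash-table :test #'equal)
        *total-spams* 0
        *total-hams* 0))

;; Usage: start from a clean slate, then train on three messages.
(clear-database)
(train "cheap pills cheap" 'spam)   ; "cheap" is counted once per message
(train "lunch at noon" 'ham)
(train "cheap lunch" 'ham)

;; ECASE rejects an unknown message type with an error, where CASE
;; would quietly return NIL.
(defparameter *unknown-type-result*
  (handler-case (increment-total-count 'junk)
    (type-error () :rejected)))
```

After these calls, *total-spams* is 1 and *total-hams* is 2; the feature for "cheap" has a spam count of 1 and a ham count of 1, and the feature for "lunch" has a ham count of 2. The call with the symbol junk is caught by the handler, so *unknown-type-result* is :rejected and neither total changes.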