Per-Word Statistics
The heart of a statistical spam filter is, of course, the functions that compute statistics-based probabilities. The mathematical nuances[9] of why exactly these computations work are beyond the scope of this book; interested readers may want to refer to several papers by Gary Robinson.[10] I'll focus rather on how they're implemented.

The starting point for the statistical computations is the set of measured values: the frequencies stored in *feature-database*, *total-spams*, and *total-hams*. Assuming that the set of messages trained on is statistically representative, you can treat the observed frequencies as probabilities of the same features showing up in hams and spams in future messages.

The basic plan is to classify a message by extracting the features it contains, computing the individual probability that a given message containing the feature is a spam, and then combining all the individual probabilities into a total score for the message. Messages with many "spammy" features and few "hammy" features will receive a score near 1, and messages with many hammy features and few spammy features will score near 0.

The first statistical function you need is one that computes the basic probability that a message containing a given feature is a spam. From one point of view, the probability that a given message containing the feature is a spam is the ratio of spam messages containing the feature to all messages containing the feature. Thus, you could compute it this way:

(defun spam-probability (feature)
  (with-slots (spam-count ham-count) feature
    (/ spam-count (+ spam-count ham-count))))

The problem with the value computed by this function is that it's strongly affected by the overall probability that any message will be a spam or a ham. For instance, suppose you get nine times as much ham as spam in general. A completely neutral feature will then appear in one spam for every nine hams, giving you a spam probability of 1/10 according to this function. But you're more interested in the probability that a given feature will appear in a spam message, independent of the overall probability of getting a spam or ham. Thus, you need to divide the spam count by the total number of spams trained on and the ham count by the total number of hams. To avoid division-by-zero errors, if either of *total-spams* or *total-hams* is zero, you should treat the corresponding frequency as zero. (Obviously, if the total number of either spams or hams is zero, then the corresponding per-feature count will also be zero, so you can treat the resulting frequency as zero without ill effect.)

(defun spam-probability (feature)
  (with-slots (spam-count ham-count) feature
    (let ((spam-frequency (/ spam-count (max 1 *total-spams*)))
          (ham-frequency (/ ham-count (max 1 *total-hams*))))
      (/ spam-frequency (+ spam-frequency ham-frequency)))))

This version suffers from another problem: it doesn't take into account the number of messages analyzed to arrive at the per-word probabilities. Suppose you've trained on 2,000 messages, half spam and half ham. Now consider two features that have appeared only in spams. One has appeared in all 1,000 spams, while the other appeared only once. According to the current definition of spam-probability, the appearance of either feature predicts that a message is spam with equal probability, namely, 1.
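To make both points concrete, here's a quick REPL-style sketch. It assumes the word-feature class and the *total-spams* and *total-hams* special variables defined earlier in the chapter (with :word, :spam-count, and :ham-count initargs), plus the frequency-based spam-probability just defined; the feature words and counts are made up purely for illustration.

;; Suppose you've trained on 100 spams and 900 hams, and a neutral
;; feature has shown up in one spam for every nine hams.
(setf *total-spams* 100 *total-hams* 900)

(defparameter *neutral*
  (make-instance 'word-feature :word "neutral" :spam-count 1 :ham-count 9))

;; The raw ratio would give 1/(1 + 9) = 1/10, but the frequency-based
;; version rates the feature as neutral:
(spam-probability *neutral*)  ; => 1/2

;; The remaining problem: with 1,000 spams and 1,000 hams trained, a
;; feature seen in a single spam scores the same as one seen in all of them.
(setf *total-spams* 1000 *total-hams* 1000)

(spam-probability
 (make-instance 'word-feature :word "rare" :spam-count 1 :ham-count 0))
;; => 1

(spam-probability
 (make-instance 'word-feature :word "common" :spam-count 1000 :ham-count 0))
;; => 1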
However, it's still quite possible that the feature that has appeared only once is actually a neutral feature; it's obviously rare in either spams or hams, appearing only once in 2,000 messages. If you trained on another 2,000 messages, it might very well appear one more time, this time in a ham, making it suddenly a neutral feature with a spam probability of .5.

So it seems you might like to compute a probability that somehow factors in the number of data points that go into each feature's probability. In his papers, Robinson suggested a function based on the Bayesian notion of incorporating observed data into prior knowledge or assumptions. Basically, you calculate a new probability by starting with an assumed prior probability and a weight to give that assumed probability before adding new information. Robinson's function is this:

(defun bayesian-spam-probability (feature &optional
                                  (assumed-probability 1/2)
                                  (weight 1))
  (let ((basic-probability (spam-probability feature))
        (data-points (+ (spam-count feature) (ham-count feature))))
    (/ (+ (* weight assumed-probability)
          (* data-points basic-probability))
       (+ weight data-points))))

Robinson suggests values of 1/2 for assumed-probability and 1 for weight. Using those values, a feature that has appeared in one spam and no hams has a bayesian-spam-probability of 0.75, a feature that has appeared in 10 spams and no hams has a bayesian-spam-probability of approximately 0.955, and one that has matched in 1,000 spams and no hams has a spam probability of approximately 0.9995. (The short sketch after the notes below checks these numbers at the REPL.)

[9] Speaking of mathematical nuances, hard-core statisticians may be offended by the sometimes loose use of the word probability in this chapter. However, since even the pros, who are divided between the Bayesians and the frequentists, can't agree on what a probability is, I'm not going to worry about it. This is a book about programming, not statistics.

[10] Robinson's articles that directly informed this chapter are "A Statistical Approach to the Spam Problem" (published in the Linux Journal and available at http://www.linuxjournal.com/article.php?sid=6467 and in a shorter form on Robinson's blog at http://radio.weblogs.com/0101054/stories/2002/09/16/spamDetection.html) and "Why Chi? Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification" (available at http://garyrob.blogs.com/whychi93.pdf). Another article that may be useful is "Handling Redundancy in Email Token Probabilities" (available at http://garyrob.blogs.com//handlingtokenredundancy94.pdf). The archived mailing lists of the SpamBayes project (http://spambayes.sourceforge.net/) also contain a lot of useful information about different algorithms and approaches to testing spam filters.
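Here's the promised check of those numbers: a sketch that evaluates bayesian-spam-probability on made-up features seen only in spams. It again assumes the word-feature class from earlier in the chapter; spam-only-feature is a throwaway helper invented just for this check, and the exact floating-point digits will vary by implementation.

;; With weight 1 and assumed probability 1/2, a feature seen only in
;; spams has a basic spam-probability of 1, so the formula reduces to
;; (1/2 + data-points) / (1 + data-points).
(flet ((spam-only-feature (n)
         (make-instance 'word-feature :word "x" :spam-count n :ham-count 0)))
  (list
   (float (bayesian-spam-probability (spam-only-feature 1)))      ; => 0.75
   (float (bayesian-spam-probability (spam-only-feature 10)))     ; => ~0.955
   (float (bayesian-spam-probability (spam-only-feature 1000))))) ; => ~0.9995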