Chapter 23: Practical: A Spam Filter
Overview

In 2002 Paul Graham, having some time on his hands after selling Viaweb to Yahoo, wrote the essay “A Plan for Spam”[1] that launched a minor revolution in spam-filtering technology. Prior to Graham’s article, most spam filters were written in terms of handcrafted rules: if a message has XXX in the subject, it’s probably a spam; if a message has three or more words in a row in ALL CAPITAL LETTERS, it’s probably a spam. Graham spent several months trying to write such a rule-based filter before realizing it was fundamentally a soul-sucking task.

“To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.”

To avoid having to think like a spammer, Graham decided to try distinguishing spam from nonspam, a.k.a. ham, based on statistics gathered about which words occur in which kinds of e-mails. The filter would keep track of how often specific words appear in both spam and ham messages and then use the frequencies associated with the words in a new message to compute a probability that it was either spam or ham. He called his approach Bayesian filtering after the statistical technique that he used to combine the individual word frequencies into an overall probability.[2]

[1] Available at http://www.paulgraham.com/spam.html and also in Hackers & Painters: Big Ideas from the Computer Age (O’Reilly, 2004).

[2] There has since been some disagreement over whether the technique Graham described was actually “Bayesian.” However, the name has stuck and is well on its way to becoming a synonym for “statistical” when talking about spam filters.
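The idea of tracking word frequencies and combining them into a single probability can be made concrete with a small sketch. The following is an illustration in Python, not the filter built in this chapter and not necessarily the exact formula from Graham’s essay. The class name `WordStats`, the letters-only tokenizer, the clamping of per-word probabilities to the range 0.01 to 0.99, and the naive product rule used to combine them are all assumptions chosen for brevity.

```python
from collections import Counter
import re


class WordStats:
    """Tracks how many spam and ham messages each word has appeared in."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    @staticmethod
    def words(text):
        # Hypothetical tokenizer: the distinct lowercase runs of letters.
        return set(re.findall(r"[a-z]+", text.lower()))

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(self.words(text))
        if is_spam:
            self.spam_messages += 1
        else:
            self.ham_messages += 1

    def word_spam_probability(self, word):
        # Fraction of spam (and ham) messages that contained the word.
        spam_freq = self.spam_counts[word] / max(1, self.spam_messages)
        ham_freq = self.ham_counts[word] / max(1, self.ham_messages)
        if spam_freq + ham_freq == 0:
            return None  # Word never seen in training; it carries no evidence.
        p = spam_freq / (spam_freq + ham_freq)
        # Assumed clamp so that no single word is treated as absolute proof.
        return min(0.99, max(0.01, p))

    def spam_probability(self, text):
        # Assumed combining rule: treat words as independent pieces of evidence.
        spam_product = 1.0
        ham_product = 1.0
        for word in self.words(text):
            p = self.word_spam_probability(word)
            if p is None:
                continue
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)
```

After a few calls such as `stats.train("buy cheap pills now", is_spam=True)` and `stats.train("lunch meeting at noon", is_spam=False)`, calling `stats.spam_probability(...)` on a new message returns a score near 1 for text made of spam-associated words, near 0 for ham-associated words, and exactly 0.5 when none of the words have been seen. A production filter would also need to guard against floating-point underflow on long messages, which this sketch ignores.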