Chapter 23. Practical: A Spam Filter

Practical Common Lisp

by Peter Seibel

Apress © 2005



Combining Probabilities

Now that you can compute the bayesian-spam-probability of each individual feature you find in a message, the last step in implementing the score function is to find a way to combine a bunch of individual probabilities into a single value between 0 and 1.

If the individual feature probabilities were independent, then it’d be mathematically sound to multiply them together to get a combined probability. But it’s unlikely they actually are independent; certain features are likely to appear together, while others never do.[11]

Robinson proposed using a method for combining probabilities invented by the statistician R. A. Fisher. Without going into the details of exactly why his technique works, it’s this: First you combine the probabilities by multiplying them together. This gives you a number nearer to 0 the more low probabilities there were in the original set. Then take the log of that number and multiply by –2. Fisher showed in 1950 that if the individual probabilities were independent and drawn from a uniform distribution between 0 and 1, then the resulting value would be on a chi-square distribution. This value and twice the number of probabilities can be fed into an inverse chi-square function, and it’ll return the probability that reflects the likelihood of obtaining a value that large or larger by combining the same number of randomly selected probabilities. When the inverse chi-square function returns a low probability, it means there was a disproportionate number of low probabilities (either a lot of relatively low probabilities or a few very low probabilities) in the individual probabilities.
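To make the recipe concrete, here are the raw statistics for a couple of made-up sets of three probabilities each (the fisher function defined later in this section packages exactly this computation):

;; Three middling probabilities give a modest value, nothing unusual for a
;; chi-square distribution with 2 * 3 = 6 degrees of freedom ...
(* -2 (log (* 0.5d0 0.5d0 0.5d0)))    ; => about 4.16
;; ... while three very low probabilities give a value far out in the tail,
;; which the inverse chi-square function will report as very improbable.
(* -2 (log (* 0.01d0 0.01d0 0.01d0))) ; => about 27.6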

To use this probability in determining whether a given message is a spam, you start with a null hypothesis, a straw man you hope to knock down. The null hypothesis is that the message being classified is in fact just a random collection of features. If it were, then the individual probabilities (the likelihood that each feature would appear in a spam) would also be random. That is, a random selection of features would usually contain some features with a high probability of appearing in spam and other features with a low probability of appearing in spam. If you were to combine these randomly selected probabilities according to Fisher’s method, you should get a middling combined value, which the inverse chi-square function will tell you is quite likely to arise just by chance, as, in fact, it would have. But if the inverse chi-square function returns a very low probability, it means it’s unlikely that the probabilities that went into the combined value were selected at random; there were too many low probabilities for that to be likely. So you can reject the null hypothesis and instead adopt the alternative hypothesis that the features involved were drawn from a biased sample, one with few high spam probability features and many low spam probability features. In other words, it must be a ham message.

However, the Fisher method isn’t symmetrical since the inverse chi-square function returns the probability that a given number of randomly selected probabilities would combine to a value as large or larger than the one you got by combining the actual probabilities. This asymmetry works to your advantage because when you reject the null hypothesis, you know what the more likely hypothesis is. When you combine the individual spam probabilities via the Fisher method, and it tells you there’s a high probability that the null hypothesis is wrong—that the message isn’t a random collection of words—then it means it’s likely the message is a ham. The number returned is, if not literally the probability that the message is a ham, at least a good measure of its “hamminess.” Conversely, the Fisher combination of the individual ham probabilities gives you a measure of the message’s “spamminess.”

To get a final score, you need to combine those two measures into a single number that gives you a combined hamminess-spamminess score ranging from 0 to 1. The method recommended by Robinson is to add half the difference between the hamminess and spamminess scores to 1/2, in other words, to average the spamminess and 1 minus the hamminess. This has the nice effect that when the two scores agree (high spamminess and low hamminess, or vice versa) you’ll end up with a strong indicator near either 0 or 1. But when the spamminess and hamminess scores are both high or both low, then you’ll end up with a final value near 1/2, which you can treat as an “uncertain” classification.
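For instance, with made-up numbers: a spamminess of 0.99 and a hamminess of 0.02 combine to (0.99 + (1 - 0.02)) / 2 = 0.985, a strong spam indication, while a spamminess and a hamminess of 0.9 each combine to (0.9 + (1 - 0.9)) / 2 = 0.5, squarely in the “uncertain” middle.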

The score function that implements this scheme looks like this:

(defun score (features)
  (let ((spam-probs ()) (ham-probs ()) (number-of-probs 0))
    (dolist (feature features)
      (unless (untrained-p feature)
        (let ((spam-prob (float (bayesian-spam-probability feature) 0.0d0)))
          (push spam-prob spam-probs)
          (push (- 1.0d0 spam-prob) ham-probs)
          (incf number-of-probs))))
    (let ((h (- 1 (fisher spam-probs number-of-probs)))
          (s (- 1 (fisher ham-probs number-of-probs))))
      (/ (+ (- 1 h) s) 2.0d0))))

You take a list of features and loop over them, building up two lists of probabilities, one listing the probabilities that a message containing each feature is a spam and the other that a message containing each feature is a ham. As an optimization, you can also count the number of probabilities while looping over them and pass the count to fisher to avoid having to count them again in fisher itself. The value returned by fisher will be low if the individual probabilities contained too many low probabilities to have come from random text. Thus, a low fisher score for the spam probabilities means there were many hammy features; subtracting that score from 1 gives you a probability that the message is a ham. Conversely, subtracting the fisher score for the ham probabilities gives you the probability that the message was a spam. Combining those two probabilities gives you an overall spamminess score between 0 and 1.
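As a quick sanity check, you could call score by hand at the REPL; the following assumes the extract-features function defined earlier in the chapter and a feature database that has already been trained on some spams and hams:

;; Hypothetical REPL experiment; the actual value depends entirely on your
;; training data.
(score (extract-features "Make money fast"))
;; => near 1 for text dominated by spammy features, near 0 for hammy text,
;;    and near 0.5 when the evidence is mixed or missing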

Within the loop, you can use the function untrained-p to skip features extracted from the message that were never seen during training. These features will have spam counts and ham counts of zero. The untrained-p function is trivial.

(defun untrained-p (feature)
  (with-slots (spam-count ham-count) feature
    (and (zerop spam-count) (zerop ham-count))))

The only other new function is fisher itself. Assuming you already had an inverse-chi-square function, fisher is conceptually simple.

(defun fisher (probs number-of-probs)
  "The Fisher computation described by Robinson."
  (inverse-chi-square
   (* -2 (log (reduce #'* probs)))
   (* 2 number-of-probs)))
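The inverse-chi-square function isn’t defined in this section. If you want to experiment with fisher right away, one minimal sketch, assuming the degrees of freedom are always even (which is all fisher ever passes), uses the closed-form series for the upper tail of the chi-square distribution:

(defun inverse-chi-square (value degrees-of-freedom)
  "Probability of getting a chi-square value this large or larger by chance,
for an even number of degrees of freedom."
  (assert (evenp degrees-of-freedom))
  ;; For 2k degrees of freedom the upper tail is
  ;; exp(-x/2) * sum over i from 0 below k of (x/2)^i / i!
  (min
   (loop with m = (/ value 2)
         for i below (/ degrees-of-freedom 2)
         for prob = (exp (- m)) then (* prob (/ m i))
         summing prob)
   1.0))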

Unfortunately, there’s a small problem with this straightforward implementation of fisher. While using REDUCE is a concise and idiomatic way of multiplying a list of numbers, in this particular application there’s a danger the product will be too small a number to be represented as a floating-point number. In that case, the result will underflow to zero. And if the product of the probabilities underflows, all bets are off because taking the LOG of zero will either signal an error or, in some implementations, result in a special negative infinity value, which will render all subsequent calculations essentially meaningless. This is particularly unfortunate in this function because the Fisher method is most sensitive when the input probabilities are low, near zero, and therefore in the most danger of causing the multiplication to underflow.
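You can see the danger at the REPL. The following product, for example, is mathematically 10^-2000, far below the smallest representable double-float, so in most implementations it quietly collapses to zero (others may signal a FLOATING-POINT-UNDERFLOW error):

(reduce #'* (make-list 200 :initial-element 1d-10)) ; => 0.0d0
;; ... at which point (log 0.0d0) either signals an error or returns
;; negative infinity, poisoning the rest of the computation.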

Luckily, you can use a bit of high-school math to avoid this problem. Recall that the log of a product is the same as the sum of the logs of the factors. So instead of multiplying all the probabilities and then taking the log, you can sum the logs of each probability. And since REDUCE takes a :key keyword parameter, you can use it to perform the whole calculation. Instead of this:

(log (reduce #'* probs))

write this:

(reduce #'+ probs :key #'log)
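With that change in place, fisher no longer multiplies the raw probabilities at all; a version incorporating the fix, which should otherwise behave the same as the one above, looks like this:

(defun fisher (probs number-of-probs)
  "The Fisher computation described by Robinson, summing the logs of the
probabilities to avoid floating-point underflow."
  (inverse-chi-square
   (* -2 (reduce #'+ probs :key #'log))
   (* 2 number-of-probs)))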

[11]Techniques that combine nonindependent probabilities as though they were, in fact, independent, are called naive Bayesian. Graham’s original proposal was essentially a naive Bayesian classifier with some “empirically derived” constant factors thrown in.
