Spam Filter Primer

Spam filtering is the practice of detecting and intercepting unwanted bulk mail, or "spam," before it reaches a recipient's mailbox. Generally, spam filters detect bulk mail by looking for certain phrases and known spammer IP addresses in incoming mail. However, because distributors of spam are increasingly innovative in their efforts to circumvent the filters that protect users' mailboxes, spam is a moving target, and developers of spam-filtering technology are constantly challenged to keep the bulk-mail onslaught at bay. Thus, to shield email users from spam effectively, a spam filter must be flexible. The Spam Xploder spam filter therefore enables users to personalize the filter by training it to detect and intercept mail that fits each user's particular preference and definition of spam.

The server-side Spam Xploder spam filter works in conjunction with the client-side (end user) Workspace Webmail interface. In essence, the end user uses the client-side interface to submit selected email messages for spam analysis. By analyzing the messages, the spam filter compiles information that enables it to detect and intercept spam. As more email messages are analyzed, the filter becomes increasingly adept at intercepting mail that this particular user considers spam.

On the server end, the Spam Xploder spam filter works as follows:

When the mail program (i.e., Workspace Webmail) receives an email message, a connection to the Spam Xploder server is established. The Mail Transfer Agent accepts these incoming connections and receives the incoming email message. Messages enter the system and are handed to the spam filter.
The spam filter first strips out the "From:" and "Reply-to:" addresses from the message. These addresses are then compared to the user-defined allowed list of known good senders. If the address is allowed, filtering is complete and the message is delivered to the user's inbox without further ado.
If the sender is not allowed, the addresses are compared to the user-defined blocked list. If an address is blocked, that message is treated as spam and delivered to the "Bulk Mail" (or equivalent) folder of the user's email program. That completes the filtering process for that particular message.
If the sender's addresses appear on neither the allowed list nor the blocked list, the message is subjected to the spam filter's statistical filtering analysis. The statistical filter uses a rigorous Bayesian analysis to determine whether a message is spam.
The statistical filter starts the dissection by breaking the message content into a list of unique tokens. A token is a word or any string of identifiable characters, such as a dollar amount or an HTML tag. Once a complete list of tokens has been generated for a message, the list is analyzed.
The analysis relies on two datasets. The first is the user dataset, which evolves as a user trains the filter. The user dataset is a personalized list of tokens compiled from the actual email that a user has received. The second dataset is the general dataset, which is a list of tokens intended to represent the average user. Each list entry in the user and general datasets consists of a token and an indication of the probability that an email message containing the token is bulk mail.
The analysis consists of comparing tokens found in the message to the user and general datasets. The user data is searched first, and the general data is searched only when a token cannot be found in the user data. If a token from the message is found in either dataset, the probability score for that token is noted. The probability score for a token indicates the probability that a message containing that token is spam. The overall probability is a statistical score that is calculated over the entire email message. The result of the calculation is a number between 0 and 99. The higher the number, the higher the probability that the message is spam.
Once a complete list of probabilities has been compiled, the probabilities are used to calculate the message spam score. This score reflects the overall probability that the message is spam.
Messages determined to be spam by the statistical filter are dropped into the "Bulk Mail" (or equivalent) folder of the user's mailbox. Messages that are not considered spam are sent to the user's "Inbox." At this point, message filtering is complete.
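
The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Spam Xploder implementation. The token pattern, the function names, the formula used to combine token probabilities (a common naive Bayes combination), the neutral score of 50 for messages with no known tokens, and the spam threshold of 90 are all hypothetical choices that the description above does not specify. The sketch also assumes that a match on either the "From:" or the "Reply-to:" address is enough for a list decision.

```python
import math
import re

# Hypothetical token pattern: dollar amounts, HTML tags, or plain words.
TOKEN_RE = re.compile(r"\$\d+(?:\.\d+)?|</?[a-zA-Z][^>]*>|[A-Za-z0-9'-]+")


def tokenize(body):
    """Break message content into a set of unique, lower-cased tokens."""
    return {t.lower() for t in TOKEN_RE.findall(body)}


def token_probabilities(tokens, user_data, general_data):
    """Look each token up in the user dataset first, then the general one."""
    probs = []
    for token in tokens:
        if token in user_data:
            probs.append(user_data[token])
        elif token in general_data:
            probs.append(general_data[token])
        # Assumption: tokens found in neither dataset are ignored.
    return probs


def spam_score(probs):
    """Combine per-token spam probabilities into a score from 0 to 99.

    Assumed formula: P = prod(p) / (prod(p) + prod(1 - p)),
    evaluated in log space to avoid underflow on long messages.
    """
    if not probs:
        return 50  # assumption: no evidence either way
    clamped = [min(max(p, 0.01), 0.99) for p in probs]
    log_spam = sum(math.log(p) for p in clamped)
    log_good = sum(math.log(1.0 - p) for p in clamped)
    diff = min(log_good - log_spam, 700.0)  # cap to keep exp() in range
    combined = 1.0 / (1.0 + math.exp(diff))
    return min(99, int(combined * 100))


def classify(sender_addresses, body, allowed, blocked,
             user_data, general_data, threshold=90):
    """Return the destination folder, checking lists before statistics."""
    addresses = {a.lower() for a in sender_addresses}
    if addresses & allowed:   # known good sender: deliver immediately
        return "Inbox"
    if addresses & blocked:   # known bad sender: treat as spam
        return "Bulk Mail"
    probs = token_probabilities(tokenize(body), user_data, general_data)
    return "Bulk Mail" if spam_score(probs) >= threshold else "Inbox"
```

Here user_data and general_data are plain dictionaries that map a token to its spam probability (a value between 0 and 1), and allowed and blocked are sets of lower-cased addresses. Because the allowed list is checked first, a message from a known good sender reaches the inbox even if its content would score as spam.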

Training the spam filter is the process of submitting email messages for spam analysis, thus gradually increasing the "intelligence" of the spam filter. As the spam filter compiles data, it becomes increasingly adept at detecting incoming spam.

In training the spam filter, the user can mark a message as either "spam" or "not spam." The process then proceeds thus:

Messages selected for training are passed to the server, along with a "flag" indicating whether each message should be considered spam or "good" mail. These messages undergo a content analysis similar to that performed by the statistical filter. The message is broken into tokens. Tokens are added to a list and counted. The list of tokens and counts is then analyzed.
The analysis consists of comparing tokens found in the message against the user dataset. Each token is searched for in the user data. If a token from the message is found in the user dataset, the previous spam and good mail counts for that token are retrieved. The counts are updated based on the "spam"/"not spam" flag, and the new spam probability is calculated for the token.
If the token is not found, a new record is added to the user dataset for the token, and the spam probability is calculated.
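
The training update described above might look like the following Python sketch. It is self-contained and illustrative only. The TokenRecord class, the train function, and in particular the probability formula (the spam count divided by the total count) are assumptions, since the text does not say how the new spam probability is derived from the counts.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TokenRecord:
    """Per-token counts kept in the user dataset (hypothetical layout)."""
    spam_count: int = 0
    good_count: int = 0

    @property
    def spam_probability(self):
        # Assumed formula: share of this token's occurrences seen in spam.
        total = self.spam_count + self.good_count
        return self.spam_count / total if total else 0.5


def train(user_data, tokens, is_spam):
    """Update the user dataset from one message flagged by the user."""
    for token, count in Counter(tokens).items():
        # Retrieve the existing record, or add a new one for an unseen token.
        record = user_data.setdefault(token, TokenRecord())
        if is_spam:
            record.spam_count += count
        else:
            record.good_count += count
```

For example, after train(user_data, ["free", "free", "offer"], True) and train(user_data, ["offer", "meeting"], False), the token "free" has been seen only in spam, "meeting" only in good mail, and "offer" once in each. Storing counts and deriving the probability from them means that each new training message refines the score without reprocessing earlier mail.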

The spam filter's user data evolves and grows as more messages are analyzed. More tokens are added, and the probability scores are refined until the user has a well-defined set of personalized tokens commonly found in his/her incoming bulk and "good" mail. This adaptive scoring ensures that each user has a different definition of spam and good mail, thus making it very difficult to distribute mass mailings that evade the recipients' individually configured spam filters.

This personalized, adaptive approach guarantees fewer misclassifications of mail, as each user teaches the system his/her personal definition of what constitutes spam and good mail.

Bayes, Thomas
(b. 1702, London - d. 1761, Tunbridge Wells, Kent) Nonconformist theologian and mathematician who first used probability inductively and established a mathematical basis for probability inference (a means of calculating, from the number of times an event has occurred in prior trials, the probability that it will occur in future trials). He set down his findings on probability in "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published posthumously in the Philosophical Transactions of the Royal Society of London. The only works he is known to have published in his lifetime are Divine Benevolence, or an Attempt to Prove That the Principal End of the Divine Providence and Government is the Happiness of His Creatures (1731) and An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of the Analyst (1736), which countered attacks by Bishop Berkeley on the logical foundations of Newton's calculus.