How the Workspace Webmail Spam Filter Works: A Primer
Introduction
The Spam Xploder spam filter, which is an integrated feature of the
Workspace Webmail mail client, is a service that screens incoming mail at the
server level. Through the Workspace Webmail interface the user can
train the filter, thus gradually improving the filter's ability to detect
incoming bulk mail.
About Spam Filtering
Spam filtering is the practice of detecting and intercepting unwanted bulk
mail, or "spam," before it reaches a recipient's mailbox. Generally, spam
filters detect bulk mail by looking for certain phrases and known spammer IP
addresses in incoming mail. However, because distributors of spam keep finding
new ways to circumvent the filters that protect email users' mailboxes, spam is
a moving target, and developers of spam-filtering technology are constantly
challenged to keep the bulk-mail onslaught at bay. To shield email users from
spam effectively, a spam filter must therefore be flexible. The Spam Xploder
spam filter enables each user to personalize the filter by training it to
detect and intercept mail that fits that user's own definition of spam.
How the Filter Functions
The server-side Spam Xploder spam filter works in conjunction with the
client-side (end user) Workspace Webmail interface. In essence, the end user
uses the client-side interface to submit selected email messages for spam
analysis. By analyzing these messages, the spam filter compiles information
that enables it to detect and intercept spam. As more messages are analyzed,
the filter becomes increasingly adept at intercepting mail that this
particular user considers spam.
On the server end, the Spam Xploder spam filter works as follows:
-
When the mail program (i.e., Workspace Webmail) receives an email message, a
connection to the Spam Xploder server is established. The Mail Transfer Agent
accepts the incoming connection, receives the message, and hands it to the
spam filter.
-
The spam filter first extracts the "From:" and "Reply-To:" addresses from the
message. These addresses are then compared to the user-defined allowed list of
known good senders. If an address is on the allowed list, filtering is complete
and the message is delivered to the user's inbox.
-
If the sender is not allowed, the addresses are compared to the
user-defined blocked list. If an address is blocked, that message is treated
as spam and delivered to the "Bulk Mail" (or equivalent) folder of the user's
email program. That completes the filtering process for that particular
message.
-
If the sender's addresses are on neither the allowed list nor the blocked list,
the message is subjected to the spam filter's statistical analysis. The
statistical filter uses Bayesian analysis to determine whether a message is
spam.
-
The statistical filter starts by breaking the message content into a list of
unique tokens. A token is a word or any other identifiable string of
characters, such as a dollar amount or an HTML tag. Once a complete list of
tokens has been generated for a message, the list is analyzed.
-
The analysis relies on two datasets. The first is the user dataset, which
evolves as a user trains the filter. The user dataset is a personalized list of
tokens compiled from the actual email that a user has received. The second
dataset is the general dataset, which is a list of tokens intended to represent
the average user. Each list entry in the user and general datasets consists of
a token and an indication of the probability that an email message containing
the token is bulk mail.
-
The analysis consists of comparing tokens found in the message to the user and
general datasets. The user data is searched first, and the general data is
searched only when a token cannot be found in the user data. If a token from
the message is found in either dataset, the probability score for that token is
noted. The probability score for a token indicates the probability that a
message containing that token is spam. The overall probability is a statistical
score that is calculated over the entire email message. The result of the
calculation is a number between 0 and 99; the higher the number, the higher
the probability that the message is spam.
-
Once a complete list of probabilities has been compiled, the probabilities are
used to calculate the message spam score. This score reflects the overall
probability that the message is spam.
-
Messages determined to be spam by the statistical filter are dropped into the
"Bulk Mail" (or equivalent) folder of the user's mailbox. Messages that are not
considered spam are sent to the user's "Inbox." At this point, message
filtering is complete.
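The filtering steps above can be sketched in Python. This is a minimal illustration, not Spam Xploder's actual implementation. The function names, the token pattern, the 0.01 to 0.99 clamp, the neutral score of 50, the threshold of 90, and the naive-Bayes combining formula are all assumptions. The description above specifies only the order of the checks, the user-then-general dataset lookup, and the 0-to-99 score range.

```python
import math
import re

# Assumed token pattern: HTML tags, dollar amounts, then plain words.
TOKEN_RE = re.compile(r"</?[A-Za-z][^>]*>|\$\d+(?:\.\d+)?|[A-Za-z0-9'-]+")


def tokenize(body):
    """Break message content into a set of unique, lowercased tokens."""
    return {token.lower() for token in TOKEN_RE.findall(body)}


def token_probability(token, user_data, general_data):
    """Search the user dataset first, then the general dataset."""
    if token in user_data:
        return user_data[token]
    return general_data.get(token)  # None if the token is in neither dataset


def spam_score(probabilities):
    """Combine per-token probabilities into a score from 0 to 99.

    Assumption: a naive-Bayes combination computed in log space.
    """
    if not probabilities:
        return 50  # assumed neutral score when no token is known
    # Clamp so that log(0) can never occur.
    clamped = [min(0.99, max(0.01, p)) for p in probabilities]
    log_spam = sum(math.log(p) for p in clamped)
    log_good = sum(math.log(1.0 - p) for p in clamped)
    diff = log_good - log_spam
    # Guard against overflow in exp() for overwhelmingly "good" messages.
    combined = 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))
    return min(99, int(combined * 100))


def classify(addresses, body, allowed, blocked, user_data, general_data,
             threshold=90):
    """Return the destination folder for one message.

    `addresses` holds the From and Reply-To addresses; `allowed` and
    `blocked` are sets of lowercase addresses.
    """
    senders = {address.lower() for address in addresses}
    if senders & allowed:   # the allowed list is checked first
        return "Inbox"
    if senders & blocked:   # then the blocked list
        return "Bulk Mail"
    probabilities = []
    for token in tokenize(body):
        p = token_probability(token, user_data, general_data)
        if p is not None:
            probabilities.append(p)
    score = spam_score(probabilities)
    return "Bulk Mail" if score >= threshold else "Inbox"
```

For example, with a user dataset of `{"viagra": 0.99}` and a general dataset of `{"free": 0.9}`, a message from an unlisted sender with the body "free viagra" scores 99 and goes to "Bulk Mail". The same message from an allowed sender goes to "Inbox" without being scored.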
Training the spam filter is the process of submitting email messages for
spam analysis, thus gradually increasing the "intelligence" of the spam filter.
As the spam filter compiles data, it becomes increasingly adept at detecting
incoming spam.
In training the spam filter, the user can mark a message as either "spam" or
"not spam." The process then proceeds as follows:
-
Messages selected for training are passed to the server, along with a "flag"
indicating whether the message should be considered spam or "good" mail. These
messages undergo a content analysis similar to that of the statistical filter.
The message is broken into tokens. Tokens are added to a list and counted. The
list of tokens and counts is then analyzed.
-
The analysis consists of comparing tokens found in the message against the user
dataset. Each token is searched for in the user data. If a token from the
message is found in the user dataset, the previous spam and good mail counts
for that token are retrieved. The counts are updated based on the "spam"/"not
spam" flag, and the new spam probability is calculated for the token.
-
If the token is not found, a new record is added to the user dataset for the
token, and the spam probability is calculated.
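The training update described above might look like the following sketch. The function name `train`, the `[spam_count, good_count]` record layout, and the probability formula (the share of a token's occurrences that were seen in spam) are assumptions made for illustration. The description above states only that counts are updated according to the flag and that a new probability is then calculated.

```python
from collections import Counter


def train(tokens, is_spam, user_counts):
    """Update spam/good counts for each token and recompute probabilities.

    `user_counts` maps token -> [spam_count, good_count] and is modified
    in place. Returns a dict of token -> updated spam probability.
    """
    updated = {}
    for token, occurrences in Counter(tokens).items():
        # A token not yet in the user dataset gets a new record.
        counts = user_counts.setdefault(token, [0, 0])
        counts[0 if is_spam else 1] += occurrences
        spam_count, good_count = counts
        # Assumed formula: share of this token's occurrences seen in spam.
        updated[token] = spam_count / (spam_count + good_count)
    return updated
```

Marking a message with tokens `["free", "free", "offer"]` as spam and then a message with tokens `["free", "meeting"]` as not spam leaves "free" with counts of 2 spam and 1 good, so its probability under this assumed formula is 2/3.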
The spam filter's user data evolves and grows as more messages are analyzed.
More tokens are added, and the probability scores are refined until the user
has a well-defined set of personalized tokens commonly found in their
incoming bulk and "good" mail. Because of this adaptive scoring, each user has
a different definition of spam and good mail, which makes it very difficult to
distribute mass mailings that evade the recipients' individually configured
spam filters.
This personalized, adaptive approach reduces misclassifications of mail, as
each user teaches the system their personal definition of what constitutes
spam and good mail.
Bayes, Thomas
(b. 1702, London - d. 1761, Tunbridge Wells, Kent) Nonconformist theologian and
mathematician who first used probability inductively and established a
mathematical basis for probability inference (a means of calculating, from the
number of times an event has not occurred, the probability that it will occur
in future trials). He set down his findings on probability in "Essay Towards
Solving a Problem in the Doctrine of Chances" (1763), published posthumously in
the Philosophical Transactions of the Royal Society of London. The only works
he is known to have published in his lifetime are Divine Benevolence, or an
Attempt to Prove That the Principal End of the Divine Providence and Government
is the Happiness of His Creatures (1731) and An Introduction to the Doctrine of
Fluxions, and a Defence of the Mathematicians Against the Objections of the
Author of the Analyst (1736) which countered attacks by Bishop Berkeley on the
logical foundations of Newton's calculus.
Source: Encyclopædia Britannica
Bayesian:
being, relating to, or concerned with a theory (as of decision making or
statistical inference) involving the application of Bayes' theorem and the use
of probabilities based on prior knowledge and accumulated experience
<bayesian probability models>.
Source: Merriam-Webster
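For readers who want the underlying formula, Bayes' theorem can be written for a single token as shown below. Here S stands for "the message is spam" and T for "the message contains the token." These symbols are chosen for illustration only, and the exact formula that Spam Xploder uses is not stated in this primer.

```latex
P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T \mid S)\,P(S) + P(T \mid \neg S)\,P(\neg S)}
```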