Overview of HTML Filtering

For each HTML section of an e-mail, the HTML filter processes text outside the angle brackets of an HTML tag as before, and processes text within the angle brackets as follows. The filter first checks whether the tag is one of the features it has been configured to search for. If it is, the filter increments its count of features found. The e-mail is considered spam if the number of HTML features found equals the number configured for that feature's found count.

HTML filtering is part of content filtering, but is used only on HTML portions of a message. The individual components of HTML filtering are discussed below.

Types of HTML Filtering:

HTML Parser

The HTML parser is always used on HTML messages. The parser decodes the HTML code and tags until the text appears as it would when the message is opened. The parser then sends the text on to be processed by statistical and phrase filtering to determine if it is spam.
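The decoding step can be illustrated with a minimal sketch in Python. This is not IMail Server's parser; it uses the standard library's html.parser module, and the names TextExtractor and visible_text are hypothetical. It drops tags and comments and decodes character references, so a phrase broken up by markup becomes readable again. As a simplification, it does not skip the contents of script or style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text outside of tags, ignoring tags and comments."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#101; into "e".
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with markup removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.parts)


# A phrase disguised with a comment, an empty tag pair, and a character reference.
disguised = "Fr<!-- x -->ee m<b></b>on&#101;y"
print(visible_text(disguised))
```

The recovered text is what would then be handed to the statistical and phrase filters.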

HTML Feature Filtering

HTML feature filtering lets you define certain HTML tags as spam indicators. The HTML features are Nested Table, Hyperlink, Script Tag, Invalid Tag, Image Tag, Mailto Hyperlink, Deceptive URL, and Embedded Comment. If a message contains the configured number of these HTML features, it is identified as spam.
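The counting logic might look roughly like the following Python sketch. It is an illustration only, not the product's implementation: the class FeatureCounter, the function is_spam, and the feature key names are hypothetical, and only a subset of the features is detected (Invalid Tag and Deceptive URL are omitted). It also assumes that a mailto link is counted separately from an ordinary hyperlink, and it tests the final count with >= on the assumption that a count which passes the configured number must have equaled it at some point during the scan.

```python
from collections import Counter
from html.parser import HTMLParser


class FeatureCounter(HTMLParser):
    """Counts a subset of HTML features while scanning a message body."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.table_depth = 0  # how many <table> elements are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            href = (attrs["href"] or "").strip().lower()
            if href.startswith("mailto:"):
                self.counts["mailto_hyperlink"] += 1
            else:
                self.counts["hyperlink"] += 1
        elif tag == "img":
            self.counts["image_tag"] += 1
        elif tag == "script":
            self.counts["script_tag"] += 1
        elif tag == "table":
            # A table opened while another is still open is a nested table.
            if self.table_depth > 0:
                self.counts["nested_table"] += 1
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_comment(self, data):
        self.counts["embedded_comment"] += 1


def is_spam(html, configured_counts):
    """True if any configured feature was found at least its configured number of times."""
    counter = FeatureCounter()
    counter.feed(html)
    counter.close()
    return any(counter.counts[name] >= limit
               for name, limit in configured_counts.items())


sample = ('<table><tr><td><table></table></td></tr></table>'
          '<a href="http://example.com">x</a><!-- a --><!-- b -->')
print(is_spam(sample, {"embedded_comment": 2}))
print(is_spam(sample, {"hyperlink": 3}))
```

In this sketch the configured counts are expected to be positive integers, one per feature that should be checked.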

URL Domain Black List

The URL Domain Black List is a configurable list of domain names that are known to send spam. IMail Server extracts the primary domain from each HTTP link in a message and checks whether it appears in the URL Domain Black List. It finds these links by looking at the domains used in HREF and IMG SRC tags in the HTML code. If the primary domain matches any of the domain names in the URL Domain Black List, the e-mail is considered spam and the appropriate spam action is taken.
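The lookup described above can be sketched in Python as follows. This is an assumption-laden illustration, not IMail Server's code: the names primary_domain, LinkCollector, and blacklisted_domains are hypothetical, and "primary domain" is approximated as the last two labels of the host name, which is wrong for suffixes such as co.uk. Only http and https links are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def primary_domain(url):
    """Return the last two host labels of an http(s) URL, or None."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:  # malformed URL, for example a broken IPv6 literal
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    labels = host.rstrip(".").split(".")
    # Assumption: the primary domain is simply the last two labels.
    return ".".join(labels[-2:])


class LinkCollector(HTMLParser):
    """Collects URLs from <a href=...> and <img src=...>."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def blacklisted_domains(html, blacklist):
    """Return the sorted primary domains in the message that are on the blacklist."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = {primary_domain(url) for url in collector.urls}
    return sorted(d for d in found if d in blacklist)


message = ('<a href="http://www.spam.example/buy">x</a>'
           '<img src="http://cdn.ok.test/a.png">')
print(blacklisted_domains(message, {"spam.example"}))
```

A non-empty result would correspond to the case where the e-mail is treated as spam and the configured spam action is taken.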

Why do I need HTML Filtering? Why don't the Phrase and Statistical Filters Work?

Spammers use a variety of techniques to get around antispam programs that filter on words. The primary technique is to disguise the message text in HTML e-mail so that it does not look like text. If a word does not look like a word, the phrase and statistical filters cannot determine whether it indicates spam. The HTML filter component solves this problem by decoding the HTML code to reveal the text, which is then passed on to the statistical filter for word analysis.