Statistical Filter Options (Content Filtering)

Use Statistical Filtering to create and maintain the exclude list for a mail domain, specify the action to take when spam is identified, and specify whether to use the primary mail domain's word counts or create new ones.

Statistical Filtering uses the Bayesian spam filtering technique to calculate the probability that a message is spam based on its contents. Each word in an e-mail message is examined and evaluated according to how often the word appears in spam and non-spam e-mail. The entire message is then evaluated based on all of the word values to determine whether it is likely to be spam.
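
As a rough illustration of how individual word values can be turned into one score for the whole message, the sketch below uses the combining rule common to Bayesian spam filters. This page does not document the exact formula IMail Server uses, so the function name and the formula are assumptions made for illustration only.

```python
from math import prod


def combine_word_probabilities(word_probs):
    """Combine per-word spam probabilities (each between 0 and 1) into one score.

    Assumption: naive Bayes combining rule, not necessarily IMail Server's formula.
    """
    spam_evidence = prod(word_probs)                  # all words point to spam
    ham_evidence = prod(1.0 - p for p in word_probs)  # all words point to non-spam
    total = spam_evidence + ham_evidence
    if total == 0.0:
        # Contradictory certainties (for example 1.0 and 0.0): treat as undecided.
        return 0.5
    return spam_evidence / total


# Two words that each look 90% spammy reinforce each other.
print(round(combine_word_probabilities([0.9, 0.9]), 3))
```

Words near 50% contribute almost nothing to the result, which is why filters of this kind focus on the words that deviate most from 50%.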

Domain: Shows the currently selected domain. From the drop-down list, you can pick any of the domains available to this administrative user account.

Antispam Table To Use: Set the following options to configure statistical filtering.

No Filtering. Disables statistical filtering for the domain.
Current Domain (selected by default). Select this option to define statistical filtering settings specific to the current mail domain.
Primary Domain (default for non-primary domains; not available for primary domains). Select this option to use the primary mail domain's statistical filtering settings instead of creating new settings for the current mail domain.
Note: The exclude table is not covered by the Antispam Table To Use drop-down selection.

Exclude the following words from Statistical Analysis:

Add. Click Add to add a new word or phrase to the exclude list for the current domain.
Edit. Click a word or phrase, then click Edit to modify it.
Delete. Select the word or phrase that you want to remove from the domain, then click Delete.
Click the sort controls to sort the word list.

If the word list has multiple pages, you can use the page navigation control that appears below the list.

Action taken on e-mail determined to be spam:

Action:
- Delete. Immediately deletes the message.
- Forward to Address. Forwards the message to an e-mail address entered in the text box to the right of this option. By default, messages are sent to the root address and stored in a mailbox called "bulk".
- Insert X-Header (default). Inserts an X-Header into the message indicating that the message was identified as spam by statistical filtering. For more information, see Spam X-Header Explanations.
- Move to Mailbox. Moves the message to the user's mailbox specified in the text box to the right of this option. If the mailbox does not exist, it is created.
- None. No action is performed on messages identified as spam by the statistical filter.

Tip: We recommend that you select the Insert X-Header option instead of Delete until you know that the antispam options are set up correctly.

Note: For more spam options see Using Delivery Rules to Filter Spam.

Prefix subject with. If selected, the subject of a message that is identified as spam by the statistical filter is modified to begin with X-IMail-Spam-Statistical.

The advanced options below control the underlying functionality of the statistical filtering feature and are dependent upon each other to effectively identify spam. If a significant number of legitimate messages are being identified as spam (false positives), or spam is being identified as legitimate, you may need to adjust these options.

Note: The default settings are appropriate for most systems. We strongly advise that ONLY experienced administrators modify these settings. Setting these options too high or too low could hinder IMail Server's ability to identify spam.

Advanced Options

The probability a new word is spam (default value is 40%). The percentage assigned to new words to determine if they are spam. Enter a value between 0 and 100%.

The higher the value, the more likely a new word will be treated as if it had previously appeared in spam e-mail messages. The lower the value, the more likely a new word will be treated as if it had previously appeared in non-spam e-mail messages. For example, if you enter 0, every new word will be treated as if it were non-spam. If you enter 100%, every new word will be treated as if it were spam.

We recommend that this value not be set higher than 40%. Setting this option at 40% or less biases the statistical analysis toward treating new words as legitimate e-mail, thereby reducing the likelihood of a false positive.
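
The fallback behavior for new words can be sketched as a simple lookup with a default. The function name, the dictionary of known word probabilities, and the parameter name are hypothetical and do not reflect how IMail Server stores its word table.

```python
def word_spam_probability(word, known_probabilities, new_word_probability=0.40):
    """Return the spam probability for a word, falling back to the new-word value.

    known_probabilities: hypothetical dict mapping words already seen to a
    probability between 0 and 1. Words not in the dict count as new words.
    """
    return known_probabilities.get(word, new_word_probability)


known = {"lottery": 0.97, "meeting": 0.04}
print(word_spam_probability("lottery", known))    # previously seen word
print(word_spam_probability("stop", known))       # new word, default 40%
print(word_spam_probability("stop", known, 0.20)) # new word, option set to 20%
```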

Example: If this option is set to 20%, a new word will be treated as having appeared in spam e-mails 20% of the time and as having appeared in non-spam e-mails 80% of the time.

An e-mail is spam when its calculated probability exceeds (default value is 90%). The closer the value is to 100%, the less likely it is that spam will be caught. The closer the value is to 0, the greater the probability that you will have false positives. Enter a value between 0 and 100%.

This option sets the minimum probability percentage at which a message will be identified as spam. Messages with probability values below the value entered are identified as non-spam. Messages with probability values above this value are identified as spam.

Example: Suppose this option is set to 80%. If an e-mail message is processed and the combined probability for all of the word values within it is 60%, then this message is identified as non-spam because it does not meet the probability benchmark of 80%.
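
The threshold comparison in the example above amounts to a single check. In this minimal sketch the function name is hypothetical, and treating a score exactly equal to the threshold as non-spam is an assumption based on the word "exceeds" in the option name.

```python
def is_spam(message_probability, threshold=0.90):
    """Classify a message from its combined probability (both values between 0 and 1).

    Assumption: the score must be strictly greater than the threshold.
    """
    return message_probability > threshold


print(is_spam(0.60, threshold=0.80))  # the 60% message from the example
print(is_spam(0.95))                  # above the default 90% threshold
```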

Example: If the word "stop" appears in an e-mail for the first time, it is considered a new word and assigned a probability of 40% (the probability a new word is spam). If "An e-mail is spam when its calculated probability exceeds" is set to 90%, then "stop" is not considered to be spam. For "stop" to be considered spam, its probability would have to increase from 40% to more than 90%.

Maximum number of words used when calculating probability (default value is 15). The number of individual word values, within each e-mail, used to calculate the probability that an e-mail is spam. You can enter any value in this text box; however, entering anything above 25 may have unpredictable results.

Each word within an e-mail is assigned two word counts: the number of times the word has occurred in spam, and the number of times the word has occurred in non-spam. From these values, a spam probability is computed for the word. This setting limits the calculation to the words whose probabilities deviate most from that of an average word. The selected words can be both spam and non-spam words.

Example: Suppose this option is set to 15. Because most words have an average spam probability of 50% (50% likely to be spam, 50% likely to be non-spam), the fifteen words whose probabilities are farthest from 50% are used. So if a word has a spam probability of 5%, it will most likely be used. Likewise, if a word has a spam probability of 90%, it will most likely be used. A word that has a 45% probability will most likely not be used.
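
The selection described in this example can be sketched as a sort by distance from 50%. The function name and the sample word probabilities are made up for illustration. How IMail Server breaks ties between equally distant words is not stated here.

```python
def most_significant_words(word_probs, max_words=15):
    """Return the (word, probability) pairs farthest from 0.5, most extreme first.

    word_probs: hypothetical dict mapping each word in the message to its
    spam probability (between 0 and 1).
    """
    ranked = sorted(
        word_probs.items(),
        key=lambda item: abs(item[1] - 0.5),  # distance from an "average" word
        reverse=True,
    )
    return ranked[:max_words]


sample = {"free": 0.90, "meeting": 0.05, "the": 0.45, "offer": 0.80}
# With a limit of 2, the 5% and 90% words are kept and the 45% word is dropped.
print(most_significant_words(sample, max_words=2))
```

Sorting the words of every message is also where the extra processing time comes from when this option is set to a larger value.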

Note: The value for the Maximum number of words used when calculating probability can greatly affect the performance of statistical filtering. The greater the value, the more time is spent determining which words to evaluate within a message. Thus, statistical filtering takes longer to calculate the e-mail probability and mail processing takes longer.

Related Topics

About Statistical Filtering

Creating Separate antispam-table.txt Files for Multiple Email Domains

Installing Updated phrase.txt File

Setting Premium Filter Antispam Options