Statistical Filter Advanced Options

Advanced Options

The probability a new word is spam (default value is 40%). The spam probability assigned to a new word, that is, a word the filter has not encountered before. Enter a value between 0 and 100%.

The higher the value, the more likely a new word will be treated as if it had previously appeared in spam e-mail messages. The lower the value, the more likely a new word will be treated as if it had previously appeared in non-spam e-mail messages. For example, if you enter 0, every new word will be treated as non-spam. If you enter 100%, every new word will be treated as spam.

We recommend that this value not be set higher than 40%. Setting this option at 40% or less biases the statistical analysis toward treating a message as legitimate e-mail, thereby reducing the likelihood of a false positive.

Example: If this option is set to 20%, a new word will be treated as having appeared in spam e-mails 20% of the time and as having appeared in non-spam e-mails 80% of the time.
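
The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the product's actual code: the function name, the dictionary of known words, and the choice to ignore letter case are all assumptions made for the example.

```python
# Illustrative sketch only; names and data are hypothetical.
NEW_WORD_SPAM_PROBABILITY = 0.20  # the option value from the example, as a fraction


def word_spam_probability(word, known_probabilities,
                          new_word_probability=NEW_WORD_SPAM_PROBABILITY):
    """Return the stored spam probability for a word.

    A word that has never been seen before gets the configured
    "probability a new word is spam" value instead.
    """
    # Assumption: words are compared without regard to letter case.
    return known_probabilities.get(word.lower(), new_word_probability)


# Hypothetical probabilities learned from earlier mail.
known = {"lottery": 0.97, "meeting": 0.03}

print(word_spam_probability("meeting", known))  # a known word keeps its learned value
print(word_spam_probability("stop", known))     # a new word gets the configured default
```

With the option at 20%, the new word is treated as 20% likely to indicate spam and 80% likely to indicate non-spam, which matches the example above.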

An e-mail is spam when its calculated probability exceeds (default value is 90%). The closer the value is to 100%, the less likely it is that spam will be caught. The closer the value is to 0, the more likely it is that you will have false positives. Enter a value between 0 and 100%.

This option sets the minimum probability percentage at which a message will be identified as spam. Messages with probability values below the value entered are identified as non-spam. Messages with probability values above this value are identified as spam.

Example: Suppose this option is set to 80%. If an e-mail message is processed and the combined probability for all of the word values within it is 60%, then this message is identified as non-spam because it does not meet the probability benchmark of 80%.
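
As a rough sketch, the comparison in this example could look like the following Python. The function name is hypothetical, and the handling of a message whose probability exactly equals the setting is an assumption: the sketch uses a strict "greater than" test because the option name says "exceeds".

```python
# Illustrative sketch only; the function name is hypothetical.
def is_spam(message_probability, threshold=0.90):
    """Classify a message by comparing its combined probability to the setting.

    Both values are fractions between 0 and 1. Assumption: a message is
    spam only when its probability is strictly above the threshold.
    """
    return message_probability > threshold


print(is_spam(0.60, threshold=0.80))  # the 60% message from the example
print(is_spam(0.95, threshold=0.80))  # a message above the 80% benchmark
```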

Example: If the word "stop" appears in an e-mail for the first time, it is considered a new word and assigned a probability of 40% (the value of the "probability a new word is spam" option). If the "An e-mail is spam when its calculated probability exceeds" option is set to 90%, then "stop" is not considered to be spam. In order for "stop" to be considered spam, its probability would have to increase from 40% to more than 90%.

Maximum number of words used when calculating probability (default value is 15). The number of individual word values, within each e-mail, used to calculate the probability that an e-mail is spam. You can enter any value in this text box; however, entering anything above 25 may have unpredictable results.

Each word within an e-mail is assigned two word counts: the number of times the word has occurred in spam, and the number of times the word has occurred in non-spam. From these values, a spam probability is computed for the word. This setting limits the calculation to the words whose probabilities deviate most from that of an average word. The words selected can include both spam words and non-spam words.
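
This page does not give the formula used to turn the two counts into a probability. The Python sketch below shows one simple possibility, the share of a word's occurrences that were in spam, purely as an assumption for illustration. The function name and the fallback for a word with no recorded occurrences are also hypothetical.

```python
# Illustrative sketch only; the real formula is not documented here.
def spam_probability_from_counts(spam_count, non_spam_count,
                                 new_word_probability=0.40):
    """Estimate a word's spam probability from its two word counts.

    Assumption: the probability is the fraction of the word's
    occurrences that were in spam. A word with no occurrences at all
    is treated as a new word and gets the configured default.
    """
    total = spam_count + non_spam_count
    if total == 0:
        return new_word_probability
    return spam_count / total


print(spam_probability_from_counts(9, 1))   # seen mostly in spam
print(spam_probability_from_counts(1, 19))  # seen mostly in non-spam
print(spam_probability_from_counts(0, 0))   # never seen: new-word default
```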

Example: Suppose this option is set to 15. Because an average word has a spam probability of 50% (50% likely to be spam, 50% likely to be non-spam), the fifteen words whose probabilities are farthest from 50% are used. So if a word has a spam probability of 5%, it will most likely be used. Likewise, if a word has a spam probability of 90%, it will most likely be used. A word that has a 45% probability will most likely not be used.
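
The selection step in this example can be sketched in Python as follows. The function name and the sample words are hypothetical, and the sketch covers only the choice of words, not how their probabilities are then combined into a probability for the whole message.

```python
# Illustrative sketch only; names and data are hypothetical.
def most_significant_words(word_probabilities, max_words=15):
    """Pick the words whose spam probabilities are farthest from 50%.

    word_probabilities maps each word in a message to its spam
    probability (a fraction between 0 and 1). At most max_words
    entries are returned, most extreme first.
    """
    ranked = sorted(
        word_probabilities.items(),
        key=lambda item: abs(item[1] - 0.5),  # distance from an average word
        reverse=True,
    )
    return ranked[:max_words]


# Hypothetical message with four distinct words.
probabilities = {"free": 0.90, "agenda": 0.05, "today": 0.45, "offer": 0.80}

selected = most_significant_words(probabilities, max_words=2)
print([word for word, _ in selected])
```

Here the words at 5% and 90% are kept, while the word at 45% is dropped because it is too close to 50% to say much either way.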

Note: The value for the Maximum number of words used when calculating probability can greatly affect the performance of statistical filtering. The greater the value, the more time is spent determining which words to evaluate within a message. As a result, statistical filtering takes longer to calculate each e-mail's probability, which slows mail processing.