"How to prevent spam" new FAQ entry

Alessandro Delgado Sun, 02 Dec 2007 22:06:07 -0800

I'd like to improve this text, but I'm not sure I fully understand the concept 
of "training" the mailet.


Any suggestions/explanations would be gladly received! ;D

Thanks, Alessandro

--
DELGADO, Alessandro

adelgado1313 [at] gmail [dot] com

Rio de Janeiro, RJ - BRASIL


How to prevent Spam?

There are multiple ways to prevent spam with James, the most popular being the 
BayesianAnalysis mailet. It scans each message and computes the probability 
that it is spam, using [Bayesian probability theory techniques|1].

The analysis capabilities are based on token frequencies (the corpus) learned 
through a training process using the BayesianAnalysisFeeder mailet and stored 
in a JDBC database. During mailet initialization the corpus is loaded from the 
database and kept in memory.
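
To make the idea of a token-frequency corpus more concrete, here is a minimal 
sketch of how per-token counts learned from ham and spam could be combined into 
a single spam probability. It is a simplified naive-Bayes style illustration, 
not the code used by the BayesianAnalysis mailet: the class name, the neutral 
0.5 score for unseen tokens and the 0.01/0.99 clamping are all assumptions made 
for the example, and the corpus lives in plain maps rather than a JDBC database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: NOT the implementation used by the BayesianAnalysis mailet. */
final class TokenCorpusSketch {
    // In-memory stand-ins for the ham and spam token tables (hypothetical layout).
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private int hamMessages;
    private int spamMessages;

    /** "Training": record the tokens of one known-good message. */
    void trainHam(List<String> tokens) {
        hamMessages++;
        for (String t : tokens) hamCounts.merge(t, 1, Integer::sum);
    }

    /** "Training": record the tokens of one known-spam message. */
    void trainSpam(List<String> tokens) {
        spamMessages++;
        for (String t : tokens) spamCounts.merge(t, 1, Integer::sum);
    }

    /** Spam probability of a single token; unseen tokens are treated as neutral (assumption). */
    double tokenProbability(String token) {
        int h = hamCounts.getOrDefault(token, 0);
        int s = spamCounts.getOrDefault(token, 0);
        if (h + s == 0) return 0.5;
        double hamRate = Math.min(1.0, (double) h / Math.max(1, hamMessages));
        double spamRate = Math.min(1.0, (double) s / Math.max(1, spamMessages));
        double p = spamRate / (hamRate + spamRate);
        // Clamp so that a single token can never decide the result on its own.
        return Math.max(0.01, Math.min(0.99, p));
    }

    /** Combine the per-token probabilities into one probability for the whole message. */
    double spamProbability(List<String> tokens) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (String t : tokens) {
            double p = tokenProbability(t);
            spamProduct *= p;
            hamProduct *= (1.0 - p);
        }
        return spamProduct / (spamProduct + hamProduct);
    }

    public static void main(String[] args) {
        TokenCorpusSketch corpus = new TokenCorpusSketch();
        corpus.trainSpam(Arrays.asList("buy", "pills", "now"));
        corpus.trainHam(Arrays.asList("meeting", "at", "noon"));
        System.out.println(corpus.spamProbability(Arrays.asList("buy", "pills")));
        System.out.println(corpus.spamProbability(Arrays.asList("meeting", "noon")));
    }
}
```

In this picture, "training" simply means adding more known ham and spam messages 
to the counts, so that later messages are scored against better frequencies.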

After a training session, the corpus must be rebuilt from the database in order 
to acquire the new frequencies. Every 10 minutes a special thread will check if 
any change was made to the database by the feeder, and rebuild the corpus for 
this mailet if necessary.
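
The periodic check described above can be pictured roughly as follows. This is 
only a sketch under stated assumptions: the class and method names are 
hypothetical, and a simple in-memory version counter stands in for whatever 
marker the feeder actually leaves in the database to signal a change.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: a "rebuild the corpus only if something changed" loop. */
final class CorpusReloaderSketch {
    // Hypothetical stand-in for a "last updated" marker bumped by the feeder.
    private final AtomicLong databaseVersion = new AtomicLong();
    private volatile long loadedVersion = -1;
    private volatile int reloadCount;

    /** Called (conceptually) by the feeder after it has written new frequencies. */
    void markDatabaseChanged() {
        databaseVersion.incrementAndGet();
    }

    /** Rebuilds the in-memory corpus only when the database has changed since the last load. */
    synchronized boolean reloadIfChanged() {
        long current = databaseVersion.get();
        if (current == loadedVersion) {
            return false; // nothing new, keep the corpus already in memory
        }
        // A real mailet would re-read the ham and spam token tables here.
        loadedVersion = current;
        reloadCount++;
        return true;
    }

    int reloadCount() {
        return reloadCount;
    }

    /** Starts a daemon thread that runs the check at a fixed interval. */
    ScheduledExecutorService startPeriodicCheck(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "corpus-reloader");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::reloadIfChanged, period, period, unit);
        return scheduler;
    }
}
```

With this sketch, startPeriodicCheck(10, TimeUnit.MINUTES) would correspond to 
the 10-minute interval mentioned above.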

An org.apache.james.spam.probability mail attribute is created containing 
the computed spam probability as a java.lang.Double. A message header, named 
as specified in the headerName init parameter, is also created containing the 
same probability in floating-point representation.

The spam probability is prepended to the subject if it is greater than 0.1 (10%).

Here follows an example of config.xml definitions deploying the analysis mailet:

         <!-- Score every message and record the probability in a header. -->
         <mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
            <repositoryPath>db://maildb</repositoryPath>
            <maxSize>200000</maxSize>
            <headerName>X-MessageIsSpamProbability</headerName>
            <ignoreLocalSender>true</ignoreLocalSender>
         </mailet>

         <!-- Flag messages whose probability is above 0.90. -->
         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
            <name>X-MessageIsSpam</name>
            <value>true</value>
         </mailet>

         <!-- Send messages whose probability is above 0.99 to the spam processor. -->
         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
            <processor>spam</processor>
            <notice>Spam not accepted</notice>
         </mailet>
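
The decision logic expressed by the two CompareNumericHeaderValue matchers above 
can be read as a simple threshold table. The sketch below mirrors it in plain 
Java purely as an illustration. The class, enum and method names are 
hypothetical and are not part of James. An unparsable or missing header value 
is treated as "no match", in the spirit of onMatchException="noMatch".

```java
/** Illustrative only: the thresholds of the example configuration as plain Java. */
final class SpamRoutingSketch {
    enum Action { DELIVER, TAG_AS_SPAM, SEND_TO_SPAM_PROCESSOR }

    /**
     * Maps the value of the X-MessageIsSpamProbability header to an action.
     * In the example configuration a message above 0.99 is also tagged, since it
     * is above 0.90 too; this sketch reports only the strongest action.
     */
    static Action route(String headerValue) {
        if (headerValue == null) {
            return Action.DELIVER; // header absent: neither matcher matches
        }
        double probability;
        try {
            probability = Double.parseDouble(headerValue.trim());
        } catch (NumberFormatException e) {
            return Action.DELIVER; // not a number: treated as "no match"
        }
        if (probability > 0.99) {
            return Action.SEND_TO_SPAM_PROCESSOR;
        }
        if (probability > 0.90) {
            return Action.TAG_AS_SPAM;
        }
        return Action.DELIVER;
    }

    public static void main(String[] args) {
        System.out.println(route("0.5"));
        System.out.println(route("0.95"));
        System.out.println(route("0.995"));
    }
}
```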

The required tables are automatically created if not already there (see 
sqlResources.xml). The token field in both the ham and spam tables is case 
sensitive.


The parameters' syntax is as follows:
        <repositoryPath>: a URL pointing to the <data-source> containing the 
database tables used (typically db://maildb).
        <headerName>: the name of the header to add with the spam probability 
(default is X-MessageIsSpamProbability).
        <ignoreLocalSender>: true if you want to ignore messages coming from 
local senders (default is false). By local sender we mean a return-path with a 
local server part (a server listed in <servernames> in config.xml).
        <maxSize>: the maximum message size (in bytes) that a message may have 
to be considered spam (default is 100000).


The BayesianAnalysisFeeder mailet feeds ham or spam messages to train the 
BayesianAnalysis mailet.

The new token frequencies are stored in a JDBC database: the Bayesian database 
tables are updated during training to reflect the new data.

The correct approach is to send the original ham/spam message as an attachment 
to another message sent to the feeder; all the headers of the enveloping 
message will be removed and only the original message's tokens will be analyzed 
and used for feeding. This is because all the tokens of a message, including 
its headers, are examined by the BayesianAnalysis mailet, so the feeding 
process must be consistent with it.

After a training session, the frequency corpus used by the BayesianAnalysis 
mailet must be rebuilt from the database, in order to take advantage of the new 
token frequencies. Every 10 minutes a special thread in the BayesianAnalysis 
mailet will check if any change was made to the database, and rebuild its 
corpus if necessary.

Only one message at a time is scanned (the database update activity is 
synchronized) in order to avoid too much database locking, as thousands of rows 
may be updated just for one message being fed.
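
As a rough illustration of why the update is serialized, the sketch below feeds 
messages from several threads while allowing only one of them to update the 
token counts at a time. It is not the feeder's actual code: the class name is 
hypothetical and an in-memory map stands in for the database tables.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: one message at a time updates the (stand-in) token table. */
final class SynchronizedFeederSketch {
    // Stand-in for the token table; the real feeder writes to a JDBC database.
    private final Map<String, Integer> tokenCounts = new HashMap<>();
    private final Object updateLock = new Object();

    /** Feeds one message; concurrent callers wait so that updates never interleave. */
    void feed(List<String> tokens) {
        synchronized (updateLock) {
            for (String token : tokens) {
                tokenCounts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Returns how many times a token has been fed so far. */
    int count(String token) {
        synchronized (updateLock) {
            return tokenCounts.getOrDefault(token, 0);
        }
    }
}
```

Because a single message can touch thousands of token rows, letting one message 
finish before the next one starts keeps the set of rows being modified at any 
moment small.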


Here follows an example of config.xml definitions deploying the feeder mailet:
         <!-- "not spam (ham)" bayesian analysis feeder. -->
         <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>ham</feedType>
            <maxSize>200000</maxSize>
         </mailet>

         <!-- "spam" bayesian analysis feeder. -->
         <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>spam</feedType>
            <maxSize>200000</maxSize>
         </mailet>

The parameters' syntax is as follows:
        <repositoryPath>: a URL pointing to the <data-source> containing the 
database tables used (typically db://maildb).
        <feedType>: the type of message being fed. The possible values are 
either ham (good messages) or spam.
        <maxSize>: the maximum message size (in bytes) that a message may have 
to be used for feeding (default is 100000).


The previous example will allow the user to send messages to the server and use 
the recipient email address as the indicator for whether the message is ham or 
spam.

Using the example above, send good messages (ham, not spam) to the email address 
"[MAILTO] [EMAIL PROTECTED]" to pump good messages into the feeder, and send 
spam messages to the email address "[MAILTO] [EMAIL PROTECTED]" to pump spam 
messages into the feeder. It is a good idea to activate SMTP AUTH and replace 
thisdomain.com with a domain not listed as a server in <servernames> in 
config.xml; this way only authenticated users can feed the corpus. An example 
of addresses to use could be "[MAILTO] [EMAIL PROTECTED]" and 
"[MAILTO] [EMAIL PROTECTED]".

[1] http://en.wikipedia.org/wiki/Bayesian_probability
