I'd like to improve this text, but I'm not sure I fully understand the concept
of "training" the mailet.
Any suggestions/explanations would be gladly received! ;D
Thanks, Alessandro
--
DELGADO, Alessandro
adelgado1313 [at] gmail [dot] com
Rio de Janeiro, RJ - BRASIL
How to prevent Spam?
There are multiple ways to prevent spam with James, the most popular being the
BayesianAnalysis mailet. It scans each message and determines the probability
that it is spam, using [Bayesian probability theory techniques|1].
The analysis capabilities are based on token frequencies (the corpus) learned
through a training process using the BayesianAnalysisFeeder mailet and stored
in a JDBC database. During mailet initialization the corpus is loaded from the
database and kept in memory.
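To make the idea of "training" concrete, here is a minimal sketch of a token-frequency corpus. It is not James's actual code: the class TokenCorpus, its method names, the whitespace tokenizer, the 0.4 value for unseen tokens and the 0.01/0.99 clamps are all assumptions chosen for illustration. Training just counts tokens from messages already known to be ham or spam; scoring turns those counts into per-token probabilities and combines them.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory corpus; names and constants are illustrative, not James's API. */
class TokenCorpus {
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private int hamMessages = 0;
    private int spamMessages = 0;

    /** "Training": count every token of a message already known to be ham or spam. */
    void train(String text, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // tokens are kept case sensitive
            }
        }
        if (isSpam) {
            spamMessages++;
        } else {
            hamMessages++;
        }
    }

    /** Per-token spam probability from relative frequencies, clamped away from 0 and 1. */
    double tokenProbability(String token) {
        int ham = hamCounts.getOrDefault(token, 0);
        int spam = spamCounts.getOrDefault(token, 0);
        if (ham + spam == 0) {
            return 0.4; // assumed value for tokens never seen during training
        }
        double hamFreq = hamMessages == 0 ? 0.0 : Math.min(1.0, (double) ham / hamMessages);
        double spamFreq = spamMessages == 0 ? 0.0 : Math.min(1.0, (double) spam / spamMessages);
        double p = spamFreq / (hamFreq + spamFreq);
        return Math.max(0.01, Math.min(0.99, p));
    }

    /**
     * Combine token probabilities: P = prod(p) / (prod(p) + prod(1 - p)).
     * A production filter would restrict this to the most significant tokens
     * to avoid floating-point underflow on long messages.
     */
    double spamProbability(String text) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            double p = tokenProbability(token);
            spamProduct *= p;
            hamProduct *= (1.0 - p);
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

In this sketch, a token seen only in spam pushes the score towards 1, a token seen only in ham pushes it towards 0, and a message with no tokens scores 0.5.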
After a training session, the corpus must be rebuilt from the database in order
to acquire the new frequencies. Every 10 minutes a special thread will check if
any change was made to the database by the feeder, and rebuild the corpus for
this mailet if necessary.
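The periodic check described above could look roughly like the following sketch. It assumes a simple version counter that the feeder bumps after each update; how James really detects a database change is not stated here, so CorpusReloader and all of its members are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical reload loop; the version counter stands in for "the database changed". */
class CorpusReloader {
    private final AtomicLong databaseVersion = new AtomicLong(); // bumped by the feeder
    private long loadedVersion = -1;                             // version currently in memory
    private int rebuilds = 0;

    /** Called by the feeder after it has written new token frequencies. */
    void markDatabaseChanged() {
        databaseVersion.incrementAndGet();
    }

    /** Rebuild only if the database changed since the last load; returns true if rebuilt. */
    synchronized boolean rebuildIfChanged() {
        long current = databaseVersion.get();
        if (current == loadedVersion) {
            return false;
        }
        // A real implementation would reload the ham and spam token tables here.
        loadedVersion = current;
        rebuilds++;
        return true;
    }

    synchronized int rebuildCount() {
        return rebuilds;
    }

    /** Run the check every 10 minutes on a daemon thread. */
    ScheduledExecutorService start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "corpus-reloader");
            t.setDaemon(true);
            return t;
        });
        timer.scheduleWithFixedDelay(this::rebuildIfChanged, 10, 10, TimeUnit.MINUTES);
        return timer;
    }
}
```

The practical consequence is that newly fed messages influence the analysis only after the next check, so up to about 10 minutes may pass before training takes effect.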
An org.apache.james.spam.probability mail attribute is created containing
the computed spam probability as a java.lang.Double. A message header, named as
specified by the headerName init parameter, is added containing the same
probability in floating-point representation.
The spam probability is prepended to the subject if it is greater than 0.1 (10%).
Here follows an example of config.xml definitions deploying the analysis mailet:
<mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
  <repositoryPath>db://maildb</repositoryPath>
  <maxSize>200000</maxSize>
  <headerName>X-MessageIsSpamProbability</headerName>
  <ignoreLocalSender>true</ignoreLocalSender>
</mailet>
<mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
  <name>X-MessageIsSpam</name>
  <value>true</value>
</mailet>
<mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
  <processor> spam </processor>
  <notice>Spam not accepted</notice>
</mailet>
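The effect of the three definitions in this example can be summarised as a small decision function. This is only a sketch of the thresholds shown in the configuration (strictly greater than 0.90 and 0.99); SpamRouting and actionsFor are hypothetical names, and the returned strings are plain descriptions, not James API calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical summary of the example rules; not part of James. */
class SpamRouting {
    /** Lists what the example configuration would do for a given probability. */
    static List<String> actionsFor(double probability) {
        List<String> actions = new ArrayList<>();
        // BayesianAnalysis records the probability in the configured header.
        actions.add("X-MessageIsSpamProbability: " + probability);
        if (probability > 0.90) {
            // AddHeader rule: flag the message as spam.
            actions.add("X-MessageIsSpam: true");
        }
        if (probability > 0.99) {
            // ToProcessor rule: hand the message to the "spam" processor.
            actions.add("processor: spam");
        }
        return actions;
    }
}
```

A message scoring 0.95 is therefore flagged but still delivered, while one scoring above 0.99 is both flagged and diverted.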
The required tables are automatically created if not already there (see
sqlResources.xml). The token field in both the ham and spam tables is case
sensitive.
The parameters' syntax is as follows:
<repositoryPath>: a URL pointing to the <data-source> containing the
database tables used (typically db://maildb).
<headerName>: the name of the header to add with the spam probability (default
is X-MessageIsSpamProbability).
<ignoreLocalSender>: true if you want to ignore messages coming from
local senders (default is false). A local sender is one whose return-path has a
local server part (a server listed in <servernames> in config.xml).
<maxSize>: the maximum size (in bytes) that a message may have
to be considered spam (default is 100000).
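The way maxSize and ignoreLocalSender interact can be pictured as a pre-check run before any analysis. The sketch below assumes that messages strictly larger than maxSize are skipped and that the server part of the return-path is compared case-insensitively; both points are assumptions, as are the names AnalysisFilter and shouldAnalyze and the example.org domain used in the usage note.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-check; the exact comparison rules in James may differ. */
class AnalysisFilter {
    static boolean shouldAnalyze(long sizeInBytes, String returnPath,
                                 Set<String> localServers,
                                 boolean ignoreLocalSender, long maxSize) {
        if (sizeInBytes > maxSize) {
            return false; // assumed: oversized messages are never considered spam
        }
        if (ignoreLocalSender && returnPath != null) {
            int at = returnPath.lastIndexOf('@');
            if (at >= 0) {
                String server = returnPath.substring(at + 1).toLowerCase(Locale.ROOT);
                if (localServers.contains(server)) {
                    return false; // sender belongs to a server listed in <servernames>
                }
            }
        }
        return true;
    }
}
```

For instance, with example.org as the only local server and ignoreLocalSender set to true, a 5000-byte message from user@example.org would be skipped, while the same message from a remote sender would be analyzed.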
The BayesianAnalysisFeeder mailet feeds ham or spam messages to train the
BayesianAnalysis mailet.
The new token frequencies are stored in a JDBC database: the Bayesian database
tables are updated during training to reflect the new data.
The correct approach is to send the original ham/spam message as an attachment
to another message sent to the feeder; all the headers of the enveloping
message will be removed and only the original message's tokens will be analyzed
and used for feeding. This is because the BayesianAnalysis mailet examines all
the tokens of a message (including headers), so the feeding process must be
consistent with it.
After a training session, the frequency corpus used by the BayesianAnalysis
mailet must be rebuilt from the database, in order to take advantage of the new
token frequencies. Every 10 minutes a special thread in the BayesianAnalysis
mailet will check if any change was made to the database, and rebuild its
corpus if necessary.
Only one message at a time is scanned (the database update activity is
synchronized) in order to avoid too much database locking, as thousands of rows
may be updated just for one message being fed.
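A minimal sketch of that serialisation follows. It uses one lock shared by every feeder instance, so that a ham feeder and a spam feeder would not update concurrently; whether James uses a shared or per-instance lock is an assumption here, and the in-memory map merely stands in for the JDBC tables. SerializedFeeder and its members are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical feeder; the map stands in for the JDBC token table. */
class SerializedFeeder {
    // One lock for all feeder instances: only one message is processed at a time.
    private static final Object DB_LOCK = new Object();
    private final Map<String, Integer> table = new HashMap<>();

    /** Update the token counts for one message while holding the shared lock. */
    void feed(String messageText) {
        synchronized (DB_LOCK) {
            for (String token : messageText.split("\\s+")) {
                if (!token.isEmpty()) {
                    table.merge(token, 1, Integer::sum);
                }
            }
        }
    }

    /** Read a token count under the same lock so readers see complete updates. */
    int count(String token) {
        synchronized (DB_LOCK) {
            return table.getOrDefault(token, 0);
        }
    }
}
```

The trade-off is throughput: feeding is slower under load, but each message's many row updates are applied as one uninterrupted batch.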
Here follows an example of config.xml definitions deploying the feeder mailet:
<!-- "not spam (ham)" bayesian analysis feeder. -->
<mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
  <repositoryPath> db://maildb </repositoryPath>
  <feedType>ham</feedType>
  <maxSize>200000</maxSize>
</mailet>
<!-- "spam" bayesian analysis feeder. -->
<mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
  <repositoryPath> db://maildb </repositoryPath>
  <feedType>spam</feedType>
  <maxSize>200000</maxSize>
</mailet>
The parameters' syntax is as follows:
<repositoryPath>: a URL pointing to the <data-source> containing the
database tables used (typically db://maildb).
<feedType>: the type of message being fed. The possible values are
either ham (good messages) or spam.
<maxSize>: the maximum size (in bytes) that a message may have
to be fed to the corpus (default is 100000).
The previous example lets users send messages to the server, with the
recipient email address indicating whether each message is ham or spam.
Using the example above, send good messages (ham) to the email address
"[MAILTO] [EMAIL PROTECTED]" to pump them into the feeder, and send
spam messages to the email address "[MAILTO] [EMAIL PROTECTED]"
to pump them into the feeder. It is a good idea to activate SMTP AUTH
and replace thisdomain.com with a domain not listed as a server in
<servernames> in config.xml, so that only authenticated users can feed the
corpus. An example of addresses to use could be "[MAILTO] [EMAIL PROTECTED]"
and "[MAILTO] [EMAIL PROTECTED]".
[1] http://en.wikipedia.org/wiki/Bayesian_probability