[James Wiki] Update of "Bayesian Analysis" by VincenzoGianferrari

Apache Wiki Thu, 19 May 2005 14:02:43 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "James Wiki" for change 
notification.


The following page has been changed by VincenzoGianferrari:
http://wiki.apache.org/james/Bayesian_Analysis

------------------------------------------------------------------------------
- = Bayesian Analysis - spam detection mailets using bayesian analysis 
techniques =
+ = Spam detection mailets using bayesian analysis techniques =
  
+ == BayesianAnalysis mailet ==
+ 
+ The '''''B''''''ayesianAnalysis''''' mailet scans a message and determines 
the probability that it is '''spam''', using ''bayesian probability theory'' 
techniques.
+ 
+ It is based upon the principles described in ''A Plan For Spam'' 
(http://www.paulgraham.com/spam.html) by Paul Graham, and has been extended 
with ideas from his ''Better Bayesian Filtering'' 
(http://paulgraham.com/better.html).
+ 
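As a rough illustration of the technique in ''A Plan For Spam'', the per-token 
spam probabilities of a message's most significant tokens are combined into a 
single message probability. The sketch below shows only that combining formula 
as given in the essay. It is not taken from the mailet's source, and the names 
GrahamCombiner and combine are hypothetical.

```java
/** Hypothetical sketch of the combining step from "A Plan For Spam"; not the James code. */
final class GrahamCombiner {

    /**
     * Combines per-token spam probabilities p1..pn into
     * (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn)).
     * Inputs are assumed to be clamped away from 0 and 1 (the essay uses 0.01..0.99),
     * which keeps the denominator from becoming zero.
     */
    static double combine(double... tokenProbabilities) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : tokenProbabilities) {
            spamProduct *= p;
            hamProduct *= (1.0 - p);
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

For example, GrahamCombiner.combine(0.9, 0.9) evaluates 0.81 / (0.81 + 0.01), 
so two moderately spammy tokens reinforce each other.
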
+ The analysis capabilities are based on token frequencies (the ''Corpus'') 
learned through a training process using the '''B''''''ayesianAnalysisFeeder''' 
mailet (see below) and stored in a JDBC database.
+ 
+ After a training session, the Corpus must be rebuilt from the database in 
order to acquire the new frequencies. Every 10 minutes a special thread will 
check if any change was made to the database by the feeder, and rebuild the 
corpus for this mailet if necessary.
+ 
+ A '''org.apache.james.spam.probability''' mail attribute will be created 
containing the computed spam probability as a java.lang.Double.
+ A ''message header'' with the name given by the '''headerName''' init 
parameter will also be added, containing the same probability in floating 
point representation.
+ 
+ === Initialization Parameters ===
+ 
+ The init parameters are as follows:
+ 
+  *    '''<repositoryPath>''': a URL pointing to the <data-source> containing 
the database tables used (typically ''db://maildb'').
+  *    '''<headerName>''': the header name to add with the spam probability 
(default is ''X-MessageIsSpamProbability'').
+  *    '''<ignoreLocalSender>''': true if you want to ignore messages coming 
from local senders (default is false). By ''local sender'' we mean a 
''return-path'' with a local server part (server listed in <servernames> in 
config.xml).
+  *    '''<maxSize>''': the maximum message size (in bytes) that a message may 
have to be considered spam (default is ''100000'').
+ 
+ The probability of being spam is prepended to the subject if it is greater 
than 0.1 (10%).
+ 
+ The required tables are automatically created if not already there (see 
sqlResources.xml).
+ The token field in both the ham and spam tables is '''case sensitive'''.
+ 
+ === A James config.xml example ===
+ 
+ Here follows an example of '''config.xml''' definitions deploying the 
analysis mailet:
+ 
+ {{{
+ 
+ ...
+ 
+          <mailet match="All" class="BayesianAnalysis" 
onMailetException="ignore">
+             <repositoryPath>db://maildb</repositoryPath>
+             <maxSize>200000</maxSize>
+             <headerName>X-MessageIsSpamProbability</headerName>
+             <ignoreLocalSender>true</ignoreLocalSender>
+          </mailet>
+      
+          <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability 
> 0.90" class="AddHeader" onMatchException="noMatch">
+             <name>X-MessageIsSpam</name>
+             <value>true</value>
+          </mailet>
+ 
+          <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability 
> 0.99" class="ToProcessor" onMatchException="noMatch">
+             <processor> spam </processor>
+             <notice>Spam not accepted</notice>
+          </mailet>
+ 
+ ...
+ 
+ }}}
+ 
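The two matcher rules in the example above amount to a threshold decision on 
the value of the probability header. The sketch below restates that decision 
in plain Java for illustration only. SpamRouting and decide are hypothetical 
names, the thresholds are the ones from the example, and treating a missing or 
malformed header as "no match" is an assumption modelled on 
onMatchException="noMatch", not a description of the matcher's source.

```java
/** Hypothetical restatement of the example's two threshold rules; not James code. */
final class SpamRouting {

    enum Action { PASS, FLAG, FLAG_AND_SPAM_PROCESSOR }

    /** headerValue is the text of the X-MessageIsSpamProbability header, or null if absent. */
    static Action decide(String headerValue) {
        if (headerValue == null) {
            return Action.PASS; // assumption: no header means neither rule matches
        }
        double probability;
        try {
            probability = Double.parseDouble(headerValue.trim());
        } catch (NumberFormatException e) {
            return Action.PASS; // assumption: unparsable value means neither rule matches
        }
        if (probability > 0.99) {
            // Both rules match: X-MessageIsSpam is added, then the mail goes to the spam processor.
            return Action.FLAG_AND_SPAM_PROCESSOR;
        }
        if (probability > 0.90) {
            // Only the first rule matches: X-MessageIsSpam: true is added.
            return Action.FLAG;
        }
        return Action.PASS;
    }
}
```

Both comparisons are strict, so a value of exactly 0.90 is neither flagged nor 
rejected.
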
+ 
+ 
+ == BayesianAnalysisFeeder mailet ==
+ 
+ The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds either ham or spam 
messages to train the '''B''''''ayesianAnalysis''' mailet.
+ 
+ The new token frequencies are stored in a JDBC database.
+ 
+ The bayesian database tables are updated during the training reflecting the 
new data.
+ At the end the mail is destroyed (ghosted).
+ 
+ '''The correct approach is to send the original ham/spam message as an 
attachment to another message sent to the feeder; all the headers of the 
enveloping message will be removed and only the original message's tokens will 
be analyzed'''.
+ 
+ After a training session, the frequency ''Corpus'' used by the 
'''B''''''ayesianAnalysis''' mailet must be rebuilt from the database, in order 
to take advantage of the new token frequencies.
+ Every 10 minutes a special thread in the '''B''''''ayesianAnalysis''' mailet 
will check if any change was made to the database, and rebuild the ''Corpus'' 
if necessary.
+ 
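The periodic check described above can be pictured as a scheduled task that 
compares a "database changed" marker with what the in-memory corpus last 
loaded. The following is a minimal sketch under stated assumptions: 
CorpusReloader, markDatabaseChanged and rebuildIfChanged are hypothetical 
names, a simple version counter stands in for whatever change marker the 
mailet really uses, and the actual JDBC reload is left as a comment.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a periodic "rebuild the corpus if the database changed" check. */
final class CorpusReloader {

    // Bumped by the (stand-in) feeder whenever it updates the token tables.
    private final AtomicLong databaseVersion = new AtomicLong();
    // Version of the database that the in-memory corpus was last built from.
    private long loadedVersion = 0;
    private int rebuildCount = 0;

    void markDatabaseChanged() {
        databaseVersion.incrementAndGet();
    }

    /** Rebuilds only when the database changed since the last build; returns true if it rebuilt. */
    synchronized boolean rebuildIfChanged() {
        long current = databaseVersion.get();
        if (current == loadedVersion) {
            return false;
        }
        // A real implementation would reload the token frequencies over JDBC here.
        loadedVersion = current;
        rebuildCount++;
        return true;
    }

    synchronized int getRebuildCount() {
        return rebuildCount;
    }

    /** Starts the check every 10 minutes; the caller must shut the scheduler down. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::rebuildIfChanged, 10, 10, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Keeping the check cheap (a single comparison) means the 10 minute timer costs 
almost nothing when no training has happened.
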
+ Only one message at a time is scanned (the database update activity is 
''synchronized'') in order to avoid too much database locking, as thousands of 
rows may be updated just for one message fed.
+ 
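To illustrate why feeding is serialized, the sketch below updates per-token 
counts inside a synchronized method, so only one message is processed at a 
time. It is a toy under stated assumptions: FeederSketch is a hypothetical 
name, in-memory maps stand in for the ham and spam database tables, and 
splitting on whitespace is a placeholder for the mailet's real tokenizer. 
Tokens are kept case sensitive, as the token field is in the tables.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the feeder's table updates; not the James implementation. */
final class FeederSketch {

    // In-memory stand-ins for the ham and spam token tables.
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Map<String, Integer> spamCounts = new HashMap<>();

    /** Processes one message at a time; a single message may touch many token rows. */
    synchronized void feed(String messageText, boolean isSpam) {
        Map<String, Integer> table = isSpam ? spamCounts : hamCounts;
        // Assumption: whitespace tokenization, used here only to keep the sketch short.
        for (String token : messageText.split("\\s+")) {
            if (!token.isEmpty()) {
                table.merge(token, 1, Integer::sum); // case-sensitive key
            }
        }
    }

    synchronized int count(String token, boolean isSpam) {
        return (isSpam ? spamCounts : hamCounts).getOrDefault(token, 0);
    }
}
```
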
+ === Initialization Parameters ===
+ 
+ The init parameters are as follows:
+ 
+  *    '''<repositoryPath>''': a URL pointing to the <data-source> containing 
the database tables used (typically ''db://maildb'').
+  *    '''<feedType>''': the type of message being fed. The possible values 
are either ''ham'' (good messages) or ''spam''.
+  *    '''<maxSize>''': the maximum message size (in bytes) that a message may 
have to be considered spam (default is ''100000'').
+ 
+ === A James config.xml example ===
+ 
+ Here follows an example of '''config.xml''' definitions deploying the feeder 
mailet:
+ 
+ {{{
+ 
+ ...
+ 
+          <!-- "not spam" bayesian analysis feeder. -->
+          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+             <repositoryPath> db://maildb </repositoryPath>
+             <feedType>ham</feedType>
+             <maxSize>200000</maxSize>
+          </mailet>
+ 
+          <!-- "spam" bayesian analysis feeder. -->
+          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+             <repositoryPath> db://maildb </repositoryPath>
+             <feedType>spam</feedType>
+             <maxSize>200000</maxSize>
+          </mailet>
+ 
+ ...
+ 
+ }}}
+ 
+ The previous example will allow the user to send messages to the server and 
use the recipient email address as the indicator for whether the message is ham 
or spam.
+ 
+ Using the example above, send good messages (ham) to the address matched by 
the first feeder (''[EMAIL PROTECTED]'') and spam messages to the address 
matched by the second feeder (''[EMAIL PROTECTED]'').
+
