[James Wiki] Update of "Bayesian Analysis" by VincenzoGianferrari

Apache Wiki Fri, 20 May 2005 11:10:52 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "James Wiki" for change 
notification.


The following page has been changed by VincenzoGianferrari:
http://wiki.apache.org/james/Bayesian_Analysis

------------------------------------------------------------------------------
  
  It is based upon the principles described in ''A Plan For Spam'' 
(http://www.paulgraham.com/spam.html) by Paul Graham, and has been extended to 
incorporate his ''Better Bayesian Filtering'' (http://paulgraham.com/better.html).
  
- The analysis capabilities are based on token frequencies (the ''Corpus'') 
learned through a training process using the '''B''''''ayesianAnalysisFeeder''' 
mailet (see below) and stored in a JDBC database.
+ The analysis capabilities are based on token frequencies (the ''corpus'') 
learned through a training process using the '''B''''''ayesianAnalysisFeeder''' 
mailet (see below) and stored in a JDBC database. During mailet initialization 
the corpus is loaded (built) from the database and kept in memory.
  
- After a training session, the Corpus must be rebuilt from the database in 
order to acquire the new frequencies. Every 10 minutes a special thread will 
check if any change was made to the database by the feeder, and rebuild the 
corpus for this mailet if necessary.
+ After a training session, the corpus must be rebuilt from the database in 
order to acquire the new frequencies. Every 10 minutes a special thread will 
check if any change was made to the database by the feeder, and rebuild the 
corpus for this mailet if necessary.
  
  A '''org.apache.james.spam.probability''' mail attribute will be created 
containing the computed spam probability as a java.lang.Double.
  A ''message header'' string named as specified in the '''headerName''' init 
parameter will be created containing such probability in floating point 
representation.
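
As a rough illustration of how a spam probability can be derived from token frequencies, the following Java sketch follows the formulas given in ''A Plan For Spam''. It is not the code of the mailet itself: the class name, method names and the handling of rare tokens are assumptions made for this example, and the selection of the most "interesting" tokens described by Graham is left out.

```java
// Minimal sketch of the formulas in Paul Graham's "A Plan For Spam".
// GrahamSketch and its methods are hypothetical names, not part of James.
final class GrahamSketch {

    // Per-token spam probability.
    // goodCount/badCount: occurrences of the token in the ham/spam corpus.
    // nGood/nBad: number of ham/spam messages used for training (assumed > 0).
    static double tokenProbability(int goodCount, int badCount, int nGood, int nBad) {
        double g = 2.0 * goodCount; // ham counts are doubled to reduce false positives
        double b = badCount;
        if (g + b < 5) {
            // Simplification for this sketch: rare and unseen tokens both get
            // the neutral-ish 0.4 that Graham assigns to never-seen words.
            return 0.4;
        }
        double goodRatio = Math.min(1.0, g / nGood);
        double badRatio = Math.min(1.0, b / nBad);
        double p = badRatio / (goodRatio + badRatio);
        // Clamp so that no single token is ever treated as absolute proof.
        return Math.max(0.01, Math.min(0.99, p));
    }

    // Combined probability of a message given the probabilities of its tokens.
    static double combine(double[] tokenProbabilities) {
        double product = 1.0;
        double inverseProduct = 1.0;
        for (double p : tokenProbabilities) {
            product *= p;
            inverseProduct *= (1.0 - p);
        }
        return product / (product + inverseProduct);
    }
}
```

A token seen only in spam is clamped to 0.99, one seen only in ham to 0.01, and two tokens of probability 0.5 combine to 0.5, leaving the message undecided.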
@@ -20, +20 @@

  The init parameters are as follows:
  
   *    '''<repositoryPath>''': a URL pointing to the <data-source> containing 
the database tables used (typically ''db://maildb'').
-  *    '''<headerName>''': the header name to add with the spam probability 
(default is ''X-MessageIsSpamProbability'').
+  *    '''<headerName>''': the header name to add with the spam probability 
(default is ''X-M''''''essageIsSpamProbability'').
-  *    '''<ignoreLocalSender>''': true if you want to ignore messages coming 
from local senders (default is false). By ''local sender'' we mean a 
''return-path'' with a local server part (server listed in <servernames> in 
config.xml)..
+  *    '''<ignoreLocalSender>''': ''true'' if you want to ignore messages 
coming from local senders (default is ''false''). By ''local sender'' we mean a 
''return-path'' with a local server part (server listed in <servernames> in 
config.xml).
   *    '''<maxSize>''': the maximum message size (in bytes) that a message may 
have to be considered spam (default is ''100000'').
  
  The probability of being spam is prepended to the subject if it is > 0.1 
(10%).
@@ -62, +62 @@

  
  == BayesianAnalysisFeeder mailet ==
  
- The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages 
to train the '''B''''''ayesianAnalysis''' mailet.
+ The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages 
to train the '''B''''''ayesianAnalysis''' mailet (see above).
  
  The new token frequencies are stored in a JDBC database.
  
  The Bayesian database tables are updated during training to reflect the 
new data.
  At the end the mail is destroyed (ghosted).
  
- '''The correct approach is to send the original ham/spam message as an 
attachment to another message sent to the feeder; all the headers of the 
enveloping message will be removed and only the original message's tokens will 
be analyzed'''.
+ '''The correct approach is to send the original ham/spam message as an 
attachment to another message sent to the feeder; all the headers of the 
enveloping message will be removed and only the original message's tokens will 
be analyzed and used for feeding'''.
+ This is because '''all''' the tokens of a message are examined by the 
B''''''ayesianAnalysis mailet (''including headers''), and hence the feeding 
process must be consistent.
  
- After a training session, the frequency ''Corpus'' used by the 
'''B''''''ayesianAnalysis''' mailet must be rebuilt from the database, in order 
to take advantage of the new token frequencies.
+ After a training session, the frequency ''corpus'' used by the 
B''''''ayesianAnalysis mailet must be rebuilt from the database, in order to 
take advantage of the new token frequencies.
- Every 10 minutes a special thread in the '''B''''''ayesianAnalysis''' mailet 
will check if any change was made to the database, and rebuild the ''Corpus'' 
if necessary.
+ Every 10 minutes a special thread in the B''''''ayesianAnalysis mailet will 
check if any change was made to the database, and rebuild its corpus if 
necessary.
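
The periodic check described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the mailet's actual code: the class name, the timestamp supplier and the rebuild callback are hypothetical stand-ins for "when did the feeder last change the database" and "reload the corpus from the database".

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch of a "check every 10 minutes, rebuild if changed" loop.
final class CorpusReloader {
    private final LongSupplier lastDatabaseUpdate; // assumed: time of the feeder's last update
    private final Runnable rebuildCorpus;          // assumed: reloads the corpus from the database
    private long lastCorpusLoad;                   // database update time seen at the last (re)build

    CorpusReloader(LongSupplier lastDatabaseUpdate, Runnable rebuildCorpus, long initialLoadTime) {
        this.lastDatabaseUpdate = lastDatabaseUpdate;
        this.rebuildCorpus = rebuildCorpus;
        this.lastCorpusLoad = initialLoadTime;
    }

    // Rebuilds only if the database changed since the last load.
    // Returns true when a rebuild was performed.
    synchronized boolean checkOnce() {
        long updated = lastDatabaseUpdate.getAsLong();
        if (updated > lastCorpusLoad) {
            rebuildCorpus.run();
            lastCorpusLoad = updated;
            return true;
        }
        return false;
    }

    // Starts a daemon thread that runs the check every 10 minutes.
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "corpus-reloader");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> { checkOnce(); }, 10, 10, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Comparing a timestamp before rebuilding keeps the check cheap: the full rebuild only runs after the feeder has actually changed the database.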
  
- Only one message at a time is scanned (the database update activity is 
''synchronized'') in order to avoid too much database locking, as thousands of 
rows may be updated just for one message fed.
+ Only one message at a time is scanned (the database update activity is 
''synchronized'') in order to avoid too much database locking, as thousands of 
rows may be updated just for one message being fed.
  
  === Initialization Parameters ===
  
@@ -93, +94 @@

  ...
  
           <!-- "not spam" bayesian analysis feeder. -->
-          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
              <repositoryPath> db://maildb </repositoryPath>
              <feedType>ham</feedType>
              <maxSize>200000</maxSize>
           </mailet>
  
           <!-- "spam" bayesian analysis feeder. -->
-          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+          <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
              <repositoryPath> db://maildb </repositoryPath>
              <feedType>spam</feedType>
              <maxSize>200000</maxSize>
@@ -112, +113 @@

  
  The previous example will allow the user to send messages to the server and 
use the recipient email address as the indicator for whether the message is ham 
or spam.
  
- Using the example above, send good messages (ham not spam) to the email 
address "[EMAIL PROTECTED]" to pump good messages into the feeder, and send 
spam messages (spam not ham) to the email address ''[EMAIL PROTECTED]'' to pump 
spam messages into the feeder.
+ Using the example above, send good messages (ham not spam) to the email 
address "[EMAIL PROTECTED]" to pump good messages into the feeder, and send 
spam messages (spam not ham) to the email address "[EMAIL PROTECTED]" to pump 
spam messages into the feeder.
+ It is a good idea to activate SMTP AUTH and replace ''thisdomain.com'' with a 
domain ''not'' listed as a server in <servernames> in config.xml: this way only 
authenticated users can feed the corpus. An example of addresses to use could 
be "[EMAIL PROTECTED]" and "[EMAIL PROTECTED]".
