Dear Wiki user,

You have subscribed to a wiki page or wiki category on "James Wiki" for change notification.
The following page has been changed by VincenzoGianferrari:
http://wiki.apache.org/james/Bayesian_Analysis

------------------------------------------------------------------------------
  It is based upon the principals described in ''A Plan For Spam'' (http://www.paulgraham.com/spam.html) by Paul Graham, and has been extended to his ''Better Bayesian Filtering'' (http://paulgraham.com/better.html).
- The analysis capabilities are based on token frequencies (the ''Corpus'') learned through a training process using the '''B''''''ayesianAnalysisFeeder''' mailet (see below) and stored in a JDBC database.
+ The analysis capabilities are based on token frequencies (the ''corpus'') learned through a training process using the '''B''''''ayesianAnalysisFeeder''' mailet (see below) and stored in a JDBC database.
  During mailet initialization the corpus is loaded (built) from the database and kept in memory.
- After a training session, the Corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a special thread will check if any change was made to the database by the feeder, and rebuild the corpus for this mailet if necessary.
+ After a training session, the corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a special thread will check if any change was made to the database by the feeder, and rebuild the corpus for this mailet if necessary.
  A '''org.apache.james.spam.probability''' mail attribute will be created containing the computed spam probability as a java.lang.Double. A ''message header'' string named as specified in the '''headerName''' init parameter will be created containing such probability in floating point representation.
@@ -20, +20 @@
  The init parameters are as follows:
  * '''<repositoryPath>''': an url pointing to the <data-source> containing the database tables used (typically ''db://maildb'').
- * '''<headerName>''': the header name to add with the spam probability (default is ''X-MessageIsSpamProbability'').
+ * '''<headerName>''': the header name to add with the spam probability (default is ''X-M''''''essageIsSpamProbability'').
- * '''<ignoreLocalSender>''': true if you want to ignore messages coming from local senders (default is false). By ''local sender'' we mean a ''return-path'' with a local server part (server listed in <servernames> in config.xml)..
+ * '''<ignoreLocalSender>''': ''true'' if you want to ignore messages coming from local senders (default is ''false''). By ''local sender'' we mean a ''return-path'' with a local server part (server listed in <servernames> in config.xml)..
  * '''<maxSize>''': the maximum message size (in bytes) that a message may have to be considered spam (default is ''100000'').
  The probability of being spam is pre-pended to the subject if it is > 0.1 (10%).
@@ -62, +62 @@
  == BayesianAnalysisFeeder mailet ==
- The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages to train the '''B''''''ayesianAnalysis''' mailet.
+ The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages to train the '''B''''''ayesianAnalysis''' mailet (see above).
  The new token frequencies are stored in a JDBC database. The bayesian database tables are updated during the training reflecting the new data. At the end the mail is destroyed (ghosted).
- '''The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed'''.
+ '''The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed and used for feeding'''.
+ This because '''all''' the tokens of a message are examined by the B''''''ayesianAnalysis mailet (''including headers''), and hence the feeding process must be consistent.
- After a training session, the frequency ''Corpus'' used by the '''B''''''ayesianAnalysis''' mailet must be rebuilt from the database, in order to take advantage of the new token frequencies.
+ After a training session, the frequency ''corpus'' used by the B''''''ayesianAnalysis mailet must be rebuilt from the database, in order to take advantage of the new token frequencies.
- Every 10 minutes a special thread in the '''B''''''ayesianAnalysis''' mailet will check if any change was made to the database, and rebuild the ''Corpus'' if necessary.
+ Every 10 minutes a special thread in the B''''''ayesianAnalysis mailet will check if any change was made to the database, and rebuild its corpus if necessary.
- Only one message at a time is scanned (the database update activity is ''synchronized'') in order to avoid too much database locking, as thousands of rows may be updated just for one message fed.
+ Only one message at a time is scanned (the database update activity is ''synchronized'') in order to avoid too much database locking, as thousands of rows may be updated just for one message being fed.
  === Initialization Parameters ===
@@ -93, +94 @@
  ...
  <!-- "not spam" bayesian analysis feeder. -->
- <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+ <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
    <repositoryPath> db://maildb </repositoryPath>
    <feedType>ham</feedType>
    <maxSize>200000</maxSize>
  </mailet>
  <!-- "spam" bayesian analysis feeder. -->
- <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+ <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
    <repositoryPath> db://maildb </repositoryPath>
    <feedType>spam</feedType>
    <maxSize>200000</maxSize>
@@ -112, +113 @@
  The previous example will allow the user to send messages to the server and use the recipient email address as the indicator for whether the message is ham or spam.
- Using the example above, send good messages (ham not spam) to the email address "[EMAIL PROTECTED]" to pump good messages into the feeder, and send spam messages (spam not ham) to the email address ''[EMAIL PROTECTED]'' to pump spam messages into the feeder.
+ Using the example above, send good messages (ham not spam) to the email address "[EMAIL PROTECTED]" to pump good messages into the feeder, and send spam messages (spam not ham) to the email address "[EMAIL PROTECTED]" to pump spam messages into the feeder.
+ It is a good idea to activate SMTP AUTH and replace ''thisdomain.com'' with a domain ''not'' listed as a server in <servernames> in config.xml: this way only authenticated users can feed the corpus. An example of addresses to use could be "[EMAIL PROTECTED]" and "[EMAIL PROTECTED]".
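
For readers who want a feel for how token frequencies in the corpus can turn into a spam probability, here is a minimal sketch of the scoring described in Paul Graham's ''A Plan For Spam'', the essay the page cites. It is not taken from the James source: the class name GrahamSketch, the method names and their parameters are hypothetical, and the essay's rule of skipping tokens seen fewer than five times is left out for brevity. It also assumes both message counts are greater than zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of Graham-style scoring; not the James implementation. */
public class GrahamSketch {

    /**
     * Spam probability of one token from corpus counts.
     * Ham occurrences are doubled to bias against false positives,
     * and the result is clamped to [0.01, 0.99] as in the essay.
     */
    public static double tokenProbability(int hamCount, int spamCount,
                                          int hamMessages, int spamMessages) {
        double g = Math.min(1.0, 2.0 * hamCount / hamMessages);
        double b = Math.min(1.0, (double) spamCount / spamMessages);
        if (g + b == 0.0) {
            return 0.4; // token never seen: the essay's default
        }
        return Math.max(0.01, Math.min(0.99, b / (g + b)));
    }

    /**
     * Combines the most "interesting" token probabilities (those farthest
     * from 0.5) into a single message probability.
     */
    public static double combine(List<Double> tokenProbabilities, int maxInteresting) {
        List<Double> sorted = new ArrayList<>(tokenProbabilities);
        // Largest distance from the neutral value 0.5 comes first.
        sorted.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        int n = Math.min(maxInteresting, sorted.size());
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : sorted.subList(0, n)) {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

The essay uses the 15 most interesting tokens, so a call would look like GrahamSketch.combine(probabilities, 15). Under this scheme, two strongly spammy tokens push the result close to 1, while one spammy and one hammy token of equal strength cancel out to about 0.5.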