Dear Wiki user, You have subscribed to a wiki page or wiki category on "James Wiki" for change notification.
The following page has been changed by VincenzoGianferrari:
http://wiki.apache.org/james/Bayesian_Analysis

------------------------------------------------------------------------------
- = Bayesian Analysis - spam detection mailets using bayesian analysis techniques =
+ = Spam detection mailets using Bayesian analysis techniques =
+ == BayesianAnalysis mailet ==
+
+ The '''''B''''''ayesianAnalysis''''' mailet scans a message and determines the probability that it is '''spam''', using ''Bayesian probability theory'' techniques.
+
+ It is based upon the principles described in ''A Plan For Spam'' (http://www.paulgraham.com/spam.html) by Paul Graham, and has been extended to follow his ''Better Bayesian Filtering'' (http://paulgraham.com/better.html).
+
+ The analysis capabilities are based on token frequencies (the ''Corpus'') learned through a training process using the '''B''''''ayesianAnalysisFeeder''' mailet (see below) and stored in a JDBC database.
+
+ After a training session, the Corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a special thread checks whether the feeder has changed the database and, if so, rebuilds the Corpus for this mailet.
+
+ An '''org.apache.james.spam.probability''' mail attribute will be created containing the computed spam probability as a java.lang.Double.
+ A ''message header'' string named as specified in the '''headerName''' init parameter will be created containing that probability in floating point representation.
+
+ === Initialization Parameters ===
+
+ The init parameters are as follows:
+
+  * '''<repositoryPath>''': a URL pointing to the <data-source> containing the database tables used (typically ''db://maildb'').
+  * '''<headerName>''': the name of the header to add with the spam probability (default is ''X-MessageIsSpamProbability'').
+  * '''<ignoreLocalSender>''': true if you want to ignore messages coming from local senders (default is false).
By ''local sender'' we mean a ''return-path'' with a local server part (server listed in <servernames> in config.xml).
+  * '''<maxSize>''': the maximum size (in bytes) that a message may have to be considered spam (default is ''100000'').
+
+ The spam probability is prepended to the subject if it is greater than 0.1 (10%).
+
+ The required tables are automatically created if not already there (see sqlResources.xml).
+ The token field in both the ham and spam tables is '''case sensitive'''.
+
+ === A James config.xml example ===
+
+ Here follows an example of '''config.xml''' definitions deploying the analysis mailet:
+
+ {{{
+
+ ...
+
+ <mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
+   <repositoryPath>db://maildb</repositoryPath>
+   <maxSize>200000</maxSize>
+   <headerName>X-MessageIsSpamProbability</headerName>
+   <ignoreLocalSender>true</ignoreLocalSender>
+ </mailet>
+
+ <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
+   <name>X-MessageIsSpam</name>
+   <value>true</value>
+ </mailet>
+
+ <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
+   <processor> spam </processor>
+   <notice>Spam not accepted</notice>
+ </mailet>
+
+ ...
+
+ }}}
+
+
+
+ == BayesianAnalysisFeeder mailet ==
+
+ The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages to train the '''B''''''ayesianAnalysis''' mailet.
+
+ The new token frequencies are stored in a JDBC database.
+
+ The Bayesian database tables are updated during training to reflect the new data.
+ At the end the mail is destroyed (ghosted).
+
+ '''The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed'''.
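As a rough illustration of the attachment approach described above, the following Java sketch builds the raw text of an enveloping message that carries the original message as a message/rfc822 attachment. It is only a sketch under assumptions: the class and method names, the example.org addresses and the boundary string are hypothetical, and this page does not specify the exact MIME layout the feeder expects.

```java
// Hypothetical helper, not part of James: builds the enveloping message as plain text.
class FeederEnvelopeSketch {

    private static final String CRLF = "\r\n";

    /**
     * Wraps originalMessage (a complete message, headers included) as a
     * message/rfc822 attachment inside a new multipart/mixed message.
     */
    static String wrapAsAttachment(String from, String feederAddress,
                                   String originalMessage, String boundary) {
        // The boundary delimiter must never occur inside the attached message.
        if (originalMessage.contains("--" + boundary)) {
            throw new IllegalArgumentException("boundary occurs in the original message");
        }
        StringBuilder sb = new StringBuilder();
        // Headers of the enveloping message.
        sb.append("From: ").append(from).append(CRLF);
        sb.append("To: ").append(feederAddress).append(CRLF);
        sb.append("Subject: training sample").append(CRLF);
        sb.append("MIME-Version: 1.0").append(CRLF);
        sb.append("Content-Type: multipart/mixed; boundary=\"")
          .append(boundary).append("\"").append(CRLF);
        sb.append(CRLF);
        // First part: a short human-readable note.
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: text/plain; charset=us-ascii").append(CRLF);
        sb.append(CRLF);
        sb.append("Training sample attached.").append(CRLF);
        // Second part: the untouched original message.
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: message/rfc822").append(CRLF);
        sb.append("Content-Disposition: attachment").append(CRLF);
        sb.append(CRLF);
        sb.append(originalMessage);
        if (!originalMessage.endsWith(CRLF)) {
            sb.append(CRLF);
        }
        // Closing delimiter of the multipart body.
        sb.append("--").append(boundary).append("--").append(CRLF);
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "From: someone@example.com" + CRLF
                + "Subject: hello" + CRLF + CRLF + "Body text." + CRLF;
        // Both addresses below are placeholders, not the addresses used on this page.
        System.out.println(wrapAsAttachment(
                "user@example.org", "ham-feeder@example.org", original, "sketch-boundary-1"));
    }
}
```

In practice most mail clients produce an equivalent structure with their "forward as attachment" function, so hand-building the message is only needed for scripted training.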
+
+ After a training session, the frequency ''Corpus'' used by the '''B''''''ayesianAnalysis''' mailet must be rebuilt from the database, in order to take advantage of the new token frequencies.
+ Every 10 minutes a special thread in the '''B''''''ayesianAnalysis''' mailet checks whether the database has changed and, if so, rebuilds the ''Corpus''.
+
+ Only one message at a time is scanned (the database update activity is ''synchronized'') in order to avoid too much database locking, as thousands of rows may be updated for a single message fed.
+
+ === Initialization Parameters ===
+
+ The init parameters are as follows:
+
+  * '''<repositoryPath>''': a URL pointing to the <data-source> containing the database tables used (typically ''db://maildb'').
+  * '''<feedType>''': the type of message being fed. The possible values are ''ham'' (good messages) and ''spam''.
+  * '''<maxSize>''': the maximum size (in bytes) that a message may have to be considered spam (default is ''100000'').
+
+ === A James config.xml example ===
+
+ Here follows an example of '''config.xml''' definitions deploying the feeder mailet:
+
+ {{{
+
+ ...
+
+ <!-- "not spam" bayesian analysis feeder. -->
+ <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+   <repositoryPath> db://maildb </repositoryPath>
+   <feedType>ham</feedType>
+   <maxSize>200000</maxSize>
+ </mailet>
+
+ <!-- "spam" bayesian analysis feeder. -->
+ <mailet match="[EMAIL PROTECTED]" class="BayesianAnalysisFeeder">
+   <repositoryPath> db://maildb </repositoryPath>
+   <feedType>spam</feedType>
+   <maxSize>200000</maxSize>
+ </mailet>
+
+ ...
+
+ }}}
+
+ The previous example lets users send messages to the server, with the recipient email address indicating whether the message is ham or spam.
+
+ Using the example above, send good messages (ham, not spam) to the email address ''[EMAIL PROTECTED]'' to feed them in as ham, and send spam messages to the email address ''[EMAIL PROTECTED]'' to feed them in as spam.
+
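To make the token-frequency analysis of the BayesianAnalysis mailet more concrete, here is a minimal Java sketch of the calculation described in Paul Graham's "A Plan For Spam": a per-token spam probability derived from ham and spam counts, and the combination of the most "interesting" token probabilities into a single message probability. The class name, method names and sample counts are hypothetical, and the mailet's actual implementation (which also follows "Better Bayesian Filtering") may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of Graham's formulas; not the James implementation.
class GrahamSketch {

    /**
     * Per-token spam probability as in "A Plan For Spam".
     * hamCount/spamCount: occurrences of the token in each corpus.
     * hamMessages/spamMessages: number of messages in each corpus (must be > 0).
     */
    static double tokenProbability(int hamCount, int spamCount,
                                   int hamMessages, int spamMessages) {
        if (hamMessages <= 0 || spamMessages <= 0) {
            throw new IllegalArgumentException("both corpora must be non-empty");
        }
        double g = 2.0 * hamCount; // ham counts are doubled to reduce false positives
        double b = spamCount;
        if (g + b < 5) {
            return 0.4;            // too rare to judge: treated like an unknown token
        }
        double goodRatio = Math.min(1.0, g / hamMessages);
        double badRatio = Math.min(1.0, b / spamMessages);
        double p = badRatio / (goodRatio + badRatio);
        return Math.max(0.01, Math.min(0.99, p)); // clamp away from 0 and 1
    }

    /**
     * Combines the maxInteresting probabilities farthest from 0.5
     * (Graham uses 15) into one message probability.
     */
    static double combine(List<Double> tokenProbabilities, int maxInteresting) {
        List<Double> sorted = new ArrayList<>(tokenProbabilities);
        // Most interesting first: largest distance from the neutral value 0.5.
        sorted.sort(Comparator.comparingDouble((Double p) -> Math.abs(p - 0.5)).reversed());
        double product = 1.0;
        double inverseProduct = 1.0;
        int n = Math.min(maxInteresting, sorted.size());
        for (int i = 0; i < n; i++) {
            double p = sorted.get(i);
            product *= p;
            inverseProduct *= (1.0 - p);
        }
        return product / (product + inverseProduct);
    }

    public static void main(String[] args) {
        // Made-up counts for three tokens, with 100 ham and 100 spam messages.
        List<Double> probabilities = Arrays.asList(
                tokenProbability(0, 10, 100, 100),  // seen only in spam
                tokenProbability(10, 0, 100, 100),  // seen only in ham
                tokenProbability(1, 1, 100, 100));  // too rare to judge
        System.out.println("spam probability: " + combine(probabilities, 15));
    }
}
```

The combined value plays the role of the probability the mailet stores in the mail attribute and header, which the matchers in config.xml then compare against thresholds such as 0.90 or 0.99.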