Thanks for the suggestions. I was able to get spamassassin to work
after all by loading the relevant rules. I loaded the default rule set
and then removed all the NN_XXX.cf files except 10_default_prefs and
23_bayes. I also added the following line:

    add_header all Bayes "_BAYES_"

so that I can extract the probability of spam according to the Bayes
algorithm.
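
For anyone who wants to pull that value out programmatically, here is a
minimal sketch. It assumes the processed message carries the value in a
header named X-Spam-Bayes and that the value is a plain decimal number;
the function name bayes_probability and the sample message are made up
for illustration.

```python
from email import message_from_string


def bayes_probability(raw_message):
    """Return the X-Spam-Bayes header value as a float, or None.

    Assumption: the header is named X-Spam-Bayes and holds a plain
    decimal number such as 0.9731.
    """
    msg = message_from_string(raw_message)
    value = msg.get("X-Spam-Bayes")
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        # Header present but not numeric; treat as missing.
        return None


# Hypothetical message, as it might look after being processed.
sample = (
    "Subject: test post\n"
    "X-Spam-Bayes: 0.9731\n"
    "\n"
    "BUY NOW\n"
)
print(bayes_probability(sample))
```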

I also tried dspam and compared the two. I used the following
methodology: I divided my data set into two parts; the first,
consisting of 90% of the posts, was used for training and the
remaining 10% was used for testing. Below is what the two programs
reported for the test posts, compared with what the forum moderators
decided for them (for both programs I treated a post as spam if the
reported probability was 55% or higher):

For the posts rejected by forum moderators:
spamassassin classified 19.0% of those as spam
dspam classified 15.0% of those as spam

For the posts approved by forum moderators:
spamassassin classified 99.3% of those as ham
dspam classified 99.7% of those as ham

As you can see, the performance of both was unsatisfactory: they
failed to identify 81% to 85% of the rejected posts. It is also
interesting to note that the rejected posts they did identify
correctly were mostly posts with formatting issues (such as posts
written in ALL CAPS, which is forbidden by forum rules) rather than
posts with issues related to the actual content (such as off-topic
posts). This of course is not surprising at all.

So the conclusion is that Bayesian spam filtering cannot be used for
this particular case.
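
In case anyone wants to repeat the comparison, the methodology above
can be sketched roughly as follows. This is not the script I ran, just
an illustration: the helper names split_posts and evaluate, the random
shuffle with a fixed seed, and the toy scores are all assumptions. Only
the 90/10 split and the 55% threshold come from the description above.

```python
import random

THRESHOLD = 0.55  # a post counts as spam at this probability or higher


def split_posts(posts, train_fraction=0.9, seed=0):
    """Shuffle the posts and split them into (training, testing) lists."""
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def evaluate(scored, threshold=THRESHOLD):
    """Compare filter output with the moderators' decisions.

    scored is a sequence of (probability, was_rejected) pairs.
    Returns (fraction of rejected posts classified as spam,
             fraction of approved posts classified as ham).
    """
    scored = list(scored)
    rejected = [p for p, was_rejected in scored if was_rejected]
    approved = [p for p, was_rejected in scored if not was_rejected]
    caught = sum(1 for p in rejected if p >= threshold)
    passed = sum(1 for p in approved if p < threshold)
    spam_rate = caught / len(rejected) if rejected else 0.0
    ham_rate = passed / len(approved) if approved else 0.0
    return spam_rate, ham_rate


# Toy data, purely illustrative: four rejected and four approved posts.
toy_scores = [
    (0.90, True), (0.20, True), (0.60, True), (0.10, True),
    (0.05, False), (0.70, False), (0.30, False), (0.00, False),
]
train, test = split_posts(range(100))
print(len(train), len(test))
print(evaluate(toy_scores))
```

The same evaluate call can be run once per program, feeding it the
probabilities each one reported for the test posts.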


On Mon, Oct 24, 2011 at 00:24, Henrik K <h...@hege.li> wrote:
> On Sun, Oct 23, 2011 at 06:35:02PM -0400, Marios Titas wrote:
>> Hi all,
>>
>> I was recently given a list of 10,000 posts from an internet forum.
>> Out of those, 9,000 had been approved by the site's moderators and the
>> remaining were rejected. I was wondering if I could use this data set
>> to play with Bayesian filtering in spamassassin.
>
> Why don't you just try something like dspam and its "DataSource document"
> option.  It should process non-email data just like that and probably work
> much more efficiently anyway.  SA Bayes is heavily tuned for email messages
> and their quirks.
>
> Of course it would be interesting if someone put up a comparison.
>
>
