On Fri, 16 May 2014 07:22:56 -0400
"David F. Skoll" <d...@roaringpenguin.com> wrote:

James> Is there any way to limit Bayes content checking to only the
James> first X characters of the message body?  I ask this because it is
James> clear that the spam messages getting through contain text meant
James> to poison the tests but this gibberish always trails the main
James> message and is separated by a large white space in most cases.

David> In my experience, trying to be too clever with Bayes is
David> counter-productive.  Those Bayes-poisoning attacks rarely work on
David> a well-trained corpus.  You probably just need more training for
David> Bayes to figure out what's happening.

In the last ten days or so, I have seen a marked increase in false
negatives (FNs), usually with Bayes values in the 50s and 60s.  By
marked, I mean I do pretty much nothing but adjust my various ad-hoc
rules to keep from being flooded ;-\

On close inspection, I see that the appended hash-busting garbage is
(faux) technical computing talk instead of the usual cookbooks or
classical literature :-p  That is, scrambled Stack Overflow discussions
and the like.  And of course that is what most of my ham is about, so
it is no surprise that Bayes gets confused.
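
To make the truncation idea James raised concrete, here is a rough
sketch of cutting a body at the first large vertical gap.  This is a
standalone Python illustration, not anything SpamAssassin provides as
far as I know; the function name truncate_body, the 5-blank-line
threshold and the 2000-character cap are all made-up assumptions.

```python
import re


def truncate_body(body: str, max_chars: int = 2000, min_blank_lines: int = 5) -> str:
    """Return the text before the first large vertical gap, capped at max_chars.

    Hypothetical helper: "large white space" is taken to mean
    min_blank_lines or more consecutive blank (whitespace-only) lines.
    """
    # A newline ending the last real line, followed by N or more blank lines.
    gap = re.compile(r"\n(?:[ \t]*\n){%d,}" % min_blank_lines)
    match = gap.search(body)
    head = body[:match.start()] if match else body
    return head[:max_chars]


if __name__ == "__main__":
    sample = "Cheap pills, click here\n" + "\n" * 6 + "git rebase segfault malloc"
    print(repr(truncate_body(sample)))
```

A spammer who notices such a filter could simply drop the gap, so at
best this buys time.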

I include a magic dump just in case something is wrong with my
training.  But if not, isn't this a situation where something like
James' suggestion would help?

 [4+0]~$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       5593          0  non-token data: nspam
0.000          0       6190          0  non-token data: nham
0.000          0     148413          0  non-token data: ntokens
0.000          0 1384366530          0  non-token data: oldest atime
0.000          0 1400253567          0  non-token data: newest atime
0.000          0 1400253356          0  non-token data: last journal sync atime
0.000          0 1395423790          0  non-token data: last expiry atime
0.000          0   11059200          0  non-token data: last expire atime delta
0.000          0      25914          0  non-token data: last expire reduction count
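
For anyone who wants to read the numbers above without doing epoch
arithmetic by hand, a small Python sketch follows.  It assumes the
atime fields are Unix epoch seconds and that the delta is in seconds;
the values are copied by hand from the dump, and the dict and helper
names are only for illustration.

```python
from datetime import datetime, timezone

# Values copied by hand from the "sa-learn --dump magic" output above.
magic = {
    "nspam": 5593,
    "nham": 6190,
    "ntokens": 148413,
    "oldest atime": 1384366530,
    "newest atime": 1400253567,
    "last expiry atime": 1395423790,
    "last expire atime delta": 11059200,
}


def as_utc(epoch: int) -> datetime:
    """Interpret an atime as Unix epoch seconds (an assumption) in UTC."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# Show the three timestamps as readable dates.
for key in ("oldest atime", "newest atime", "last expiry atime"):
    print(f"{key:18} {as_utc(magic[key]):%Y-%m-%d %H:%M} UTC")

# Express the expiry delta in days and the spam share of trained messages.
delta_days = magic["last expire atime delta"] / 86400
spam_share = magic["nspam"] / (magic["nspam"] + magic["nham"])
print(f"expire atime delta: {delta_days:.0f} days")
print(f"spam share of trained messages: {spam_share:.1%}")
```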

-- 
Please *no* private copies of mailing list or newsgroup messages.
