http://bugzilla.spamassassin.org/show_bug.cgi?id=2129

------- Additional Comments From [EMAIL PROTECTED]  2004-03-13 17:16 -------
Subject: Re:  Bayes tweaks to test 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


OK, here are the results.

First pass:

base: current SVN

bug3118: with Henry's fix for bug 3118.  In order to test this,
I used an unbalanced corpus of 39987 ham and 23337 spam.

decomp: using "decomposing" tokens: namely, if the token "Foo!"
appears, decompose that into "Foo!", "Foo", "foo!" and "foo".
In other words, add duplicate tokens with non-alphanumerics
stripped and case folded.
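
As a rough illustration of the "decomp" idea (not the actual tokenizer
code), here is a minimal Python sketch.  The function name decompose() is
made up for this example, and it assumes ASCII-only stripping and
lowercasing, which may not match what the real change does.

```python
import re

def decompose(token):
    """Return the token plus its stripped and lowercased variants.

    Hypothetical helper: keeps the original token first and drops
    duplicates and empty strings, preserving order.
    """
    # Assumption: "nonalphanumerics" means anything outside [A-Za-z0-9].
    stripped = re.sub(r"[^A-Za-z0-9]", "", token)
    candidates = [token, stripped, token.lower(), stripped.lower()]
    variants = []
    for candidate in candidates:
        if candidate and candidate not in variants:
            variants.append(candidate)
    return variants

if __name__ == "__main__":
    print(decompose("Foo!"))
```

A token that is already lowercase and alphanumeric comes back unchanged as
a single entry, so the extra tokens only appear where there is something
to strip or fold.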

dhm1: "dual header map" variant 1: Dan's first suggestion above;
mapping "In-Reply-To" and "Message-Id" tokens into a shared
token, so that a ref to a previously-learned Message-Id in the
IRT header will be a hit.

dhm2: similar for From, To and CC headers

dhm3: similar for X-Mailer and User-Agent headers
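
To make the three "dual header map" variants concrete, here is a small
Python sketch.  The token format ("header:word"), the shared names
("*MI", "*Ad", "*UA") and the function header_tokens() are all invented
for this example; only the header groupings come from the descriptions
above.  It emits the normal per-header token plus a duplicate under the
shared name.

```python
# Hypothetical shared names; only the groupings are from the message.
SHARED_HEADER = {
    "in-reply-to": "*MI",   # dhm1
    "message-id": "*MI",    # dhm1
    "from": "*Ad",          # dhm2
    "to": "*Ad",            # dhm2
    "cc": "*Ad",            # dhm2
    "x-mailer": "*UA",      # dhm3
    "user-agent": "*UA",    # dhm3
}

def header_tokens(name, value):
    """Tokenize one header, adding duplicate tokens under a shared name."""
    key = name.lower()
    words = value.split()
    tokens = [f"{key}:{word}" for word in words]
    shared = SHARED_HEADER.get(key)
    if shared is not None:
        # Duplicate copies, so both headers in a group hit the same token.
        tokens.extend(f"{shared}:{word}" for word in words)
    return tokens

if __name__ == "__main__":
    learned = set(header_tokens("Message-Id", "<abc@example.com>"))
    reply = set(header_tokens("In-Reply-To", "<abc@example.com>"))
    print(learned & reply)
```

With this shape, a Message-Id learned from one mail and an In-Reply-To
referring to it in a later mail share exactly one token, which is the
"hit" dhm1 is after.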

Then I threw in a couple of retests.  Some of our old tokenizer
tweaks may be smelling a little off by this stage, so they need
a test.

ignmid: ignore Message-Id headers -- just testing this out, as
it's a large source of hapaxes.

Results:

base:    0.30/0.70  fp     3 fn   360 uh   193 us  3952    c 804.50
bug3118: 0.30/0.70  fp     2 fn   336 uh   207 us  4080    c 784.70
decomp:  0.30/0.70  fp     1 fn   324 uh   187 us  3981    c 750.80
dhm1:    0.30/0.70  fp     3 fn   344 uh   220 us  3867    c 782.70
dhm2:    0.30/0.70  fp     3 fn   343 uh   224 us  3709    c 766.30
dhm3:    0.30/0.70  fp     4 fn   342 uh   206 us  3886    c 791.20
ignmid:  0.30/0.70  fp     1 fn   383 uh   184 us  4020    c 813.40

(Don't forget -- compare all of these with "base", not with each
other.  They're all complementary so far.)

Clearly decomp is a *big* win, by far: the lowest cost of the lot, with
fewer FPs and FNs than base.  "ignmid" is not so hot, as there's a lot of
missed spam as a result (383 FNs vs. 360).  "bug3118" looks good overall.
dhm1 and dhm2 seem good; dhm3 is borderline due to the new FP.
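
For anyone reading the "c" column: the figures above are all reproduced by
10 x fp + fn + 0.1 x (uh + us), reading fp/fn as false positives and false
negatives and uh/us as unsure ham and unsure spam.  The weights in this
Python sketch are inferred from the numbers in this message, not copied
from the test scripts, so treat them as an assumption.

```python
# Weights inferred from the results table, not taken from the scripts.
FP_WEIGHT = 10.0
FN_WEIGHT = 1.0
UNSURE_WEIGHT = 0.1

def cost(fp, fn, uh, us):
    """Weighted cost of one test run."""
    return FP_WEIGHT * fp + FN_WEIGHT * fn + UNSURE_WEIGHT * (uh + us)

# (fp, fn, uh, us) for the first pass, as reported above.
FIRST_PASS = {
    "base":    (3, 360, 193, 3952),
    "bug3118": (2, 336, 207, 4080),
    "decomp":  (1, 324, 187, 3981),
    "dhm1":    (3, 344, 220, 3867),
    "dhm2":    (3, 343, 224, 3709),
    "dhm3":    (4, 342, 206, 3886),
    "ignmid":  (1, 383, 184, 4020),
}

def deltas_vs_base(results):
    """Cost difference of each variant against "base" (negative is better)."""
    base = cost(*results["base"])
    return {
        name: round(cost(*row) - base, 2)
        for name, row in results.items()
        if name != "base"
    }

if __name__ == "__main__":
    for name, delta in sorted(deltas_vs_base(FIRST_PASS).items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {delta:+.2f}")
```

Comparing each variant against "base" this way, rather than against the
other variants, matches how the table is meant to be read.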


Test set 2:

try1: bug3118 + decomp + dhm1 + dhm2 -- i.e. the best of the previous run

try2: bug3118 + decomp + dhm1 + dhm2 + dhm3 -- giving dhm3 a second
chance.

hdrs_no_num: try1, with an extra tweak: NO_NUMERIC_IN_HEADERS
is turned on.  I suspect the decomposed numeric tokens (e.g.
"8139" -> "N:NNNN"), added to catch patterns, are no longer
working well.

no_num: same as hdrs_no_num, but also with no numeric tokens
in the message body either.
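
For reference, the numeric pattern tokens being switched off look roughly
like this.  This is a Python sketch only: numeric_pattern() is a made-up
name, and it assumes the mapping applies to all-digit tokens, which is all
the "8139" example shows.

```python
import re

def numeric_pattern(token):
    """Map an all-digit token to its length pattern, e.g. "N:NNNN".

    Returns None for anything that is not purely ASCII digits
    (an assumption; the real tokenizer may treat mixed tokens too).
    """
    if re.fullmatch(r"[0-9]+", token):
        return "N:" + "N" * len(token)
    return None

if __name__ == "__main__":
    for tok in ("8139", "42", "foo"):
        print(tok, "->", numeric_pattern(tok))
```

Every four-digit number collapses to the same token here, which is the
sort of pattern matching the two no-numerics variants turn off.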


Results:

hdrs_no_num: 0.30/0.70  fp     1 fn   266 uh   269 us  3804    c 683.30
no_num:      0.30/0.70  fp     1 fn   268 uh   260 us  3854    c 689.40
try1:        0.30/0.70  fp     2 fn   283 uh   238 us  3785    c 705.30
try2:        0.30/0.70  fp     2 fn   277 uh   251 us  3745    c 696.60


This time, try2 is looking good -- a bit better than try1 on both FNs
and cost.

Also, clearly, dropping numeric tokens is now a good idea;
both variants of that are a clear improvement.


Test set 3:

combined: try2 + no_num.

combined:    0.30/0.70  fp     2 fn   260 uh   267 us  3826    c 689.30

So that's what's gone in as r9447.


I tried Dan's suggestion of looking up the dual-header-map tokens
instead of making dupe copies of them.  Unfortunately it didn't
work (the numbers were bad), so I dropped that.  Combining them into
one duplicate header gets better accuracy for some reason.

I also updated the Bayes 10-fold cross-validation scripts to work again
with current SVN, and wrote quite a bit more doco on how to run them.

Note that "sa-learn --dump --dbpath" is required for these to work, so
anyone who removes that will have to fix them ;)

Next: I'll see if I can figure out a good invisible-text tweak.  I
may have to add a new rendering API for that, specifically for Bayes.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAU7JqQTcbUG5Y7woRArjoAJwN2B3HltmR1VS1XIEQUtg34+CmNwCg4eXu
q0wcV3wSfPy2VRep1BklaZQ=
=/lK1
-----END PGP SIGNATURE-----
