On Fri, 13 Mar 2009, decoder wrote:
John Hardin wrote:
<kggbph.617...@localhost>
BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL
Yes this is certainly possible. Basically all the algorithm requires for the
SVM is the rules that hit and the classification (ham or spam) (actually the
rules that did not hit are fed into the SVM as well, but they are taken from
the global rules file underlying the model). The tool additionally requires
the score to evaluate FP/FN properly when testing the model,
It needs the score, and not just Y/N Spam/Ham (i.e. which corpus file
it came from)?
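
A minimal sketch of the SVM input described above, assuming each log row
has the form "<msgid> RULE_A,RULE_B,..." (that row layout, the function
names and the four-rule list are illustrative assumptions, not the actual
tool's format). Rules that did not hit are filled in as 0 from the global
rules list:

```python
def parse_log_row(row):
    """Split an assumed '<msgid> RULE_A,RULE_B' row into msgid and rules hit."""
    msgid, _, rules = row.strip().partition(" ")
    return msgid, [r for r in rules.split(",") if r]


def to_feature_vector(rules_hit, all_rules):
    """Return one 0/1 entry per rule in the global rules list, in order."""
    hit = set(rules_hit)
    return [1 if rule in hit else 0 for rule in all_rules]


# Hypothetical global rules list; the real one would come from the rules file.
all_rules = ["BAYES_99", "RAZOR2_CHECK", "URIBL_BLACK", "STOX_30"]

msgid, hits = parse_log_row("<msg1@localhost> BAYES_99,URIBL_BLACK")
vector = to_feature_vector(hits, all_rules)
label = 1  # 1 = spam, 0 = ham; taken from whichever corpus the row came from
```

The classification label is kept separate from the vector, so the same
row can be relabelled if it later turns out to be an FP or FN.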
Then an external tool could generate and maintain these files from the SA
log and the maintained training corpora, omitting FP/FN from the log data.
Yes, that's a good idea, certainly better than learning directly from the
mail which might be scattered around several mailboxes. However, how do you
want to exclude FP/FNs? The log certainly doesn't provide this information.
I was thinking you'd generate a ham file and a spam file from the log,
possibly dynamically appending rows as messages are processed. Naturally
this would contain FPs and FNs.
You'd have a routine to extract the ham file from your full ham
corpus/corpora, and likewise for spam. The assumption is that any FP or FN
would be placed into these corpora for normal Bayes training.
The tool would then combine them, omitting from the log-generated files
any msgid that appears in the training corpus files. You'd end up with one
clean spam file and one clean ham file.
I do note this would be a simpler and faster operation in a relational
database, but I don't want to throw _that_ curve into the mix quite yet.
Perl hashes might be sufficient.
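
The combining step above can be sketched with plain dictionaries keyed on
msgid (shown in Python here; Perl hashes would work the same way). This
is only a sketch under assumptions: the function name, the msgid values
and the rule strings are made up for illustration, and each input maps
msgid to the rules-hit string for that message.

```python
def build_clean_sets(log_ham, log_spam, corpus_ham, corpus_spam):
    """Combine log-derived and corpus-derived rows into clean ham/spam sets.

    Any msgid present in either training corpus is dropped from the
    log-derived rows, since the log's verdict for it may be an FP or FN.
    The corpus rows are then added under their hand-verified class.
    """
    trained = set(corpus_ham) | set(corpus_spam)
    clean_ham = {m: r for m, r in log_ham.items() if m not in trained}
    clean_spam = {m: r for m, r in log_spam.items() if m not in trained}
    clean_ham.update(corpus_ham)
    clean_spam.update(corpus_spam)
    return clean_ham, clean_spam


# <b@x> was logged as ham but is really spam (FN);
# <d@x> was logged as spam but is really ham (FP).
log_ham = {"<a@x>": "RULE_1", "<b@x>": "RULE_2"}
log_spam = {"<c@x>": "RULE_3", "<d@x>": "RULE_4"}
corpus_ham = {"<d@x>": "RULE_4"}
corpus_spam = {"<b@x>": "RULE_2"}

clean_ham, clean_spam = build_clean_sets(log_ham, log_spam,
                                         corpus_ham, corpus_spam)
```

After this, <d@x> appears only in the clean ham set and <b@x> only in the
clean spam set.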
On the other hand, having some false positives in the training data did
not spoil my results. The algorithm even predicted these correctly as
spam later on :)
Er, don't you mean it predicted them as ham (FP = ham scored as spam)? It
would be great if it was smart enough to recognize a near-boundary false
result as what it *should* have been...
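
To pin down the terminology, here is a small sketch of how a score plus
the known class yields FP/FN, using FP = ham scored as spam as above. The
function name is illustrative, and the 5.0 threshold is an assumption
standing in for whatever required_score the site actually uses.

```python
def outcome(score, is_spam, threshold=5.0):
    """Classify one result given its score and its true (corpus) class."""
    flagged = score >= threshold  # assumed: flagged when score meets threshold
    if flagged and is_spam:
        return "TP"  # spam scored as spam
    if flagged and not is_spam:
        return "FP"  # ham scored as spam
    if not flagged and is_spam:
        return "FN"  # spam scored as ham
    return "TN"      # ham scored as ham


results = [
    outcome(7.2, is_spam=False),  # ham that scored over the threshold
    outcome(3.1, is_spam=True),   # spam that scored under the threshold
    outcome(12.0, is_spam=True),
    outcome(0.4, is_spam=False),
]
```

This is why the score matters for testing: the Y/N class alone says what
a message is, not how the scoring treated it.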
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One difference between a liberal and a pickpocket is that if you
demand your money back from a pickpocket he will not question your
motives. -- William Rusher
-----------------------------------------------------------------------
Tomorrow: Albert Einstein's 130th Birthday