On Fri, 13 Mar 2009, decoder wrote:

John Hardin wrote:

 <kggbph.617...@localhost>
BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL

Yes, this is certainly possible. Basically, all the algorithm requires for the SVM is the rules that hit and the classification (ham or spam). (Actually, the rules that did not hit are fed into the SVM as well, but they are taken from the global rules file underlying the model.) The tool additionally requires the score to evaluate FP/FN properly when testing the model,

It needs the score, and not just Y/N Spam/Ham (i.e. which corpus file it came from)?
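
For illustration, the SVM input described above (the rules that hit, the rules that did not hit as taken from a global rule list, and the ham/spam label) could be encoded roughly as follows. This is only a sketch in Python: the short rule list, the function names, the comma-separated rules field and the +1/-1 labels are assumptions, not the actual tool's format.

```python
# Hypothetical stand-in for the global rules file underlying the model.
GLOBAL_RULES = [
    "BAYES_99",
    "FORGED_RCVD_HELO",
    "RAZOR2_CHECK",
    "URIBL_BLACK",
    "ALL_TRUSTED",
]


def to_feature_vector(hit_rules, global_rules):
    """Return one 0/1 entry per global rule: 1 if it hit, 0 if it did not.

    Rules that hit but are missing from global_rules are silently ignored.
    """
    hits = set(hit_rules)
    return [1 if rule in hits else 0 for rule in global_rules]


def make_example(rules_field, is_spam, global_rules=GLOBAL_RULES):
    """Turn a comma-separated rules field plus a label into (vector, label)."""
    hit_rules = [r.strip() for r in rules_field.split(",") if r.strip()]
    label = 1 if is_spam else -1  # assumed convention: +1 spam, -1 ham
    return to_feature_vector(hit_rules, global_rules), label


if __name__ == "__main__":
    vector, label = make_example("BAYES_99,URIBL_BLACK", is_spam=True)
    print(vector, label)
```

Note that nothing here uses the score; as far as the description goes, the score would only matter when evaluating FP/FN on a tested model.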

 Then an external tool could generate and maintain these files from the SA
 log and the maintained training corpora, omitting FP/FN from the log data.

Yes, that's a good idea, certainly better than learning directly from the mail which might be scattered around several mailboxes. However, how do you want to exclude FP/FNs? The log certainly doesn't provide this information.

I was thinking you'd generate a ham file and a spam file from the log, possibly dynamically appending rows as messages are processed. Naturally this would contain FPs and FNs.

You'd have a routine to extract the ham file from your full ham corpus/corpora, and likewise for spam. The assumption is that any FP or FN would be placed into these corpora for normal bayes training.

The tool would then combine them, omitting from the log-generated files any msgid that appears in the training corpus files. You'd end up with one clean spam file and one clean ham file.

I do note this would be a simpler and faster operation in a relational database, but I don't want to throw _that_ curve into the mix quite yet. Perl hashes might be sufficient.
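
A minimal sketch of that combining step, in Python rather than Perl (a dict plays the role of a Perl hash). The tab-separated "msgid<TAB>rules" row format, the function names and the sample msgids are all hypothetical; the only idea taken from the description above is that any msgid present in the training corpus files overrides its row in the log-generated files.

```python
def parse_rows(lines):
    """Parse hypothetical 'msgid<TAB>rules' rows into {msgid: rules}."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        msgid, _, rules = line.partition("\t")
        rows[msgid] = rules
    return rows


def build_clean_sets(log_ham, log_spam, corpus_ham, corpus_spam):
    """Combine log-derived and corpus-derived rows into clean ham/spam sets.

    Any msgid found in either training corpus is dropped from BOTH
    log-derived sets, since the log may have filed it on the wrong side
    (an FP or FN). The corpus rows are then added on the correct side.
    """
    corrected = set(corpus_ham) | set(corpus_spam)
    clean_ham = {m: r for m, r in log_ham.items() if m not in corrected}
    clean_spam = {m: r for m, r in log_spam.items() if m not in corrected}
    clean_ham.update(corpus_ham)
    clean_spam.update(corpus_spam)
    return clean_ham, clean_spam


if __name__ == "__main__":
    # <fn1@example> is spam the log scored as ham; <fp1@example> is the reverse.
    log_ham = parse_rows(["<a@example>\tALL_TRUSTED", "<fn1@example>\tRAZOR2_CHECK"])
    log_spam = parse_rows(["<b@example>\tBAYES_99,URIBL_BLACK", "<fp1@example>\tBAYES_99"])
    corpus_ham = parse_rows(["<fp1@example>\tBAYES_99"])
    corpus_spam = parse_rows(["<fn1@example>\tRAZOR2_CHECK"])
    ham, spam = build_clean_sets(log_ham, log_spam, corpus_ham, corpus_spam)
    print(sorted(ham), sorted(spam))
```

Keying on msgid assumes msgids are unique and present in both the log and the corpus messages, which may not hold for all mail.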

On the other hand, having some false positives in the training data did not spoil my results. The algorithm even predicted these correctly as spam later on :)

Er, don't you mean it predicted them as ham (FP = ham scored as spam)? It would be great if it was smart enough to recognize a near-boundary false result as what it *should* have been...

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  One difference between a liberal and a pickpocket is that if you
  demand your money back from a pickpocket he will not question your
  motives.                                          -- William Rusher
-----------------------------------------------------------------------
 Tomorrow: Albert Einstein's 130th Birthday
