Re: Experimental Plugin: MetaSVM

John Hardin Fri, 13 Mar 2009 15:40:50 -0700

On Fri, 13 Mar 2009, decoder wrote:

John Hardin wrote:
 <kggbph.617...@localhost>
BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL
Yes this is certainly possible. Basically all the algorithm requires for the SVM is the rules that hit and the classification (ham or spam) (actually the rules that did not hit are fed into the SVM as well, but they are taken from the global rules file underlying the model). The tool additionally requires the score to evaluate FP/FN properly when testing the model,
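As a rough illustration of the input described above, the sketch below turns the list of rules that hit into a 0/1 vector aligned with a global rules list, so rules that did not hit appear as zeros. The function name, the four-rule list, and the 1 = spam label encoding are assumptions made for the example, not MetaSVM's actual format.

```python
def feature_vector(hit_rules, all_rules):
    """Return a 0/1 list aligned with all_rules: 1 if the rule hit, else 0."""
    hits = set(hit_rules)
    return [1 if rule in hits else 0 for rule in all_rules]


# Hypothetical global rules list; a real one would come from the rules file
# underlying the model.
ALL_RULES = ["BAYES_99", "FORGED_RCVD_HELO", "RAZOR2_CHECK", "URIBL_BLACK"]

# Rules that hit for one message, as a comma-separated field.
logged = "BAYES_99,RAZOR2_CHECK,URIBL_BLACK"

vector = feature_vector(logged.split(","), ALL_RULES)
label = 1  # assumed encoding: 1 = spam, 0 = ham
```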

It needs the score, and not just Y/N Spam/Ham (i.e. from which corpus file it came)?

 Then an external tool could generate and maintain these files from the SA
 log and the maintained training corpora, omitting FP/FN from the log data.
Yes, that's a good idea, certainly better than learning directly from the mail which might be scattered around several mailboxes. However, how do you want to exclude FP/FNs? The log certainly doesn't provide this information.

I was thinking you'd generate a ham file and a spam file from the log, possibly dynamically appending rows as messages are processed. Naturally this would contain FPs and FNs.

You'd have a routine to extract the ham file from your full ham corpus/corpora, and likewise for spam. The assumption is any FP or FN would be placed into these corpora for normal bayes training.

The tool would then combine them, omitting from the log-generated files any msgid that appears in the training corpora files. You'd end up with one clean spam file and one clean ham file.

I do note this would be a simpler and faster operation in a relational database, but I don't want to throw _that_ curve into the mix quite yet. Perl hashes might be sufficient.
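A minimal sketch of the combining step described above, with a Python set standing in for the Perl hash. The row layout (a dict with "msgid" and "rules" keys) and the function names are hypothetical; the thread does not define a file format.

```python
def clean_log_rows(log_rows, corpus_msgids):
    """Drop log rows whose msgid already appears in a training corpus."""
    return [row for row in log_rows if row["msgid"] not in corpus_msgids]


def build_clean_files(log_ham, log_spam, corpus_ham, corpus_spam):
    """Combine corpus rows with log rows, letting the corpora win on conflicts."""
    # Any msgid present in either corpus has been classified by hand,
    # so the log's verdict for it is discarded.
    known = {r["msgid"] for r in corpus_ham} | {r["msgid"] for r in corpus_spam}
    clean_ham = corpus_ham + clean_log_rows(log_ham, known)
    clean_spam = corpus_spam + clean_log_rows(log_spam, known)
    return clean_ham, clean_spam


# Hypothetical data: <b@x> was logged as spam but is really ham (an FP),
# so it has been placed in the ham corpus.
log_ham = [{"msgid": "<c@x>", "rules": []}]
log_spam = [
    {"msgid": "<a@x>", "rules": ["BAYES_99", "URIBL_BLACK"]},
    {"msgid": "<b@x>", "rules": ["BAYES_99"]},
]
corpus_ham = [{"msgid": "<b@x>", "rules": ["BAYES_99"]}]
corpus_spam = []

clean_ham, clean_spam = build_clean_files(log_ham, log_spam, corpus_ham, corpus_spam)
```

With this data the FP <b@x> ends up only in the clean ham file, taken from the corpus, and is dropped from the log-derived spam rows.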

On the other side, having some false positives in the training data did not spoil my results. The algorithm did even predict these correctly as spam later on :)

Er, don't you mean it predicted them as ham (FP = ham scored as spam)? It would be great if it was smart enough to recognize a near-boundary false result as what it *should* have been...
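To pin down the terminology, here is a small sketch that labels a message as TP, FP, TN or FN from its true class (which corpus it came from) and its score. The 5.0 threshold is SpamAssassin's default required_score and is only an assumption about the setup; the function name is hypothetical.

```python
def outcome(is_spam, score, threshold=5.0):
    """Classify one result; is_spam is the true class, score is the SA score."""
    flagged = score >= threshold  # flagged as spam at or above the threshold
    if flagged:
        return "TP" if is_spam else "FP"  # FP = ham scored as spam
    return "FN" if is_spam else "TN"      # FN = spam scored as ham


results = [
    outcome(is_spam=False, score=6.2),  # ham scored above the threshold
    outcome(is_spam=True, score=3.1),   # spam scored below the threshold
    outcome(is_spam=True, score=12.0),
    outcome(is_spam=False, score=0.4),
]
```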


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  One difference between a liberal and a pickpocket is that if you
  demand your money back from a pickpocket he will not question your
  motives.                                          -- William Rusher
-----------------------------------------------------------------------
 Tomorrow: Albert Einstein's 130th Birthday
