Hello OpenMacNews, Wednesday, May 5, 2004, 11:36:48 AM, you wrote:
>>> Can anyone comment as to whether it makes sense to simply score these
>>> rules to "20-ish" as well? I naively am not aware of anyone ELSE
>>> sending me MIME encoded messages other than a couple of in-house
>>> scripts that I whitelist.

>> Your best bet is to run the scoring against your own collection of
>> mail and see how it does. The scores are low either because they
>> don't indicate spam that well

O> I've ~1 year's worth of Bayes data for my mail. SA is just
O> phenomenal in stopping what it should stop, including those that
O> are mime-wrapped, as mentioned ...

O> I'm just curious as to _specifically_ the MIME scoring.

Do you have your emails in a usable archive, or were they discarded
after being fed through Bayes? If you'll specify /which/ rules you're
talking about, I can let you know how they match up on my corpus here.

More immediately, you can check the statistics files in the SA release
you're running to see how those rules matched up during the final GA
runs. That will show you both spam and ham percentages.

>> , or because they're always seen in conjunction with other rules
>> that score high.

O> These scores show up as *tested* in a LOT of properly flagged spam
O> ... just with a low "score". I guess I don't completely understand
O> why/how the "in conjunction with other rules" results in low
O> scores. If a test is "always there" in SPAM, shouldn't it be
O> scored HIGH, regardless of other rules?

SA's philosophy is that spam should be flagged as spam, ham should be
flagged as ham, and it's many, many, many times more important for ham
to be correctly flagged than for spam to be correctly flagged.
Therefore, if you have two rules that each hit lots and lots of spam,
and one of them also hits one ham, that rule will score significantly
lower than the other.

Rules have two attributes in this evaluation: how likely they are to
push spam over the threshold and be accurately flagged, and how likely
they are to push ham over the threshold and be inaccurately flagged.
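The "push over the threshold" wording reflects SA's additive model:
each matched rule contributes its score, and a message is flagged as
spam once the total reaches the required threshold (5.0 by default).
A minimal sketch, with hypothetical rule scores:

```python
# Minimal sketch of SpamAssassin's additive scoring model.
# The rule scores passed in below are hypothetical, not real SA rules.

REQUIRED_HITS = 5.0  # SA's default spam threshold ("required_hits")

def is_flagged_as_spam(matched_rule_scores):
    """A message is spam once its summed rule scores reach the threshold."""
    return sum(matched_rule_scores) >= REQUIRED_HITS

# Why a rule that also hits ham must score low: if it scored high, a
# single false match could push a legitimate message over the
# threshold almost on its own.
print(is_flagged_as_spam([1.2, 0.8, 3.5]))  # 5.5 -> True
print(is_flagged_as_spam([1.2, 0.8]))       # 2.0 -> False
```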
The latter is avoided like the plague. The rules you're thinking about
may be darn good rules, but they matched ham often enough in the
massive GA corpus that their scores were lowered in favor of other
rules that matched less ham.

>> I'm personally of the belief that any rule that never shows up in
>> ham should have a minimum score of 1.0 -- especially ones that
>> detect broken ratware.

O> Thanks!

I don't hold with that -- it depends on how confident you are that the
rule will /never/ match ham. I pay a lot of attention to the number of
spam and ham hit. A normal rule (one which does not absolutely prove
the email was generated by spamware or ratware, or identify the
spammer's system or web site) which matches 100 spam in my corpus (and
no ham) gets a decent score. That score increases until it reaches a
maximum of 1/3 of the required hits at 200 spam (I raise that to 40%
of the required hits when a rule hits 1000 spam and no ham). Likewise,
the score decreases for rules that hit fewer spam, so a rule that hits
only a handful of spam (and no ham) gets a very minimal score.

I figure that if a rule matches 100-200 spam or more, and no ham, it's
likely a good spam-catching rule. If a rule matches only 5-10 spam,
and no ham, it's very possible that it does match ham and I simply
haven't seen that ham yet.

Bob Menschel

O> richard

--
Best regards,
Robert                                 mailto:[EMAIL PROTECTED]
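That heuristic might be sketched roughly as follows. This is only one
reading of the description above: the linear ramp below 200 spam hits
is an assumption (the text only says scores decrease for rules with
fewer hits), and the function name and the handling of ham hits are
invented for illustration.

```python
# Rough sketch of the hand-scoring heuristic described above.
# Scores are a fraction of the spam threshold ("required_hits", 5.0
# by default). suggested_score() is a hypothetical helper, not SA code.

REQUIRED_HITS = 5.0

def suggested_score(spam_hits, ham_hits):
    """Suggest a score for a 'normal' rule from corpus hit counts."""
    if ham_hits > 0:
        return None  # any ham hit needs separate, more careful judgment
    if spam_hits >= 1000:
        cap = 0.40          # 40% of required hits at 1000+ spam, no ham
    elif spam_hits >= 200:
        cap = 1.0 / 3.0     # maximum of 1/3 of required hits at 200 spam
    else:
        # assumed linear ramp: a rule hitting only a handful of spam
        # (and no ham) gets a very minimal score
        cap = (1.0 / 3.0) * (spam_hits / 200.0)
    return round(cap * REQUIRED_HITS, 2)
```

For example, under these assumptions a no-ham rule with 200 spam hits
would be capped around 1.67, one with 1000 spam hits around 2.0, and
one with only 10 spam hits near 0.08.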
