On 01/09/2015 01:23 AM, Adam Katz wrote:
Ran these against my corpus. Here are the worst performers (lots in
common with RW's complaints):
SPAM%   HAM%    S/O     NAME
0.013   0.153   0.080   __RULEGEN_PHISH_BLR6YY
0.006   0.286   0.022   __RULEGEN_PHISH_0ATBRI
0.008   0.334   0.023   __RULEGEN_PHISH_L3I0Z5
0.002   0.300   0.006   __RULEGEN_PHISH_LGYG7Q
0.017   1.387   0.012   __RULEGEN_PHISH_QVS6GE
0.045   2.490   0.018   __RULEGEN_PHISH_UNQ4VP
0.027   2.011   0.013   __RULEGEN_PHISH_B9HL3A
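
For anyone reading the S/O column: a minimal sketch, assuming S/O here is the usual spam/(spam+ham) hit ratio, so values near 0 mean the rule fires almost only on ham. The function name is my own, and the figures are the rounded percentages from the table, so small rows can differ from the listed S/O in the last digit.

```python
def s_o(spam_pct, ham_pct):
    """Assumed definition: share of a rule's hits that land on spam."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # rule never fired; the ratio is undefined
    return spam_pct / total

# (SPAM%, HAM%, NAME) taken from three rows of the table above
rows = [
    (0.045, 2.490, "__RULEGEN_PHISH_UNQ4VP"),
    (0.027, 2.011, "__RULEGEN_PHISH_B9HL3A"),
    (0.017, 1.387, "__RULEGEN_PHISH_QVS6GE"),
]

for spam_pct, ham_pct, name in rows:
    print(f"{s_o(spam_pct, ham_pct):.3f}  {name}")
```

A rule meant to catch phishing should sit near 1.0; every rule listed is below 0.1.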
body __RULEGEN_PHISH_UNQ4VP / may contain information that is /
body __RULEGEN_PHISH_QVS6GE / or entity to which it is addressed/
body __RULEGEN_PHISH_B9HL3A /The information contained in this /
body __RULEGEN_PHISH_0ATBRI / it is addressed\. If you are n/
body __RULEGEN_PHISH_LGYG7Q / you have received it in error. /
body __RULEGEN_PHISH_BLR6YY /uthorised and regulated by the /
body __RULEGEN_PHISH_L3I0Z5 / is intended solely for the ..d/
A large number of the FPs come from Paypal and similar services.
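
The patterns read like fragments of ordinary confidentiality footers, which would explain the ham hits. A rough sketch of that, using Python's re module as a stand-in for SpamAssassin's Perl regex matching; the sample footer is invented for illustration and is not taken from any corpus.

```python
import re

# Patterns copied from the rule definitions quoted above.
RULES = {
    "__RULEGEN_PHISH_UNQ4VP": r" may contain information that is ",
    "__RULEGEN_PHISH_QVS6GE": r" or entity to which it is addressed",
    "__RULEGEN_PHISH_B9HL3A": r"The information contained in this ",
    "__RULEGEN_PHISH_0ATBRI": r" it is addressed\. If you are n",
    "__RULEGEN_PHISH_LGYG7Q": r" you have received it in error. ",
    "__RULEGEN_PHISH_BLR6YY": r"uthorised and regulated by the ",
    "__RULEGEN_PHISH_L3I0Z5": r" is intended solely for the ..d",
}

# Hypothetical sample: a generic legal footer of the kind many senders append.
SAMPLE = (
    "The information contained in this email is intended solely for the "
    "addressee. It is confidential and may contain information that is "
    "privileged. If you have received it in error, please notify the sender."
)

def matching_rules(text):
    """Return the names of all rules whose pattern occurs in text."""
    return sorted(name for name, pat in RULES.items() if re.search(pat, text))

hits = matching_rules(SAMPLE)
print(hits)
```

Note that the unescaped "." in LGYG7Q and the ".." in L3I0Z5 match any character, so those two are looser than they look.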
Agreed, the rules are not close to ideal.
The spam corpus is ancient, the ham corpus is too small.
Even controlling for those, I haven't found the phishing ruleset useful
at all. The fraud rules do have limited utility.
Agreed - blame bad & stale data.
What relationship does this have to the 10+ year-old SARE stuff?
I was part of the SARE group, and saved the rules (for historical
reasons) to SF before the web site was shut down for good.
As I don't have the means to set up an SA update channel, putting the
RULEGEN rules on SF was the only option I had left.