On 12/20/18 8:34 PM, Grant Taylor wrote:
I'm going back through and analyzing how I'm extracting data and trying to satisfactorily explain some oddities.

Here's how the dots in the user parts of the 16,528 unique addresses found in 244,921 messages break down:

  13,277               (no dots 80.3%)
   2,936 .             ( 1 dot  17.7%)
     281 ..            ( 2 dots  1.7%)
      29 ...           ( 3 dots  0.2%)
       3 ....          ( 4 dots  0.0%)
       1 .....         ( 5 dots  0.0%)
       1 ...........   (11 dots  0.0%)
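
For anyone who wants to reproduce a breakdown like the one above, here is a minimal Python sketch. It is not the extraction script that produced these numbers; the function name `dot_histogram` and the sample addresses are made up for illustration, and it assumes the addresses have already been pulled out of the messages.

```python
from collections import Counter


def dot_histogram(addresses):
    """Count unique addresses by the number of dots in the user part."""
    counts = Counter()
    for addr in set(addresses):  # de-duplicate before counting
        # Everything before the last "@" is treated as the user part.
        local, _, _ = addr.rpartition("@")
        counts[local.count(".")] += 1
    return counts


# Hypothetical sample data, not taken from the real corpus.
sample = [
    "alice@example.com",
    "a.b@example.com",
    "a.b@example.com",
    "x.y.z@example.net",
]

hist = dot_histogram(sample)
total = sum(hist.values())
for dots in sorted(hist):
    share = hist[dots] / total
    print(f"{hist[dots]:>7,} {'.' * dots:<12} ({dots} dots {share:.1%})")
```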

So, in light of this information, I would be willing to concede that 3 or more dots is possibly an indicator of spam.

My previous log methodology would add the following to the spam score of messages with 3 or more dots. (Assuming 3 is the base, i.e. 3 dots is the number at which we start adding to the spam score.)

 3 dots = 1
 4 dots = 1.26
 5 dots = 1.46
11 dots = 2.18

Assuming 2 dots are allowed and 2 is the base:

 3 dots = 1.58
 4 dots = 2.00
 5 dots = 2.32
11 dots = 3.46

I think I would be comfortable blindly adding log$Base($numberOfDots), i.e. the base-$Base logarithm of the dot count, to the spam score whenever $numberOfDots > $Base. I don't even see a need to mess with a meta rule.
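
As a rough illustration of that formula, here is a small Python sketch. The function name `dot_score` is hypothetical, and this is not an actual SpamAssassin rule or plugin. It uses the strict "greater than the base" condition as stated above; note that the "3 dots = 1" row in the base-3 list would need ">=" instead.

```python
import math


def dot_score(num_dots, base):
    """Return log_base(num_dots) when num_dots exceeds base, else 0."""
    if num_dots <= base:
        return 0.0  # at or below the allowed number of dots: no penalty
    return math.log(num_dots, base)


# Print the score for the dot counts seen in the breakdown, for both bases.
for base in (3, 2):
    print(f"base {base}:")
    for dots in (3, 4, 5, 11):
        print(f"  {dots:>2} dots = {dot_score(dots, base):.2f}")
```

The score grows slowly with the dot count, so an address with 11 dots is penalized more than one with 4, but not by so much that the rule alone decides the outcome.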



--
Grant. . . .
unix || die

