seekrules over French spam (was Re: [Rule Set proposal] French Rules)

John GALLET Mon, 23 Jun 2008 11:55:18 -0700

Hi,

You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
the patterns; you can then write rules based on these.

I did so, and the results are interesting, though I do not really know where to go from there. If I take the first 50 "best" patterns and strip off the obvious stand-alone words and sure-to-be-false-positive expressions, here is what I get: (sorry for non French speakers, explanation below)


 RATIO   SPAM%    HAM%   DATA
 1.000   9.375   0.000  /Pour ne plus recevoir /
 1.000   6.875   0.000  /6 janvier 1978 relative /
 1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
 1.000   5.625   0.000  /s donnÃ©es nominatives /
 1.000   5.625   0.000  / ce message, cliquez-ici/
 1.000   5.625   0.000  / vous désinscrire de /
 1.000   5.000   0.000  /Conformément à l/
 1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
 1.000   5.000   0.000  /un droit d\'accès/
 1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
 1.000   4.375   0.000  /ment à l\'article 34 de la loi /
 1.000   3.750   0.000  /ous désinscrire de notre /
 1.000   3.750   0.000  /es nominatives vous concernant\. /
 1.000   3.750   0.000  / LibertÃ©s du 6 /
 1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /

As you can see, charset encoding makes a mess: the same phrase shows up more than once under different encodings, so many patterns must be regrouped.
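One possible way to do that regrouping is sketched below in Python. It assumes the duplicates come from UTF-8 bytes being read as Latin-1 (which is what "donnÃ©es" looks like), and that each output row has the RATIO / SPAM% / HAM% / pattern layout shown above. The names `fix_mojibake` and `regroup` are made up for this example, and the second sample row is invented to show a pair collapsing; this is not part of the seekrules scripts.

```python
import re
from collections import defaultdict

# One output row: RATIO, SPAM%, HAM%, then the /pattern/ (assumed layout).
ROW = re.compile(r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+/(.*)/\s*$")

def fix_mojibake(text):
    """Undo UTF-8 bytes that were decoded as Latin-1.

    If the round trip fails, the text was probably fine already,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def regroup(lines):
    """Map each repaired pattern to the SPAM% values of all its variants."""
    groups = defaultdict(list)
    for line in lines:
        m = ROW.match(line)
        if m:
            groups[fix_mojibake(m.group(4))].append(float(m.group(2)))
    return dict(groups)

sample = [
    " 1.000   5.625   0.000  /s donnÃ©es nominatives /",
    " 1.000   5.625   0.000  /s données nominatives /",  # invented variant
    " 1.000   3.750   0.000  / LibertÃ©s du 6 /",
]
groups = regroup(sample)
for pattern, spam_pcts in groups.items():
    print(repr(pattern), spam_pcts)
```

The SPAM% values are kept as a list rather than summed, because two encodings of the same phrase may or may not hit the same messages; only a fresh run over the corpus would give the real combined hit rate.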

Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't read this mail in html, click here).

The whole result is available at http://www.saphirtech.fr/spam/seekrules_fr_1.txt

 http://taint.org/x/2008/seekrules_run

I also adapted this one (paths of course, but I also forced the "mbox" format, since "detect" spat out zero results), but the result is even less "readable" for me. I am missing the script seekrules/kill_bad_patterns, which I presume removes stand-alone words and such things.


Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
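
Since I do not have kill_bad_patterns, here is a minimal sketch of the manual clean-up step, under the assumption that "stand-alone word" simply means a pattern whose body contains a single word or word fragment. What the real script does is unknown to me; the function names and the sample patterns other than the first are made up for illustration.

```python
def is_stand_alone_word(pattern):
    """True if the pattern body holds at most one whitespace-separated token."""
    body = pattern.strip().strip("/")
    return len(body.split()) <= 1

def keep_multiword(patterns):
    """Drop single-word patterns, keep multi-word phrases in their original order."""
    return [p for p in patterns if not is_stand_alone_word(p)]

candidates = [
    "/Pour ne plus recevoir /",
    "/gratuit/",   # invented single-word pattern
    "/ vous /",    # invented: one word padded with spaces
]
kept = keep_multiword(candidates)
print(kept)
```

This only handles the mechanical part; deciding which multi-word expressions are sure to cause false positives still needs a human reading the list.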

John
