Hi,

You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
the patterns; you can then write rules based on these.

I did so, and the results are interesting, though I don't really know where to go from here. If I take the 50 "best" patterns and strip out the obvious stand-alone words and the expressions that are sure to produce false positives, here is what I end up with (apologies to non-French speakers; explanation below):

 RATIO   SPAM%    HAM%   DATA
 1.000   9.375   0.000  /Pour ne plus recevoir /
 1.000   6.875   0.000  /6 janvier 1978 relative /
 1.000   6.875   0.000  /affiche pas correctement, vous pouvez le visualiser en/
 1.000   5.625   0.000  /s données nominatives /
 1.000   5.625   0.000  / ce message, cliquez-ici/
 1.000   5.625   0.000  / vous désinscrire de /
 1.000   5.000   0.000  /Conformément à l/
 1.000   5.000   0.000  / plus recevoir d\'informations de notre part/
 1.000   5.000   0.000  /un droit d\'accès/
 1.000   4.375   0.000  /ment Ã|  l\'article 34 de la loi/
 1.000   4.375   0.000  /ment à l\'article 34 de la loi /
 1.000   3.750   0.000  /ous désinscrire de notre /
 1.000   3.750   0.000  /es nominatives vous concernant\. /
 1.000   3.750   0.000  / Libertés du 6 /
 1.000   3.750   0.000  /es vous concernant\. Pour l\'exercer, /

As you can see, charset encoding makes a mess (the "article 34" phrase shows up twice, once per encoding), and many of the patterns overlap and should be merged.
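To give an idea of the kind of merging I mean, here is a rough Python sketch. It is not part of seek-phrases-in-corpus or of any SpamAssassin tool; the function names and the 12-character threshold are my own assumptions. It puts two patterns in the same group when, after unescaping and lower-casing, they share a long enough common substring, which also catches the two charset variants of the "article 34" line.

```python
import re
from difflib import SequenceMatcher


def normalize(pattern: str) -> str:
    """Drop regex escapes (\\' -> ', \\. -> .), collapse whitespace, lower-case."""
    pattern = re.sub(r"\\(.)", r"\1", pattern)
    pattern = re.sub(r"\s+", " ", pattern)
    return pattern.strip().lower()


def shared_length(a: str, b: str) -> int:
    """Length of the longest substring common to a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def regroup(patterns, min_shared=12):
    """Greedily group patterns that overlap by at least min_shared characters.

    min_shared is an arbitrary threshold (assumption); tune it on real output.
    """
    groups = []
    for pattern in patterns:
        key = normalize(pattern)
        for group in groups:
            if any(shared_length(key, normalize(m)) >= min_shared for m in group):
                group.append(pattern)
                break
        else:
            # No existing group overlaps enough: start a new one.
            groups.append([pattern])
    return groups


if __name__ == "__main__":
    sample = [
        r"Pour ne plus recevoir ",
        r" vous désinscrire de ",
        r" plus recevoir d\'informations de notre part",
        r"ment Ã|  l\'article 34 de la loi",
        r"ment à l\'article 34 de la loi ",
        r"ous désinscrire de notre ",
    ]
    for group in regroup(sample):
        print(group)
```

Each resulting group could then become a single rule, written by hand from the members. The grouping is greedy and order-dependent, so it is only a first pass.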

Anyway, these are the patterns I tried to encode in FR_SPAMISLEGAL and FR_HOWTOUNSUBSCRIBE, plus one I considered too generic ("if you can't read this mail in HTML, click here").

The whole result is available at http://www.saphirtech.fr/spam/seekrules_fr_1.txt

 http://taint.org/x/2008/seekrules_run

I also adapted this one (the paths, of course, but I also forced the "mbox" format, since "detect" produced zero results), but the result is even less readable to me. I am missing the script seekrules/kill_bad_patterns, which I presume removes stand-alone words and the like.

Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
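In the meantime, a filter along the following lines could stand in for the missing step. To be clear, I have not seen seekrules/kill_bad_patterns, so this is only my guess at the sort of clean-up it does, not its actual logic; the function names and the minimum length are made up for illustration.

```python
import re


def is_standalone_word(pattern: str) -> bool:
    """True when the pattern contains at most one word."""
    return len(re.findall(r"\w+", pattern)) <= 1


def keep_useful(patterns, min_chars=8):
    """Keep multi-word patterns of at least min_chars characters.

    min_chars is an arbitrary cut-off (assumption), not a value taken
    from any existing script.
    """
    return [
        p for p in patterns
        if not is_standalone_word(p) and len(p.strip()) >= min_chars
    ]


if __name__ == "__main__":
    candidates = ["Conformément ", " vous désinscrire de ", "de la"]
    print(keep_useful(candidates))
```

This only removes the most obvious noise; expressions that are likely false positives still have to be weeded out by hand.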

John
