Hi,
> You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> the patterns; you can then write rules based on these.
I did so. The results are interesting, though I do not really know where
to go from here. If I take the first 50 "best" patterns and strip out the
obvious stand-alone words and the expressions that are sure to be false
positives, here is what I get (apologies to non-French speakers;
explanation below):
RATIO SPAM% HAM% DATA
1.000 9.375 0.000 /Pour ne plus recevoir /
1.000 6.875 0.000 /6 janvier 1978 relative /
1.000 6.875 0.000 /affiche pas correctement, vous pouvez le visualiser en/
1.000 5.625 0.000 /s données nominatives /
1.000 5.625 0.000 / ce message, cliquez-ici/
1.000 5.625 0.000 / vous désinscrire de /
1.000 5.000 0.000 /Conformément à l/
1.000 5.000 0.000 / plus recevoir d\'informations de notre part/
1.000 5.000 0.000 /un droit d\'accès/
1.000 4.375 0.000 /ment Ã| l\'article 34 de la loi/
1.000 4.375 0.000 /ment à l\'article 34 de la loi /
1.000 3.750 0.000 /ous désinscrire de notre /
1.000 3.750 0.000 /es nominatives vous concernant\. /
1.000 3.750 0.000 / Libertés du 6 /
1.000 3.750 0.000 /es vous concernant\. Pour l\'exercer, /
As you can see, charset encoding makes a mess (the "article 34" phrase
appears twice, once per encoding), and many overlapping patterns need to
be merged.
Anyway, these are the patterns I tried to encode in FR_SPAMISLEGAL and
FR_HOWTOUNSUBSCRIBE, plus one I considered too generic ("if you can't
read this mail in HTML, click here").
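
To illustrate the merging I have in mind, here is a rough Python sketch
(not part of the seekrules tools; the function names are mine). It
assumes the output lines look like the ones above, reduces each pattern
to an accent-free, lower-case key, and groups patterns sharing a key.
It happens to fold the two "article 34" lines together, but it will not
repair every kind of charset damage, and it does not merge patterns
that merely overlap.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed line layout: RATIO SPAM% HAM% /pattern/
LINE_RE = re.compile(r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+/(.*)/\s*$")


def normalize(pattern):
    """Reduce a pattern to a lower-case, accent-free, space-separated key."""
    decomposed = unicodedata.normalize("NFKD", pattern)
    chars = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent marks split off by NFKD
        # Anything that is not a letter or digit becomes a separator.
        chars.append(ch.lower() if ch.isalnum() else " ")
    return " ".join("".join(chars).split())


def group_lines(lines):
    """Map each normalized key to the (spam%, raw pattern) rows sharing it."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # header or unrelated line
        _ratio, spam, _ham, pattern = m.groups()
        groups[normalize(pattern)].append((float(spam), pattern))
    return dict(groups)
```

Each group could then become a single rule, with the raw variants kept
around as alternatives if the rule has to match both encodings.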
The whole result is available at
http://www.saphirtech.fr/spam/seekrules_fr_1.txt
> http://taint.org/x/2008/seekrules_run
I also adapted this one (the paths, of course, but I also had to force
the "mbox" format, since "detect" produced zero results), but the output
is even less readable to me. I am missing the script
seekrules/kill_bad_patterns, which I presume removes stand-alone words
and the like.
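
In the meantime, this is roughly the filtering I did by hand, as a
Python sketch. It is only my guess at the idea and not the real
kill_bad_patterns script, which I have not seen; the function names and
the min_words threshold are my own assumptions. It drops any pattern
containing fewer than min_words word tokens.

```python
import re

# \w+ matches runs of letters/digits, including accented ones, in Python 3.
WORD_RE = re.compile(r"\w+")


def is_standalone(pattern, min_words=2):
    """True if the pattern holds fewer than min_words word tokens."""
    return len(WORD_RE.findall(pattern)) < min_words


def drop_standalone(patterns, min_words=2):
    """Keep only the patterns that span at least min_words tokens."""
    return [p for p in patterns if not is_standalone(p, min_words)]
```

Note that a word fragment such as the leading "s" in "s données
nominatives" counts as a token here, so this is a crude filter at best.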
Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
John