John GALLET wrote:
Re,

Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
read this mail in html, click here).

It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid. ;)

Well, that's the whole point: can we conclude that an email with an unsubscribe link tends to be spam more often than ham? I think so, but with a low score. Can we conclude that an email citing the French law "informatique et libertés" is spam? I would say "100%, except government-sponsored mailing lists that may feel obliged to do so", so I gave it a higher score. Now this may well be faulty logic; I do not have any experience in spam fighting.
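
A minimal sketch of the scoring idea described above, assuming two regular expressions and two scores that are made up for illustration. They are not the actual FR_HOWTOUNSUBSCRIBE or FR_SPAMISLEGAL definitions, which are not shown in this thread.

```python
import re

# Hypothetical patterns and scores, chosen only for illustration; they are
# not the real FR_HOWTOUNSUBSCRIBE / FR_SPAMISLEGAL rule definitions.
RULES = {
    "FR_HOWTOUNSUBSCRIBE": (re.compile(r"d[ée]sabonner|d[ée]sinscrire", re.I), 0.5),
    "FR_SPAMISLEGAL": (re.compile(r"informatique\s+et\s+libert[ée]s", re.I), 2.0),
}

def score(body):
    """Return (total score, names of the rules that hit) for a message body."""
    hits = [name for name, (pattern, _) in RULES.items() if pattern.search(body)]
    return sum(RULES[name][1] for name in hits), hits
```

With these assumed values, a message that only offers an unsubscribe link gets a small score, while one that also cites the law gets the sum of both.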

many mailing lists and safe newsletters contain such links. examples:
- mailing lists hosted by ovh
- HSC newsletter (Herve is not the kind of guy to participate in spam)
- Ciel (if you junk this, your accountants may junk your salary :)
- Air France (I want my tickets!)
...


same goes for "legal" stuff (nobody wants to miss his Air France electronic tickets...)

Things get even worse when ads are included in important mail. here is an 
excerpt from a mail from SNCF (confirmation):

<excerpt>
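Pas encore membre ? [Not a member yet?]
Inscrivez-vous dès aujourd'hui et gagnez déjà 100 Maximiles de bienvenue ! [Sign up today and earn 100 welcome Maximiles right away!]
Pour en savoir plus, cliquez ici [To find out more, click here] <http://www.maximiles.com/index.php?LIEN=mail/mail2/joinsncf>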

</excerpt>

and French members may remember that maximiles participated in the infamous "Sarkozy spam". but apparently, they have cleaned up since then (the address I use at sncf is [EMAIL PROTECTED], so I wouldn't miss it if they use it!).

here is an excerpt from a safe (and actually relatively closed) newsletter.
<excerpt>
Non, ceci n'est pas un SPAM, c'est la lettre d'information de [No, this is not SPAM, this is the newsletter of]
...
Si vous désirez vous désabonner, ... [If you wish to unsubscribe, ...]
</excerpt>

(of course, you can argue that a message may not be a "SPAM" because you can't eat an email. but let's not be too pedantic :-p).

I did not run your rules on my corpus. I'll try to do so but my spam corpus is 
not classified by language.
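
A rough sketch of how such a check could be done: compare how often a pattern hits in ham versus spam. The pattern and the two tiny corpora below are invented placeholders, not real data; a real test would load actual message bodies.

```python
import re

def hit_rate(pattern, bodies):
    """Fraction of message bodies matching pattern (0.0 for an empty corpus)."""
    if not bodies:
        return 0.0
    return sum(1 for body in bodies if pattern.search(body)) / len(bodies)

# Hypothetical pattern and toy corpora, for illustration only.
unsubscribe = re.compile(r"d[ée]sabonn", re.I)
ham = ["Si vous désirez vous désabonner, ...", "Votre billet électronique"]
spam = ["Pour vous désabonner, cliquez ici", "Offre exceptionnelle, désabonnez-vous ici"]

ham_rate = hit_rate(unsubscribe, ham)
spam_rate = hit_rate(unsubscribe, spam)
```

A rule only deserves a positive score if its spam rate is clearly higher than its ham rate on a realistic corpus; a pattern that also hits a large share of ham is a false-positive risk.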









