On 09/26, Adi wrote:
> are part of some SPAM messages but normal messages too.
> You should consider use long phrase to eliminate wrong matching.
> Many Polish words have many meanings depending on the context.

Certainly proper rules that hit only spam would be preferable, but to
make any decent attempt at that would require access to a bunch of
Polish non-spam for testing, which I do not have.

If you (or anybody) are regularly receiving non-spam in a language
other than English (and willing to sort it into spam vs. non-spam
folders), it would be valuable to the spamassassin project to run the
testing script (masscheck) to report how many of your spams and
non-spams each of the rules hit. You don't have to give anybody a copy
of your emails, just the report of the hit counts. More info here:
https://wiki.apache.org/spamassassin/NightlyMassCheck

There's also stuff about automatic rule generation here that might be
fun:
https://wiki.apache.org/spamassassin/WritingRules#Automatic_rule_generation

On 09/26, John Hardin wrote:
> How do you get a one byte match for two-byte-long UTF-8-encoded
> accented characters? Shouldn't it generate this:

I believe it was putting 'export PERL_UNICODE=""' in my ~/.bashrc.
Documentation is here:
http://perldoc.perl.org/perlrun.html#*-C-[_number/list_]*

Before I set that environment variable, as you said, I was getting two
output characters per two byte long UTF-8 character.

> Your rule doesn't hit in my test environment (though I just pasted
> that word into an existing message to test...)

Weird.
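
The one-byte vs. two-byte difference discussed above can be illustrated outside Perl. What follows is only a rough sketch in Python, not what SpamAssassin or the rule-generation script actually does. The sample word is an assumption chosen for illustration and is not taken from any rule in this thread. It shows that the same text yields one unit per accented character when handled as decoded characters, and two units per accented character when handled as raw UTF-8 bytes.

```python
import re

# Hypothetical sample word (an assumption, not from any rule in this thread).
# Each of its four letters is outside ASCII and encodes to two bytes in UTF-8.
word = "żółć"

# Raw UTF-8 bytes, roughly what a tool sees when it does not decode its input.
encoded = word.encode("utf-8")

char_count = len(word)     # number of decoded characters
byte_count = len(encoded)  # number of raw bytes

# "." matches one unit at a time: one character in a str pattern,
# one byte in a bytes pattern.
char_matches = re.findall(r".", word)
byte_matches = re.findall(rb".", encoded)

print(char_count, byte_count)
print(len(char_matches), len(byte_matches))
```

If a rule is written against one view of the text and the message is scanned in the other, a pattern that looks identical on screen may fail to hit, which might be one explanation for a rule matching in one environment and not in another.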