Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

darxus Fri, 26 Sep 2014 14:59:34 -0700

On 09/26, Adi wrote:
> are part of some SPAM messages but normal messages too.
> You should consider use long phrase to eliminate wrong matching.
> Many Polish words have many meanings depending on the context.

Certainly proper rules that hit only spam would be preferable, but to
make any decent attempt at that would require access to a bunch of Polish
non-spam for testing, which I do not have.

If you (or anybody) regularly receive non-spam in a language other
than English, and are willing to sort it into spam and non-spam folders,
it would be valuable to the SpamAssassin project if you ran the testing
script (mass-check) and reported how many of your spams and non-spams
each rule hits.  You don't have to give anybody a copy of your emails,
just the report of the hit counts.  More info here:

https://wiki.apache.org/spamassassin/NightlyMassCheck
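
To give a rough idea of what "just the hit counts" means, here is a small
Python sketch.  It is not mass-check and does not use its log format; the
function name hit_counts, the rule names, and the sample data are all made
up for illustration.  It only shows that the report boils down to per-rule
totals, with no message content in it.

```python
from collections import Counter

def hit_counts(messages):
    """Tally per-rule hits separately for spam and non-spam.

    `messages` is an iterable of (is_spam, rules_hit) pairs; this input
    shape is an assumption for the sketch, not mass-check's format.
    """
    spam, ham = Counter(), Counter()
    for is_spam, rules in messages:
        # set() so a rule counts at most once per message
        (spam if is_spam else ham).update(set(rules))
    return spam, ham

# Hypothetical sample: two spams and one non-spam.
sample = [
    (True, ["EXAMPLE_RULE_A", "EXAMPLE_RULE_B"]),
    (True, ["EXAMPLE_RULE_A"]),
    (False, ["EXAMPLE_RULE_B"]),
]

spam, ham = hit_counts(sample)
for rule in sorted(set(spam) | set(ham)):
    print(rule, "spam:", spam[rule], "non-spam:", ham[rule])
```

A rule that hits lots of spam and no non-spam is the kind worth keeping.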

There's also stuff about automatic rule generation here that might be fun:
https://wiki.apache.org/spamassassin/WritingRules#Automatic_rule_generation

On 09/26, John Hardin wrote:
> How do you get a one byte match for two-byte-long UTF-8-encoded
> accented characters? Shouldn't it generate this:

I believe it was putting 'export PERL_UNICODE=""' in my ~/.bashrc.
Documentation is here:
http://perldoc.perl.org/perlrun.html#*-C-[_number/list_]*

Before I set that environment variable I was getting, as you said, two
output characters per two-byte-long UTF-8 character.
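
To illustrate the byte-versus-character difference, here is a rough Python
sketch.  It is not the Perl generator script from this thread; the sample
word and the helper to_byte_pattern are hypothetical.  It shows the same
word counted as characters and as UTF-8 bytes, and one way a byte-level
pattern could be written out.

```python
import re

word = "żółty"                   # arbitrary Polish sample word
encoded = word.encode("utf-8")   # the raw bytes as they appear on the wire

print(len(word))      # 5 characters (code points)
print(len(encoded))   # 8 bytes: each accented letter takes two

def to_byte_pattern(text):
    """Write text as a regex over raw UTF-8 bytes.

    Non-ASCII bytes become \\xNN escapes; ASCII bytes are regex-escaped.
    """
    parts = []
    for b in text.encode("utf-8"):
        parts.append(f"\\x{b:02x}" if b >= 0x80 else re.escape(chr(b)))
    return "".join(parts)

pattern = to_byte_pattern(word)
print(pattern)        # \xc5\xbc\xc3\xb3\xc5\x82ty

# A bytes pattern only matches bytes input, one byte per regex atom.
rule_re = re.compile(pattern.encode("ascii"))
raw_message = "Kup żółty ser".encode("utf-8")
print(rule_re.search(raw_message) is not None)   # True
```

Whether a tool counts one unit or two per accented letter depends on
whether it sees decoded characters or raw bytes.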

> Your rule doesn't hit in my test environment (though I just pasted
> that word into an existing message to test...)

Weird.
