Re: Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-13 Thread Kai Schaetzl
Kelson wrote on Fri, 12 May 2006 14:23:55 -0700:

 I count two:  The ü in für and the ´ in MODEL´S, which is different from 
 the ASCII single quote/apostrophe: '

Ah, you are right, I missed the ü; it's too natural for me.
Nevertheless, "too many" implies a bit more than *two* to me. I can't say
exactly how many, but I'd use a better description. The rule is an eval
rule, so I don't know how many characters it takes to fire; maybe it's
really just one.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-12 Thread Kai Schaetzl
Theo Van Dinter wrote on Thu, 11 May 2006 13:49:11 -0400:

 fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable. 
 then the rule wouldn't hit.

I just found the same problem here with a whole bunch of messages coming from
the same source. It seems the rule hits on *one* occurrence of a non-ASCII
character; however, the description says "Subject: has too many raw illegal
characters". At the least, the description is wrong then.
And, as Keith explains, I think that score is excessive. It's fairly common
that some mail programs, especially webmail or form-generated ones, leave at
least one non-encoded character in the subject.

The subject line hitting in the case of our customer was:
Bewerbung für INS-2006-05-4, MODEL´S GESUCHT!!!

I can identify only one character that is outside the ASCII range.

Kai

-- 
Kai Schätzl, Berlin, Germany





Re: Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-12 Thread Kelson

Kai Schaetzl wrote:

The subject line hitting in the case of our customer was:
Bewerbung für INS-2006-05-4, MODEL´S GESUCHT!!!

I can identify only one character that is outside the ASCII range.


I count two:  The ü in für and the ´ in MODEL´S, which is different from
the ASCII single quote/apostrophe: '
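
A quick way to double-check the count, as a minimal Python sketch. The helper name non_ascii_chars is made up for illustration, and this is not how SpamAssassin itself evaluates the rule; it only lists the characters above code point 127.

```python
def non_ascii_chars(subject: str) -> list[str]:
    """Return every character in the subject whose code point is above 127."""
    return [ch for ch in subject if ord(ch) > 127]


subject = "Bewerbung für INS-2006-05-4, MODEL´S GESUCHT!!!"
for ch in non_ascii_chars(subject):
    # Show each offending character together with its Unicode code point.
    print(f"{ch!r} U+{ord(ch):04X}")
```

For the subject above this lists two characters, U+00FC and U+00B4.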

--
Kelson Vibber
SpeedGate Communications www.speed.net


Re: Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-12 Thread jdow

From: Kai Schaetzl [EMAIL PROTECTED]


Theo Van Dinter wrote on Thu, 11 May 2006 13:49:11 -0400:


fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable.
then the rule wouldn't hit.


I just found the same problem here with a whole bunch of messages coming from
the same source. It seems the rule hits on *one* occurrence of a non-ASCII
character; however, the description says "Subject: has too many raw illegal
characters". At the least, the description is wrong then.
And, as Keith explains, I think that score is excessive. It's fairly common
that some mail programs, especially webmail or form-generated ones, leave at
least one non-encoded character in the subject.

The subject line hitting in the case of our customer was:
Bewerbung für INS-2006-05-4, MODEL´S GESUCHT!!!

I can identify only one character that is outside the ASCII range.

Kai


1 is too many, of course.
{^_-} 



Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-11 Thread Keith Dunnett

I've recently had a couple of false positives caused by this rule, and think
it may be scored too highly for a single check. The e-mails in question were
in Spanish, and the Spanish word for linguistics has two accented characters
which is enough to trigger this rule.

Admittedly, the blacklists account for 2.4 points (it was from Yahoo) but the
4.3 point score for the subject alone strikes me as excessive. I understand 
that anything that is not English is inherently suspect for most users, but 
to give 86% of the default spam score on almost *any* single rule would seem 
to me to be overkill.


Alternatively, is there (or should there be) a ruleset for those who wish to
receive e-mail in other languages? Ideally, a Spanish-friendly ruleset would
reduce the scores of character-based rules, while adding in rules for known 
spam in Spanish where possible. Does such a thing already exist? Should it?


The spam report from the e-mail in question follows, although the above 
pretty much sums it up.


X-Spam-Report:
	*  0.0 DK_POLICY_SIGNSOME Domain Keys: policy says domain signs some mails
	*  0.0 DK_POLICY_TESTING Domain Keys: policy says domain is testing DK
	*  4.3 SUBJ_ILLEGAL_CHARS Subject: has too many raw illegal characters
	*  0.0 DK_SIGNED Domain Keys: message has an unverified signature
	* -0.0 DK_VERIFIED Domain Keys: signature passes verification
	*  0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
	*  0.0 HTML_MESSAGE BODY: HTML included in message
	*  0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
	*      [score: 0.5000]
	*  0.2 DNS_FROM_RFC_ABUSE RBL: Envelope sender in abuse.rfc-ignorant.org
	*  1.4 DNS_FROM_RFC_WHOIS RBL: Envelope sender in whois.rfc-ignorant.org
	*  0.8 RCVD_IN_BLARS RBL: Received via a relay in block.blars.org
	*      [217.216.40.199 listed in block.blars.org]
	*      [66.163.178.160 listed in block.blars.org]
	* -0.5 AWL AWL: From: address is in the auto white-list
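
For the arithmetic, here is a small Python sketch that adds up the scores from the report. The numbers are copied from the report; the 5.0 threshold is the default required score that the 86% figure refers to, and the variable names are only illustrative.

```python
# Rule scores copied from the X-Spam-Report in this message.
report = {
    "DK_POLICY_SIGNSOME": 0.0,
    "DK_POLICY_TESTING": 0.0,
    "SUBJ_ILLEGAL_CHARS": 4.3,
    "DK_SIGNED": 0.0,
    "DK_VERIFIED": -0.0,
    "HTML_40_50": 0.5,
    "HTML_MESSAGE": 0.0,
    "BAYES_50": 0.0,
    "DNS_FROM_RFC_ABUSE": 0.2,
    "DNS_FROM_RFC_WHOIS": 1.4,
    "RCVD_IN_BLARS": 0.8,
    "AWL": -0.5,
}
REQUIRED_SCORE = 5.0  # default spam threshold

total = round(sum(report.values()), 1)
# The three DNS blacklist rules in this report.
blacklists = round(
    sum(v for k, v in report.items() if k.startswith(("DNS_FROM_", "RCVD_IN_"))), 1
)
# Share of the threshold contributed by the subject rule alone, in percent.
share = round(report["SUBJ_ILLEGAL_CHARS"] / REQUIRED_SCORE * 100)
without_rule = round(total - report["SUBJ_ILLEGAL_CHARS"], 1)

print(total, blacklists, share, without_rule)
```

With these figures the message totals 6.7 points; without SUBJ_ILLEGAL_CHARS it would sit at 2.4, well under the threshold.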

Regards,

Keith



Re: Scoring for rule SUBJ_ILLEGAL_CHARS

2006-05-11 Thread Theo Van Dinter
On Thu, May 11, 2006 at 07:47:15PM +0200, Keith Dunnett wrote:
 I've recently had a couple of false positives caused by this rule, and think
 it may be scored too highly for a single check. The e-mails in question were
 in Spanish, and the Spanish word for linguistics has two accented characters
 which is enough to trigger this rule.

fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable.
then the rule wouldn't hit.

 Admittedly, the blacklists account for 2.4 points (it was from Yahoo) but
 the 4.3 point score for the subject alone strikes me as excessive. I understand
 that anything that is not English is inherently suspect for most users, but
 to give 86% of the default spam score on almost *any* single rule would
 seem to me to be overkill.

It's actually less about english vs non-english and more about messages
violating the rfc (non 7-bit ascii chars need to be encoded in the header).
however, english maps to 7-bit ascii very well, so ...
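
To illustrate what "encoded in the header" looks like in practice, here is a minimal Python sketch using the standard library's email.header module. The helper names encode_subject and decode_subject are made up for this example; it only shows the general idea of what a compliant mail client does when it builds the Subject: header.

```python
from email.header import Header, decode_header, make_header


def encode_subject(subject: str) -> str:
    """Encode a subject as MIME encoded-words so the header is pure ASCII."""
    return Header(subject, charset="utf-8").encode()


def decode_subject(raw: str) -> str:
    """Turn encoded-words back into readable text."""
    return str(make_header(decode_header(raw)))


encoded = encode_subject("Bewerbung für INS-2006-05-4")
print(encoded)                  # contains only 7-bit ASCII characters
print(decode_subject(encoded))  # the original text again
```

A subject sent in this form carries no raw 8-bit characters, so a check for raw non-ASCII bytes has nothing to hit on.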

-- 
Randomly Generated Tagline:
They who can give up essential liberty to obtain a little temporary
 safety deserve neither liberty nor safety. - Benjamin Franklin




Re: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Милен Панков

Matt Kettler wrote:

Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or character
set is used. It is concerned about it not being encoded properly.

Per RFC specifications, all characters in email-headers that aren't in the
normal ascii ranges must be QP encoded. This rule is essentially detecting that
the sender used extended range character sets, but their email client neglected
to properly QP encode it.
  

Can you please point out which RFC this is and what exactly 'QP encoding' means?

Thanks,
Milen


Re: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Loren Wilton
 Can You please point which RFC is this and what exactly 'QP encoding' means.

Someone else can doubtless point to the RFC, but as an example, your name in
the From address is sent as a MIME encoded-word (Base64 in this case, which is
what the "B" indicates).  I've added some spaces to it below so that your mail
client doesn't turn it back into Cyrillic characters:

=? UTF-8 ? B ? 0JzQuNC70LXQvSDQn9Cw0L3QutC+0LI = ? =
[EMAIL PROTECTED]

The same general encoding should be used for a Subject line with non-ascii
characters, such as those in your name.
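
To see what is inside that encoded-word, here is a minimal Python sketch using the standard library. The string is the example above with the protective spaces removed; the "B" field marks the payload as Base64 and "UTF-8" names the character set.

```python
from email.header import decode_header

# The example from this message, with the added spaces taken out again.
encoded_name = "=?UTF-8?B?0JzQuNC70LXQvSDQn9Cw0L3QutC+0LI=?="

# decode_header returns one (bytes, charset) pair per encoded-word.
raw_bytes, charset = decode_header(encoded_name)[0]
decoded_name = raw_bytes.decode(charset)
print(decoded_name)
```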

Loren



Re: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Craig Morrison

Милен Панков wrote:

Matt Kettler wrote:
Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or character
set is used. It is concerned about it not being encoded properly.

Per RFC specifications, all characters in email-headers that aren't in the
normal ascii ranges must be QP encoded. This rule is essentially detecting that
the sender used extended range character sets, but their email client neglected
to properly QP encode it.

Can You please point which RFC is this and what exactly 'QP encoding' means.


RFC 2822

QP = Quoted Printable



Thanks,
Milen





RE: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Randal, Phil
http://www.faqs.org/rfcs/rfc2822.html

Refer to Section 3.2.2 for information on quoted-pairs.

Cheers,

Phil


Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK  

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: 15 March 2006 15:22
 To: users@spamassassin.apache.org
 Cc: Matt Kettler
 Subject: Re: SUBJ_ILLEGAL_CHARS
 
 Matt Kettler wrote:
  Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or
  character set is used. It is concerned about it not being encoded properly.
 
  Per RFC specifications, all characters in email-headers that aren't in
  the normal ascii ranges must be QP encoded. This rule is essentially
  detecting that the sender used extended range character sets, but
  their email client neglected to properly QP encode it.

 Can You please point which RFC is this and what exactly 'QP encoding' means.
 
 Thanks,
 Milen
 


RE: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Randal, Phil
And 

  http://www.faqs.org/rfcs/rfc2047.html

Cheers,

Phil


Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK  

 -Original Message-
 From: Craig Morrison [mailto:[EMAIL PROTECTED] 
 Sent: 15 March 2006 15:31
 To: users@spamassassin.apache.org
 Subject: Re: SUBJ_ILLEGAL_CHARS
 
 Милен Панков wrote:
  Matt Kettler wrote:
  Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or
  character set is used. It is concerned about it not being encoded
  properly.
 
  Per RFC specifications, all characters in email-headers that aren't
  in the normal ascii ranges must be QP encoded. This rule is
  essentially detecting that the sender used extended range character
  sets, but their email client neglected to properly QP encode it.

  Can You please point which RFC is this and what exactly 'QP encoding'
  means.
 
 RFC 2822
 
 QP = Quoted Printable
 
  
  Thanks,
  Milen
  
 


Re: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Theo Van Dinter
On Wed, Mar 15, 2006 at 03:29:45PM -, Randal, Phil wrote:
  Can You please point which RFC is this and what exactly 'QP 
  encoding' means.
 http://www.faqs.org/rfcs/rfc2822.html
 Refer to Section 3.2.2 for information on quoted-pairs.

QP in this case does not mean quoted pairs; it means Quoted Printable,
which is a MIME encoding a la http://www.faqs.org/rfcs/rfc1522.html and
http://www.faqs.org/rfcs/rfc2047.html

2822 is the RFC which talks about only US-ASCII (7-bit) in the headers,
see section 2.2.


Matt Kettler wrote:
 Per RFC specifications, all characters in email-headers that aren't in
 the normal ascii ranges must be QP encoded.

This isn't exactly correct.  2822 specifies that only US-ASCII may
appear in the headers, but it doesn't say what to do with characters
outside that range.  1522 and 2047 discuss how to use either QP or
Base64 encoding (either is valid) to deal with those headers.
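
As a rough illustration of the two options, here is a simplified Python sketch that builds both forms by hand. The function names encode_q and encode_b are made up for this example, and it deliberately ignores details such as the length limit on a single encoded-word, so treat it as a sketch rather than a complete implementation.

```python
import base64

# Bytes passed through unchanged by this simplified "Q" encoder;
# every other byte is written as =XX, and a space becomes "_".
SAFE = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def encode_b(text: str, charset: str = "UTF-8") -> str:
    """Base64 ("B") form of an encoded-word."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def encode_q(text: str, charset: str = "UTF-8") -> str:
    """Quoted-printable style ("Q") form of an encoded-word."""
    parts = []
    for byte in text.encode(charset):
        if byte in SAFE:
            parts.append(chr(byte))
        elif byte == 0x20:
            parts.append("_")
        else:
            parts.append(f"={byte:02X}")
    return f"=?{charset}?Q?{''.join(parts)}?="


print(encode_q("für"))  # =?UTF-8?Q?f=C3=BCr?=
print(encode_b("für"))  # =?UTF-8?B?ZsO8cg==?=
```

Both lines carry the same text; the "Q" form stays partly readable for mostly-ASCII subjects, while the "B" form is usually more compact for text that is mostly non-ASCII.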

-- 
Randomly Generated Tagline:
 Bender, we didn't mind your drinking or your cleptomania or your
 pornography ring. -Leela 
  In fact, that's why we love you. -Zoidberg 




Re: SUBJ_ILLEGAL_CHARS

2006-03-15 Thread Craig McLean

Philip Prindeville wrote:
[snip]
  I mean it's not X.400, right?  ;-)

Thank the Gods...

C.
--
Craig McLean    http://fukka.co.uk
[EMAIL PROTECTED]   Where the fun never starts
Powered by FreeBSD, and GIN!


SUBJ_ILLEGAL_CHARS

2006-03-14 Thread Милен Панков

Hi to all,

I've been using SpamAssassin for years without any serious problems,
except for one. My users write messages mostly in Bulgarian and the
'SUBJ_ILLEGAL_CHARS' rule very often stops good mail.
I have put the line 'ok_languages bg en' in my local.cf, but it doesn't fix
the problem. For now I have set this rule's score to zero, which temporarily
fixes the problem. My question is how I can make it work without
disabling it. Maybe I need to tell SpamAssassin not to check for
specific encodings. For example, there are at least 4 encodings my users use
for writing/receiving mail (Windows-1251, KOI8-R, KOI8-U, UTF-8). How can I
do that?


Milen


Re: SUBJ_ILLEGAL_CHARS

2006-03-14 Thread Matt Kettler
Милен Панков wrote:
 Hi to all,
 
 I'm using spamassassin for years without any serious problems.

First: In my answers I'm assuming you are running 3.1.0 or higher. If you
aren't, please specify your version.

 Except for one. My users write messages mostly in bulgarian and the
 'SUBJ_ILLEGAL_CHARS' rule very often stops good mail.
 I have put in my local.cf the line 'ok_languages bg en', but it doesn't
 fix the problem. 

No, if anything that will make your problem WORSE. The default here is all. By
declaring an ok_languages you're limiting the number of acceptable languages.

Also note: this won't do anything at all unless you've got the textcat plugin
loaded in your v310.pre

 For now I made this rule not giving any scores and this
 temporary fixes the problem. My question is how can I make it work
 without disabling it. I may be need to say to spamassassin not to check
 for specific encodings. For example there are at least 4 encodings my
 users use for writing/receiving mail (Windows-1251, KOI8-R, KOI8-U,
 UTF-8). How can I do that?

Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or character
set is used. It is concerned about it not being encoded properly.

Per RFC specifications, all characters in email-headers that aren't in the
normal ascii ranges must be QP encoded. This rule is essentially detecting that
the sender used extended range character sets, but their email client neglected
to properly QP encode it.

Realistically, you have two options:

1) tell the sender their client isn't properly QP encoding Bulgarian text in
the subject headers.
2) accept that many email clients don't properly handle Bulgarian text, and
disable this rule by adding "score SUBJ_ILLEGAL_CHARS 0" to your local.cf.







Re: SUBJ_ILLEGAL_CHARS

2006-03-14 Thread Милен Панков

Matt Kettler wrote:


Милен Панков wrote:

Hi to all,

I'm using spamassassin for years without any serious problems.


First: In my answers I'm assuming you are running 3.1.0 or higher. If you
aren't, please specify your version.


Yes, it's 3.1.0, sorry




Except for one. My users write messages mostly in bulgarian and the
'SUBJ_ILLEGAL_CHARS' rule very often stops good mail.
I have put in my local.cf the line 'ok_languages bg en', but it doesn't
fix the problem. 


No, if anything that will make your problem WORSE. The default here is all. By
declaring an ok_languages you're limiting the number of acceptable languages.

Also note: this won't do anything at all unless you've got the textcat plugin
loaded in your v310.pre



Ok. I'll have that in mind.


For now I made this rule not giving any scores and this
temporary fixes the problem. My question is how can I make it work
without disabling it. I may be need to say to spamassassin not to check
for specific encodings. For example there are at least 4 encodings my
users use for writing/receiving mail (Windows-1251, KOI8-R, KOI8-U,
UTF-8). How can I do that?


Note that SUBJ_ILLEGAL_CHARS is NOT concerned with what language or character
set is used. It is concerned about it not being encoded properly.

Per RFC specifications, all characters in email-headers that aren't in the
normal ascii ranges must be QP encoded. This rule is essentially detecting that
the sender used extended range character sets, but their email client neglected
to properly QP encode it.

Realistically, you have two options:

1) tell the sender their client isn't properly QP encoding Bulgarian text in
the subject headers.
2) accept that many email clients don't properly handle Bulgarian text, and
disable this rule by adding "score SUBJ_ILLEGAL_CHARS 0" to your local.cf.



Well, this happens mostly when we receive mail from some webmail services,
for example Yahoo, so I'm stuck with the second option, which I'm already using.


Thanks,
Milen


Re: SUBJ_ILLEGAL_CHARS

2006-03-14 Thread Philip Prindeville
Милен Панков wrote:
 Matt Kettler wrote:
  Realistically, you have two options:

  1) tell the sender their client isn't properly QP encoding Bulgarian text in
  the subject headers.
  2) accept that many email clients don't properly handle Bulgarian text, and
  disable this rule by adding "score SUBJ_ILLEGAL_CHARS 0" to your local.cf.

 
 
 Well this happens mostly when we receive mail from some webmails for 
 example Yahoo, so I'm stuck with the second option, which I'm already using.
 
 Thanks,
 Milen


It's an issue, to be sure.  And people need to be edumacated.

I recently pointed out to the IT department at Dice.com that they were sending
out malformed Date: lines that were causing their emails to trigger against
ILLEGAL_DATE...  which most mailers manage to get right, so it's a fairly good
indicator of spam and can be safely cranked way up.

In fact, I pointed out chapter and verse from RFC-2821 where they were going
wrong, and how to fix it (by padding the hour out with a leading zero before
10am).
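
For anyone generating Date: lines themselves, here is a minimal Python sketch of the zero-padding, assuming the goal is simply a two-digit hour as the RFC 2822 date-time grammar expects. The timestamp is arbitrary and chosen only because it falls before 10am.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# An arbitrary timestamp before 10am, used purely for illustration.
stamp = datetime(2006, 3, 14, 9, 5, 0, tzinfo=timezone.utc)

# format_datetime zero-pads the hour, giving "09" rather than "9".
date_header = format_datetime(stamp)
print(date_header)  # Tue, 14 Mar 2006 09:05:00 +0000
```

Leaning on a library routine like this, rather than hand-formatting the string, avoids the single-digit hour problem altogether.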

They told me they appreciated my suggestion.

I reminded them that it wasn't a suggestion, it was a conclusive documentation
of where they were failing to conform to a 25 year-old specification that is,
in fact, trivial... all things considered.   I mean it's not X.400, right?  ;-)

Have they fixed it?

Not the last time I checked.

You'd think that given the nature of what they do, they'd have their pick of
the crop for good IT and messaging people.

Guess not.

Kind of makes me think twice about posting my resume with them.  :-(

-Philip