Re: Rule for Russian character sets

2008-02-18 Thread jidanni
Hmm, let me see. I use the below in user_prefs. Hope that helps.
header J_CHSET3 Subject:raw =~ /\s=\?(windows-(125[0125]|874)|koi8-r|iso-8859-[28])\?/i
score J_CHSET3 5
ifplugin Mail::SpamAssassin::Plugin::TextCat
#ok_languages en zh.big5
#http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5697
ok_languages en zh
add_header all Languages _LANGUAGES_
score UNWANTED_LANGUAGE_BODY 5
endif
ok_locales en zh
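
A rough way to see which charsets the J_CHSET3 pattern above would flag is to try it against a few sample headers. This is only a sketch in Python, whose `re` module reads this particular pattern the same way Perl does. The sample header lines are invented for illustration, and treating the whole header line as the match target is an assumption, not a statement about what SpamAssassin feeds the rule.

```python
import re

# Pattern copied from the J_CHSET3 rule above.
J_CHSET3 = re.compile(
    r"\s=\?(windows-(125[0125]|874)|koi8-r|iso-8859-[28])\?", re.IGNORECASE
)

# Hypothetical header lines (not taken from real mail) and the expected result.
SAMPLES = {
    "Subject: =?koi8-r?B?9/zk?=": True,          # KOI8-R
    "Subject: =?Windows-1251?Q?abc?=": True,     # matched case-insensitively
    "Subject: =?windows-1252?Q?abc?=": True,     # Western, but listed in the pattern
    "Subject: =?iso-8859-2?Q?abc?=": True,       # listed via iso-8859-[28]
    "Subject: =?iso-8859-1?Q?abc?=": False,      # not listed
    "Subject: =?utf-8?Q?abc?=": False,           # not listed
    "Subject:=?koi8-r?B?9/zk?=": False,          # no whitespace before "=?"
}


def hits(header: str) -> bool:
    """Return True if the J_CHSET3 pattern matches anywhere in the header."""
    return J_CHSET3.search(header) is not None


for header, expected in SAMPLES.items():
    print(f"{hits(header)!s:5} (expected {expected!s:5}) {header}")
```

Note the leading `\s`: the pattern only fires when whitespace precedes the `=?`, which is why the last sample does not match.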


Re: FW: Rule for Russian character sets (=?koi8-r? not quite a charset)

2008-02-18 Thread Karsten Bräckelmann
On Mon, 2008-02-18 at 09:36 +1300, Michael Hutchinson wrote:
   We don't want to only allow the English locale, because we (here at
   my work) do not want our international clients (non Russian) to be
   denied email service.
  
  ok_locales  en ja ko th zh
  
  This will allow anything but Cyrillic char sets. Please note that en
  does *not* mean English locale despite its name. It applies to all
  Western charsets, including German Umlauts, Swedish, French, Turkish,
  etc. Basically everything that uses the characters in this post, plus
  language specific chars.
  
 Ok now we're talking turkey. Thanks for providing the much needed
 clarity on ok_locales. I may just employ that technique yet, pending
 whether we get any more Russian spam through the gates.
 
  Sorry, I did not mean to troll nor any kind of offense.
 
 You have my apologies, as being a Friday afternoon, I was pretty sick of
 work and shouldn't have taken it out on you or the list. Sorry.

  Hope this clarifies my previous posts and is appreciated again...
 
 Your posts are appreciated, and sorry for the mean comment.

Thanks.  No offense taken, no harm done, don't worry. :)

  guenther


-- 
char *t=[EMAIL PROTECTED];
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



RE: Rule for Russian character sets

2008-02-17 Thread Michael Hutchinson

 -Original Message-
 For the most part you can match any character by the appearance of the
 character.  Any character with special meaning needs to be escaped in some
 way.  The easiest way is usually with a backslash, but in some cases you can
 also do it by making it a member of a character class.
 
 So for your question mark case, you could do \? or [?], as most of the special
 characters lose their meaning in a character class.  The exceptions are
 obviously right bracket, backslash, and dash becomes special if it isn't the
 first character.
 
  /\=\?koi8\-r\?/

This is what I'd setup originally, except when I ran it past a RE
interpreter the results were just.. wrong. I do think it would work,
however, and will be testing it on a Virtual Machine today to be sure.

 This should work.  You don't need to escape the dash, and I'm pretty sure
 you don't need to escape the equal sign; just the question mark.
 
 Also, you may want to handle this in both uppercase and lowercase, so you
 could do
 
 /=\?koi8-r\?/i
 
 And you probably don't need the = sign to get reasonably reliable
 matching.
 
Ah, this is the bit I was unsure about, limiting how many characters are
escaped. I would tend towards the fully escaped one myself, I just
wouldn't trust non-escaped = and ? signs. But that's probably got to do
with some bad history with Spamassassin:)

Thanks for reinforcing some points with RE that needed to be (:

Cheers,
Mike




FW: Rule for Russian character sets (=?koi8-r? not quite a charset)

2008-02-17 Thread Michael Hutchinson
-Original Message-snipsnip
  We don't want to only allow the English locale, because we (here at
  my work) do not want our international clients (non Russian) to be
  denied email service.
 
 ok_locales  en ja ko th zh
 
 This will allow anything but Cyrillic char sets. Please note that en
 does *not* mean English locale despite its name. It applies to all
 Western charsets, including German Umlauts, Swedish, French, Turkish,
 etc. Basically everything that uses the characters in this post, plus
 language specific chars.
 
Ok now we're talking turkey. Thanks for providing the much needed
clarity on ok_locales. I may just employ that technique yet, pending
whether we get any more Russian spam through the gates.

 Sorry, I did not mean to troll nor any kind of offense.

You have my apologies, as being a Friday afternoon, I was pretty sick of
work and shouldn't have taken it out on you or the list. Sorry.
 
 However, you missed my point. Getting detailed with REs is a good thing,
 sure. That was not my objection; the RE in question does not properly
 handle charset encoding. See the Subject for an example which is not
 encoding, but will be matched by your rule.
 
 My point was, that the rule discussed aims at being something that it
 unfortunately is not, because charset encoding is slightly more complex
 and definitely requires a closing part. A Regular Expression that does
 this can be found in check_for_faraway_charset_in_headers() in
 HeaderEval.pm:
   $hdr =~ /=\?(.+?)\?.\?.*?\?=/g
 
 Hence my re-inventing-the-wheel analogy. And these wheels are quite
 flexible, too. ;-)
 
 Also, your rule applies to the Subject only, whereas ok_locales does
 check all MIME parts and will trigger on Russian spam with a western
 Subject.

The RE in question (my one) was not just written for subject, but a
separate rule was written for the raw From: line as well. As we only
score spam here and leave filing it to the MUA (unless a score of 25 is
reached, where SA bins it), scoring against the Subject and From lines
makes OK sense, because if you used simply (=?koi8-r?) in the subject it
would not score high enough on its own to be filtered or blocked. (I'm
trying to employ what I've learned from the SA webpage about writing
multiple low-scoring rules, instead of a few big-scoring ones).

I can see it is flawed, but have to also admit that it is working rather
well at the moment. Mind you, I have taken the time to translate some of
the Russian Spam, work out spammy phrases, and then quote those phrases
to be scored against by SA.

 Hope this clarifies my previous posts and is appreciated again...

Your posts are appreciated, and sorry for the mean comment.

Cheers,
Mike



RE: Rule for Russian character sets (=?koi8-r? not quite a charset)

2008-02-15 Thread Karsten Bräckelmann
On Fri, 2008-02-15 at 17:10 +1300, Michael Hutchinson wrote:
  From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]

  Why are you guys now trying to re-invent the wheel in the special case
  of a gray asphalt street? What about a dirt track, grass, and anything
  else a wheel works on?
  
  I've pointed it out before. Just use ok_locales, which is all about
  these char sets. No REs, almost no thinking required, no headache. A
  single line, and you're done.
 
 We don't want to only allow the English locale, because we (here at
 my work) do not want our international clients (non Russian) to be
 denied email service. 

ok_locales  en ja ko th zh

This will allow anything but Cyrillic char sets. Please note that en
does *not* mean English locale despite its name. It applies to all
Western charsets, including German Umlauts, Swedish, French, Turkish,
etc. Basically everything that uses the characters in this post, plus
language specific chars.


 That aside, I really don't think getting detailed with Regular
 Expressions is re-inventing the wheel. Rather, it is expanding
 knowledge that will help write better rules in the future. (More
 flexible wheels, in your context).
 
 Although I appreciated your earlier post of 'ok_locales', and
 understood it, I did not appreciate your Troll.

Sorry, I did not mean to troll nor any kind of offense.

However, you missed my point. Getting detailed with REs is a good thing,
sure. That was not my objection; the RE in question does not properly
handle charset encoding. See the Subject for an example which is not
encoding, but will be matched by your rule.

My point was, that the rule discussed aims at being something that it
unfortunately is not, because charset encoding is slightly more complex
and definitely requires a closing part. A Regular Expression that does
this can be found in check_for_faraway_charset_in_headers() in
HeaderEval.pm:
  $hdr =~ /=\?(.+?)\?.\?.*?\?=/g

Hence my re-inventing-the-wheel analogy. And these wheels are quite
flexible, too. ;-)

Also, your rule applies to the Subject only, whereas ok_locales does
check all MIME parts and will trigger on Russian spam with a western
Subject.


Hope this clarifies my previous posts and is appreciated again...

  guenther
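
To make the difference concrete, here is a small sketch comparing the short rule with the encoded-word regex quoted above from HeaderEval.pm. It is written in Python rather than Perl, and both patterns behave the same way in either engine for these inputs. The helper name `charsets` is made up for this illustration; the two subjects are the ones that appear in this thread.

```python
import re

# The short rule discussed in this thread: matches the opening part only.
NAIVE = re.compile(r"=\?koi8-r\?", re.IGNORECASE)

# Regex quoted above from check_for_faraway_charset_in_headers():
# charset, encoding letter, payload, and the closing "?=".
ENCODED_WORD = re.compile(r"=\?(.+?)\?.\?.*?\?=")


def charsets(header: str) -> list:
    """Return the charset names of all complete encoded words in a header."""
    return [m.group(1).lower() for m in ENCODED_WORD.finditer(header)]


thread_subject = "RE: Rule for Russian character sets (=?koi8-r? not quite a charset)"
spam_subject = "=?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?="

for subject in (thread_subject, spam_subject):
    print(bool(NAIVE.search(subject)), charsets(subject), subject)
```

The short rule fires on both subjects, while the full encoded-word regex only reports a charset for the one that really is encoded.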





Re: Rule for Russian character sets

2008-02-15 Thread Paul Douglas Franklin


I believe that what you are asking for is
meta RUSSIAN_AND_BADTEXT (CHARSET_FARAWAY && __OTHER_RULE)
That requires first that you have set up ok_locales.
--Paul

Rosenbaum, Larry M. wrote:

From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]

I've pointed it out before. Just use ok_locales, which is all about
these char sets. No REs, almost no thinking required, no headache. A
single line, and you're done.



What's the best way to test the character set for use in a meta rule?  We don't 
want to reject all messages with the Russian (Cyrillic) character set, but we 
may want to use something like

if (character set is Russian) && (body contains 'xyzzy')

for instance.  How would we test the character set?
  


--
Paul Douglas Franklin
Computer Manager, Union Gospel Mission of Yakima, Washington
Husband of Danette
Father of Laurene, Miriam, Tycko, Timothy, Sarabeth, Marie, Dawnita, Anna Leah, 
Alexander, and Caleb



Re: Rule for Russian character sets

2008-02-15 Thread jidanni
KB If you want to trigger on Russian only, list all but ru.
What if to catch Ms. Ba'loney & Margar'ine, airport security had to keep a
current list of all the other people in the world. So this is the
wrong approach, as we've been thru before. OK, bye.


Re: Rule for Russian character sets

2008-02-15 Thread McDonald, Dan

On Fri, 2008-02-15 at 11:04 -0800, Paul Douglas Franklin wrote:
 I believe that what you are asking for is
 meta RUSSIAN_AND_BADTEXT (CHARSET_FARAWAY && __OTHER_RULE)
 That requires first that you have set up ok_locales.

If you have TextCat enabled, then the X-Language: meta header will be
added and can be used with rules, although it doesn't show up in the
output.

I don't think that there is an equivalent X-Locales: 


 --Paul
 
 Rosenbaum, Larry M. wrote:
  From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]
 
  I've pointed it out before. Just use ok_locales, which is all about
  these char sets. No REs, almost no thinking required, no headache. A
  single line, and you're done.
  
 
  What's the best way to test the character set for use in a meta rule?  We 
  don't want to reject all messages with the Russian (Cyrillic) character 
  set, but we may want to use something like
 
  if (character set is Russian) && (body contains 'xyzzy')
 
  for instance.  How would we test the character set?

 
-- 
Daniel J McDonald, CCIE #2495, CISSP #78281, CNX
Austin Energy
http://www.austinenergy.com





RE: Rule for Russian character sets

2008-02-15 Thread Rosenbaum, Larry M.
 From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]

 I've pointed it out before. Just use ok_locales, which is all about
 these char sets. No REs, almost no thinking required, no headache. A
 single line, and you're done.

What's the best way to test the character set for use in a meta rule?  We don't 
want to reject all messages with the Russian (Cyrillic) character set, but we 
may want to use something like

if (character set is Russian) && (body contains 'xyzzy')

for instance.  How would we test the character set?


RE: Rule for Russian character sets

2008-02-15 Thread Karsten Bräckelmann
On Fri, 2008-02-15 at 11:49 -0500, Rosenbaum, Larry M. wrote:
  From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]
 
  I've pointed it out before. Just use ok_locales, which is all about
  these char sets. No REs, almost no thinking required, no headache. A
  single line, and you're done.
 
 What's the best way to test the character set for use in a meta rule? 
 We don't want to reject

SA doesn't reject anyway. It merely classifies and tags mail.

 all messages with the Russian (Cyrillic)
 character set, but we may want to use something like
 
 if (character set is Russian) && (body contains 'xyzzy')

Well, it depends...

If it is ok for you to treat all char sets, which you did not set in
ok_locales, the same way, then it is just a regular meta rule -- and
based on my understanding of your description re-scoring of the few
CHARSET_FARAWAY rules.

 for instance.  How would we test the character set?

This I believe can not be done with the current HeaderEval plugin, since
it does not report the char set, but treats all unwanted char sets the
same. However, if you need fine grained rules per char set, it should be
fairly easy to alter the existing plugin or to write custom rules or
plugin based on this.

  guenther





Re: Rule for Russian character sets

2008-02-15 Thread Karsten Bräckelmann
On Sat, 2008-02-16 at 04:26 +0800, [EMAIL PROTECTED] wrote:
 KB If you want to trigger on Russian only, list all but ru.
 What if to catch Ms. Ba'loney & Margar'ine, airport security had to keep a
 current list of all the other people in the world. So this is the
 wrong approach, as we've been thru before. OK, bye.

Thank you for your most valuable contribution.

Yes, we've been through this before. However, it seems you still don't
understand. There IS NO negated counterpart to ok_locales. Also, this is
not about languages, but character sets -- and there are exactly 6. So,
listing all but one in this context doesn't seem to be asking too much.

Instead of ranting, just try to understand ok_locales as an option to
list all character sets you can read. For most people, this boils down
to one or two anyway. Thus, the general usecase is to list just these.

Also, the OP specifically asked to catch Russian only. Listing 5 locales
is the only way to do this currently. If you know about a better way,
please let me know.

Otherwise, you just wasted everyone's time. Had a bad day, eh?

  guenther





Rule for Russian character sets

2008-02-14 Thread up

We're suddenly getting a ton of spam with koi8-r encoding...I tried to do
a custom rule for it like this:

header SUBJ_RUSS_CHAR   Subject =~/koi8-r/i
describe SUBJ_RUSS_CHAR has Russian char encoding
score SUBJ_RUSS_CHAR   3.5

The short headers for these spams look like this:

Subject: [koi8-r] ??? 

The raw Subject header, like this:

Subject: =?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?=

I would think the rule would catch it either way...what am I missing?

TIA,

James Smallacombe PlantageNet, Inc. CEO and Janitor
[EMAIL PROTECTED]   
http://3.am





Re: Rule for Russian character sets

2008-02-14 Thread Per Jessen
[EMAIL PROTECTED] wrote:

 
 We're suddenly getting a ton of spam with koi8-r encoding...I tried to
 do a custom rule for it like this:
 
 header SUBJ_RUSS_CHAR   Subject =~/koi8-r/i
 describe SUBJ_RUSS_CHAR has Russian char encoding
 score SUBJ_RUSS_CHAR   3.5
 
 The short headers for these spams look like this:
 
 Subject: [koi8-r] ??? 
 
 The raw Subject header, like this:
 
 Subject: =?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?=
 
 I would think the rule would catch it either way...what am I missing?

I think this should work:

header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8-r/i



/Per Jessen, Zürich
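
Why the `:raw` variant matches where the plain one does not can be illustrated outside SpamAssassin. The sketch below uses Python's standard `email.header.decode_header` to decode the raw Subject from the original post. It only shows what the header looks like before and after decoding and makes no claim about SpamAssassin's internals.

```python
import re
from email.header import decode_header

# Raw Subject header quoted in the original post.
raw_subject = "=?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?="

# decode_header() returns (bytes, charset) pairs for encoded words.
payload, charset = decode_header(raw_subject)[0]
decoded_subject = payload.decode(charset)

# The rule's pattern, applied to both forms of the header.
rule = re.compile(r"koi8-r", re.IGNORECASE)
raw_hit = rule.search(raw_subject) is not None
decoded_hit = rule.search(decoded_subject) is not None

print("charset:", charset)
print("matches raw header:    ", raw_hit)
print("matches decoded header:", decoded_hit)
```

The charset name lives only in the encoded-word wrapper, so once the header has been decoded to Cyrillic text there is no "koi8-r" left for the pattern to find.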



Re: Rule for Russian character sets

2008-02-14 Thread Karsten Bräckelmann
On Thu, 2008-02-14 at 10:17 -0500, [EMAIL PROTECTED] wrote:
 We're suddenly getting a ton of spam with koi8-r encoding...I tried to do
 a custom rule for it like this:
 
 header SUBJ_RUSS_CHAR   Subject =~/koi8-r/i
 describe SUBJ_RUSS_CHAR has Russian char encoding
 score SUBJ_RUSS_CHAR   3.5

 I would think the rule would catch it either way...what am I missing?

I guess it's being decoded before matching. It's not the actual subject
anyway, but a charset definition.


Instead of writing your own rules to catch these, I suggest using
ok_locales. See the Language Options:
  http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html

If you want to trigger on Russian only, list all but ru. However, you
probably want more like en (all western charsets) only. ;)  Also, this
will trigger on header as well as on the body. grep for CHARSET_FARAWAY
in the rules, if you want to adjust its scores.

  guenther





Re: Rule for Russian character sets

2008-02-14 Thread up
On Thu, 14 Feb 2008, Per Jessen wrote:

 [EMAIL PROTECTED] wrote:

 
  We're suddenly getting a ton of spam with koi8-r encoding...I tried to
  do a custom rule for it like this:
 
  header SUBJ_RUSS_CHAR   Subject =~/koi8-r/i
  describe SUBJ_RUSS_CHAR has Russian char encoding
  score SUBJ_RUSS_CHAR   3.5
 
  The short headers for these spams look like this:
 
  Subject: [koi8-r] ??? 
 
  The raw Subject header, like this:
 
  Subject: =?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?=
 
  I would think the rule would catch it either way...what am I missing?

 I think this should work:

 header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8-r/i

That did it, thanks!

James Smallacombe PlantageNet, Inc. CEO and Janitor
[EMAIL PROTECTED]   
http://3.am



RE: Rule for Russian character sets

2008-02-14 Thread Michael Hutchinson

 -Original Message-
   We're suddenly getting a ton of spam with koi8-r encoding...I tried to
   do a custom rule for it like this:
  
   header SUBJ_RUSS_CHAR   Subject =~/koi8-r/i
   describe SUBJ_RUSS_CHAR has Russian char encoding
   score SUBJ_RUSS_CHAR   3.5
  
   The short headers for these spams look like this:
  
   Subject: [koi8-r] ??? 
  
   The raw Subject header, like this:
  
   Subject: =?koi8-r?B?9/zkINDSxcTQ0snR1MnKINPFzcnOwdI=?=
  
   I would think the rule would catch it either way...what am I missing?
 
  I think this should work:
 
  header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8-r/i
 
 That did it, thanks!
 

Are we not meant to delimit characters like a minus sign?

Ex:
header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8\-r/i

I would really like to trap the question marks too, just in case someone
sends a legitimate email with koi8-r in the subject (ie: why does email
with the koi8-r character set get tagged as spam?)

In other words, the following rule (if it worked) would be nice to use
instead:

Ex:

header SUBJ_RUSS_CHAR   Subject:raw =~ /\=\?koi8\-r\?/

Where we could trap the equals sign and two question marks. I have not
employed this rule because I think it's dodgy: the Regexp expander over
at SARE says there is a scary number of matches (2000+) with that rule,
so I'm presuming that the matching for the equals character and the
question mark is not working properly, and that they will have to be
delimited some other way. For example, using the \x1B notation, but I've
had no luck with this.

Does anyone have suggestions for matching question marks and equals
signs in one line? I would like to match everything exactly between the
double quotes:

=?koi8-r?

If I were to read the perldoc docs I'd be using \=\?koi8\-r\?
But I don't want to test it on my live server, because of the output of
the Regex expander utility.

Help anyone?

Cheers,
Mike





RE: Rule for Russian character sets

2008-02-14 Thread John Hardin

On Fri, 15 Feb 2008, Michael Hutchinson wrote:


Are we not meant to delimit characters like a minus sign?

Ex:
header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8\-r/i


Only where they have special meaning, and a dash is only special in a 
character set, e.g. [A-Z]. I have found the simplest way to avoid 
misinterpretation in that context is to put the dash first, e.g. 
[-abcde12345]


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174     pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...to announce there must be no criticism of the President or to
  stand by the President right or wrong is not only unpatriotic and
  servile, but is morally treasonous to the American public.
  -- Theodore Roosevelt, 1918
---
 8 days until George Washington's 276th Birthday
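
The point about the dash can be checked with a few throwaway patterns. This is a sketch in Python, which treats the dash the same way Perl does in these cases. The variable names are invented for illustration.

```python
import re

# Outside a character class a dash is literal, so escaping it changes nothing.
plain = re.compile(r"koi8-r", re.IGNORECASE)
escaped = re.compile(r"koi8\-r", re.IGNORECASE)

# Inside a class, a dash between two characters forms a range...
range_class = re.compile(r"[a-e]")
# ...while a leading dash is just a literal member of the class.
literal_class = re.compile(r"[-abcde12345]")

print(bool(plain.search("KOI8-R")), bool(escaped.search("KOI8-R")))
print(bool(range_class.search("-")), bool(literal_class.search("-")))
```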


RE: Rule for Russian character sets

2008-02-14 Thread Michael Hutchinson

 -Original Message-
 From: John Hardin [mailto:[EMAIL PROTECTED]
 Sent: Friday, 15 February 2008 2:19 p.m.
 To: Michael Hutchinson
 Cc: users@spamassassin.apache.org
 Subject: RE: Rule for Russian character sets
 
 On Fri, 15 Feb 2008, Michael Hutchinson wrote:
 
  Are we not meant to delimit characters like a minus sign?
 
  Ex:
  header SUBJ_RUSS_CHAR   Subject:raw =~ /koi8\-r/i
 
 Only where they have special meaning, and a dash is only special in a
 character set, e.g. [A-Z]. I have found the simplest way to avoid
 misinterpretation in that context is to put the dash first, e.g.
 [-abcde12345]

Ok fair enough. I've noticed that having the \ doesn't hurt for a dash.
Now what about matching a question mark and an equals sign? 

I'm tempted to setup Spamassassin under a virtual machine, just so I can
test against \= and \?

I've read perlre and perlretut and understand regular expressions, but
there is no clear cut way of matching these characters, either outlined
by this document or any Spamassassin document I've come across so far.

Except for a backslash, but I've heard no testimony that would suggest
this line will work with Spamassassin, and like before, the SARE Regular
Expressions Expander tool doesn't like it (and may have put undue doubt
in my head):

/\=\?koi8\-r\?/

I tried using \x1B notation, and it doesn't work, so presumably not
every feature of perl regular expressions works under Spamassassin.

Cheers,
Mike



FW: Rule for Russian character sets

2008-02-14 Thread Michael Hutchinson

-Original Message-
From: John Hardin [mailto:[EMAIL PROTECTED] 
Sent: Friday, 15 February 2008 3:07 p.m.
To: Michael Hutchinson
Subject: RE: Rule for Russian character sets

On Fri, 15 Feb 2008, Michael Hutchinson wrote:

 Now what about matching a question mark and an equals sign?

An equals sign isn't special but a question mark is.

 Except for a backslash, but I've heard no testimony would suggest this
 line will work with Spamassassin, and like before, the SARE Regular
 Expressions Expander tool doesn't like it (and may have put un-due doubt
 in my head):

 /\=\?koi8\-r\?/

Try /=\?koi8-r\?/i

NB: You can also use [?] (a character set consisting of a single question
mark) but that's a little clumsy.

-- 
  John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
  [EMAIL PROTECTED]    FALaholic #11174     pgpk -a [EMAIL PROTECTED]
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  It may be possible to start a programme of weapon registration as a
  first step towards the physical collection phase. ... Assurances
  must be provided, and met, that the process of registration will
  not lead to immediate weapons seizures by security forces.
   -- the UN, who doesn't want to confiscate guns
---
  8 days until George Washington's 276th Birthday


RE: Rule for Russian character sets

2008-02-14 Thread Michael Hutchinson
 -Original Message-
 From: John Hardin [mailto:[EMAIL PROTECTED]
 Sent: Friday, 15 February 2008 3:07 p.m.
 To: Michael Hutchinson
 Subject: RE: Rule for Russian character sets
 
 On Fri, 15 Feb 2008, Michael Hutchinson wrote:
 
  Now what about matching a question mark and an equals sign?
 
 An equals sign isn't special but a question mark is.
 
  Except for a backslash, but I've heard no testimony would suggest this
  line will work with Spamassassin, and like before, the SARE Regular
  Expressions Expander tool doesn't like it (and may have put un-due doubt
  in my head):
 
  /\=\?koi8\-r\?/
 
 Try /=\?koi8-r\?/i
 
 NB: You can also use [?] (a character set consisting of a single question
 mark) but that's a little clumsy.

OK sounds good, might just have to test that one under Vmware as well. 

Results from SARE Regexp expander weren't good, I don't know if I should
trust that thing anymore.

Thanks,
Mike



RE: Rule for Russian character sets

2008-02-14 Thread Karsten Bräckelmann
On Fri, 2008-02-15 at 12:19 +1300, Michael Hutchinson wrote:
[...]
 Does anyone have suggestions for matching question marks and equals
 signs in one line? I would like to match everything exactly between the
 double quotes:

Apart from neither the equals sign nor the minus being special in an RE
(outside a char class), unlike the question mark, which has been answered
already...


Why are you guys now trying to re-invent the wheel in the special case
of a gray asphalt street? What about a dirt track, grass, and anything
else a wheel works on?

I've pointed it out before. Just use ok_locales, which is all about
these char sets. No REs, almost no thinking required, no headache. A
single line, and you're done.

  guenther





RE: Rule for Russian character sets

2008-02-14 Thread Michael Hutchinson
 -Original Message-
 From: Karsten Bräckelmann [mailto:[EMAIL PROTECTED]
 Sent: Friday, 15 February 2008 3:43 p.m.
 To: users@spamassassin.apache.org
 Subject: RE: Rule for Russian character sets
 
 On Fri, 2008-02-15 at 12:19 +1300, Michael Hutchinson wrote:
 [...]
  Does anyone have suggestions for matching question marks and equals
  signs in one line? I would like to match everything exactly between the
  double quotes:
 
 Apart from neither the equals sign nor the minus being special in an RE
 (outside a char class), unlike the question mark, which has been answered
 already...
 
 
 Why are you guys now trying to re-invent the wheel in the special case
 of a gray asphalt street? What about a dirt track, grass, and anything
 else a wheel works on?
 
 I've pointed it out before. Just use ok_locales, which is all about
 these char sets. No REs, almost no thinking required, no headache. A
 single line, and you're done.
 
   guenther

We don't want to only allow the English locale, because we (here at my work) 
do not want our international clients (non Russian) to be denied email service. 

That aside, I really don't think getting detailed with Regular Expressions is 
re-inventing the wheel. Rather, it is expanding knowledge that will help write 
better rules in the future. (More flexible wheels, in your context).

Although I appreciated your earlier post of 'ok_locales', and understood it, I 
did not appreciate your Troll.

Cheers,
Mike



Re: Rule for Russian character sets

2008-02-14 Thread Loren Wilton

Ok fair enough. I've noticed that having the \ doesn't hurt for a dash.
Now what about matching a question mark and an equals sign?


If you read perlre closely you will find it says that it never hurts to put 
a backslash before a special character that you want to match as a 
character.  So this is a case of does nothing, but doesn't hurt.




I've read perlre and perlretut and understand regular expressions, but
there is no clear cut way of matching these characters, either outlined
by this document or any Spamassassin document I've come across so far.


For the most part you can match any character by the appearance of the 
character.  Any character with special meaning needs to be escaped in some 
way.  The easiest way is usually with a backslash, but in some cases you can 
also do it by making it a member of a character class.


So for your question mark case, you could do \? or [?], as most of the special 
characters lose their meaning in a character class.  The exceptions are 
obviously right bracket, backslash, and dash becomes special if it isn't the 
first character.



/\=\?koi8\-r\?/


This should work.  You don't need to escape the dash, and I'm pretty sure 
you don't need to escape the equal sign; just the questionmark.


Also, you may want to handle this in both uppercase and lowercase, so you 
could do


   /=\?koi8-r\?/i

And you probably don't need the = sign to get reasonably reliable matching.

   Loren