Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

2004-05-21 Thread Matt




Scott,

All of those patterns came directly from the source code of E-mails. 
The charset=3dgb2312" pattern is quite common in the body of E-mail, in
fact I think that it is FrontPage or another Microsoft product that
encodes it this way when it appears in a META tag.
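
The =3d comes from quoted-printable encoding, where a literal "=" in the body is written as "=3D" (the hex code of the "=" character), so the "=" stays and the "3D" follows it. A minimal Python sketch, assuming the HTML part is sent as quoted-printable and that the filter match is case-insensitive; the sample META content is illustrative and not taken from a real message:

```python
import quopri

# Illustrative META content as it might appear in an HTML part (assumption).
raw = b'content="text/html; charset=gb2312"'

# Quoted-printable escapes each "=" as "=3D"; the rest stays readable.
encoded = quopri.encodestring(raw)
print(encoded.decode("ascii"))

# Lower-cased, the encoded text contains the pattern used in the BODY lines.
found = b'charset=3dgb2312"' in encoded.lower()
print(found)
```

Decoding with quopri.decodestring() gives back the original bytes, which is why the same message can match both the plain and the =3d form depending on how it was encoded.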

Share and share alike :)

Matt



Scott Fisher wrote:

  I used this reference on character set names:
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp

Would you really encounter a =3d ?
Should it be charset3dgb2312?

Enjoyed your thoughts, glad you didn't post this on a Monday...

Scott Fisher
Director of IT
Farm Progress Companies

  
  

  
[EMAIL PROTECTED] 05/21/04 04:28PM >>>

  

  
  I think you might have possibly identified the group of required 
characters.  I'll give that a try.  I'm not sure if any Cyrillic stuff 
has been passing through but this bears watching as well and I might 
have to change my list there as well.

I am also tagging BIG5, however almost all spam comes in GB2312.  Here's 
what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT  25
MAXWEIGHT  10

TESTSFAILED  END  NOTCONTAINS  HIGHBIT

SUBJECT  END  CONTAINS  charset=gb2312
SUBJECT  END  CONTAINS  charset="gb2312"
SUBJECT  END  CONTAINS  charset=big5
SUBJECT  END  CONTAINS  charset="big5"

HEADERS  10  CONTAINS  =?gb2312?b?
HEADERS  10  CONTAINS  =?big5?b?
HEADERS  10  CONTAINS  charset=gb2312
HEADERS  10  CONTAINS  charset="gb2312"
HEADERS  10  CONTAINS  charset=big5
HEADERS  10  CONTAINS  charset="big5"

BODY  10  CONTAINS  charset=gb2312"
BODY  10  CONTAINS  charset=3dgb2312"
BODY  10  CONTAINS  charset=big5"
BODY  10  CONTAINS  charset=3dbig5"
BODY  10  CONTAINS  content=zh-cn"
BODY  10  CONTAINS  content=3dzh-cn"


The END statements for the subject are meant as a precaution, although 
it's probably not necessary with the HIGHBIT filter ending on US-ASCII 
and ISO-8859-1 (plus a language definition hit for 'content="en-us"').
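
To make the flow of the filter easier to follow, here is a rough Python emulation. It is only a sketch of how the lines above read, not Declude's actual engine: the function name, the argument names, and the exact meaning given to SKIPIFWEIGHT, MAXWEIGHT and END are assumptions.

```python
SKIP_IF_WEIGHT = 25   # assumed: skip this filter once the message weight reaches 25
MAX_WEIGHT = 10       # assumed: the most this filter may add
POINTS_PER_HIT = 10

SUBJECT_END = ['charset=gb2312', 'charset="gb2312"', 'charset=big5', 'charset="big5"']
HEADER_HITS = ['=?gb2312?b?', '=?big5?b?'] + SUBJECT_END
BODY_HITS = ['charset=gb2312"', 'charset=3dgb2312"', 'charset=big5"',
             'charset=3dbig5"', 'content=zh-cn"', 'content=3dzh-cn"']


def chinese_weight(subject, headers, body, current_weight, failed_tests):
    """Return the weight this hypothetical CHINESE filter would add."""
    if current_weight >= SKIP_IF_WEIGHT:
        return 0                      # message already scored high enough
    if "HIGHBIT" not in failed_tests:
        return 0                      # TESTSFAILED END NOTCONTAINS HIGHBIT
    if any(p in subject.lower() for p in SUBJECT_END):
        return 0                      # SUBJECT END CONTAINS ...
    hits = sum(p in headers.lower() for p in HEADER_HITS)
    hits += sum(p in body.lower() for p in BODY_HITS)
    return min(hits * POINTS_PER_HIT, MAX_WEIGHT)


headers = 'Content-Type: text/plain; charset="gb2312"'
print(chinese_weight("hello", headers, "", 0, {"HIGHBIT"}))
print(chinese_weight("hello", headers, "", 0, set()))
```

The second call shows the pre-qualification: without a HIGHBIT hit the charset label alone adds nothing.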

I do believe that you can apply a similar technique to spam in Spanish, 
but since the characterset is the same as English, you would be 
searching for those 'content=' markers in combination with special 
characters (a short list in this case).  We hardly see any Spanish spam, 
or at least held Spanish spam so I'm doing nothing about it.  Spanish is 
of course a lot more common in US E-mail.  It may be that some Spanish 
spam isn't identified as Spanish since that's not necessary for proper 
display in most E-mail clients, but I have seen no proof of that.
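
A sketch of what that Spanish check might look like, in Python. The marker strings, the character list and the threshold are all hypothetical; nothing here has been tested against real mail.

```python
# Hypothetical language markers; real messages may label Spanish differently.
SPANISH_MARKERS = ('content="es', 'content=3d"es')
# A short list of characters common in Spanish but rare in English.
SPANISH_CHARS = "ñ¿¡áéíóú"


def looks_spanish(body: str, min_special: int = 3) -> bool:
    """True only if a language marker AND enough special characters appear."""
    text = body.lower()
    has_marker = any(m in text for m in SPANISH_MARKERS)
    special = sum(text.count(c) for c in SPANISH_CHARS)
    return has_marker and special >= min_special


sample = '<meta http-equiv="Content-Language" content="es-mx"> ¿Cómo está, señor?'
print(looks_spanish(sample))
```

Requiring both conditions is the point: the special characters alone also turn up in ordinary English mail (names, résumé, and so on).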

Matt



Scott Fisher wrote:

  
  
Interesting. I generally just punish people if GB2312, BIG5 or such are in the headers. This is overwhelmingly SPAM, but like you said there is English in some of those messages.

It looks like the GB2312 Chinese characters will have B0 to F7 as the high byte 
and A0 to FF as the low byte. 
If the GB2312 Chinese is present, I would think most every character should be one of these:
°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷

Checking some of my e-mails confirms that.
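
The byte ranges can be checked with a few lines of Python. This is a sketch only: the function name is made up, and it uses the slightly narrower A1 to FE trail-byte range from the EUC-CN form of GB2312, which sits inside the A0 to FF range given above.

```python
def is_gb2312_hanzi(data: bytes) -> bool:
    """True if data is a non-empty run of byte pairs in the GB2312 hanzi range."""
    if not data or len(data) % 2:
        return False
    return all(0xB0 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE
               for hi, lo in zip(data[0::2], data[1::2]))


sample = "中文".encode("gb2312")
print(sample.hex())               # the raw byte pairs
print(sample.decode("latin-1"))   # the same bytes read as ISO-8859-1
print(is_gb2312_hanzi(sample))
```

Read as ISO-8859-1, each byte of the sample shows up as one of the characters in the list above, which is why searching for those characters works as a body test.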

The bad news is that requires another body filter. It's too bad there wasn't a BODY256 filter type where only the first 256 bytes would be checked. That would certainly be enough to score up these, and wouldn't be a CPU hog. I'm not certain that I'd want to throw another body filter at my few Chinese spams.
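
For illustration, the BODY256 idea amounts to looking only at a prefix of the body. A Python sketch (BODY256 does not exist in Declude as far as this thread says; the function name and the 256-byte default are just the numbers from the paragraph above):

```python
def high_bit_ratio(body: bytes, limit: int = 256) -> float:
    """Fraction of bytes with the high bit set in the first `limit` bytes."""
    prefix = body[:limit]             # never scan past the prefix
    if not prefix:
        return 0.0
    return sum(b >= 0x80 for b in prefix) / len(prefix)


print(high_bit_ratio(b"plain ascii text"))
print(high_bit_ratio("中文".encode("gb2312") * 50))
```

The cost is bounded by the prefix length no matter how large the message is, which is the CPU saving being asked for.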

How often do you get a body indication of GB2312 / Cyrillic charactersets with no header indication?

It's an interesting subject because those few Chinese spams that get through to three of my accounts frustrate me.
Got any tips for Spanish spam?

Scott Fisher
Director of IT
Farm Progress Companies

 



  

  [EMAIL PROTECTED] 05/21/04 03:17PM >>>
   

  

  

No, just one, but it won't score unless there is a header or body 
indication of the GB2312 or Windows-1251 charactersets.  I'm using a 
combo filter in Declude where the HIGHBIT filter is non-scoring, and the 
CHINESE and CYRILLIC filters contain a line that says:

   TESTSFAILED  END  NOTCONTAINS  HIGHBIT

I'm pretty sure that the CHINESE and CYRILLIC filters will always hit 
where appropriate unless the HIGHBIT test doesn't hit.  I have about 65 
different high bit characters in that filter presently, all copied from 
spam.  If Scott was around, I would ask him how the NONENGLISH test is 
tripped because that might accomplish the same goals, however I'm not 
sure if it also scores the definition of a characterset, in which case 
it would have false positives in this scenario.

Matt



Scott Fisher wrote:

 



  Interesting.

Are you searching for 2 character pairs with GB2312?

S

Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

2004-05-21 Thread Matt




Not if you want to exclude certain domains from hitting individual
languages.  The test for HIGHBIT pre-qualifies the message, and then
there are separate tests for CHINESE and CYRILLIC that can each be
independently excluded.  I would have to exclude both
languages/charactersets if I reversed the order.

There are currently 35 characters to be searched after updating for
your list and trimming somewhat randomly for size.  Considering the END
statements for US-ASCII and ISO-8859-1, and SKIPIFWEIGHT settings, the
full filter should rarely ever run.  You could also do a TESTSFAILED
END NOTCONTAINS NONENGLISH potentially, but I'm still not positive how
that hits.  Like I said before, I'm not sure if NONENGLISH hits on
characters or the definition of charactersets, or both, and whether or
not it has any limitations (like hitting on common extended English
characters).  If it only hits on what we consider high bit characters,
the HIGHBIT test could be abandoned.

BTW, I just reviewed in detail why some of this stuff was still
appearing in my Hold file and the reason is two-fold.  For one, I
needed to add more points as all of the Chinese and Cyrillic
charactersets that I was tagging did in fact hit these filters and
scored 10 points.  I was being somewhat cautious since this was a new
filter and I did find that one false positive so I'll wait a bit longer
to increase the score.  The second reason why I was still seeing
some funky stuff was that some of the Cyrillic messages were encoded in
KOI8-R and I wasn't filtering for that...but I am now :)  I also added
KOI8-U (the Ukrainian version) just for good measure.
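
The CYRILLIC filter itself isn't shown in this thread, so the following Python sketch is only an assumption that it mirrors the CHINESE one: it looks for the three charset names mentioned here, either in a charset= label (plain or quoted-printable) or in an encoded-word header. The regular expression and function name are illustrative.

```python
import re

CYRILLIC_CHARSETS = ("windows-1251", "koi8-r", "koi8-u")
_names = "|".join(re.escape(c) for c in CYRILLIC_CHARSETS)

# Matches charset=NAME, charset="NAME", charset=3dNAME, or =?NAME?b? / =?NAME?q?
CYRILLIC_RE = re.compile(
    r'charset=(?:3d)?"?(?:%s)\b|=\?(?:%s)\?[bq]\?' % (_names, _names),
    re.IGNORECASE,
)


def declares_cyrillic(text: str) -> bool:
    """True if the text labels itself with one of the Cyrillic charsets."""
    return CYRILLIC_RE.search(text) is not None


print(declares_cyrillic('Content-Type: text/plain; charset="KOI8-R"'))
print(declares_cyrillic('Content-Type: text/plain; charset=us-ascii'))
```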

Matt





Scott Fisher wrote:

  Wouldn't it be better to reverse the order?

Run the subject and header tests on the majority of the mail.
Then run the body with a TESTSFAILED END NOTCONTAINS CHINESE. 
You should end up with fewer body searches this way.

Scott Fisher
Director of IT
Farm Progress Companies

  
  

  
[EMAIL PROTECTED] 05/21/04 04:28PM >>>

  

  
  I think you might have possibly identified the group of required 
characters.  I'll give that a try.  I'm not sure if any Cyrillic stuff 
has been passing through but this bears watching as well and I might 
have to change my list there as well.

I am also tagging BIG5, however almost all spam comes in GB2312.  Here's 
what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT  25
MAXWEIGHT  10

TESTSFAILED  END  NOTCONTAINS  HIGHBIT

SUBJECT  END  CONTAINS  charset=gb2312
SUBJECT  END  CONTAINS  charset="gb2312"
SUBJECT  END  CONTAINS  charset=big5
SUBJECT  END  CONTAINS  charset="big5"

HEADERS  10  CONTAINS  =?gb2312?b?
HEADERS  10  CONTAINS  =?big5?b?
HEADERS  10  CONTAINS  charset=gb2312
HEADERS  10  CONTAINS  charset="gb2312"
HEADERS  10  CONTAINS  charset=big5
HEADERS  10  CONTAINS  charset="big5"

BODY  10  CONTAINS  charset=gb2312"
BODY  10  CONTAINS  charset=3dgb2312"
BODY  10  CONTAINS  charset=big5"
BODY  10  CONTAINS  charset=3dbig5"
BODY  10  CONTAINS  content=zh-cn"
BODY  10  CONTAINS  content=3dzh-cn"


The END statements for the subject are meant as a precaution, although 
it's probably not necessary with the HIGHBIT filter ending on US-ASCII 
and ISO-8859-1 (plus a language definition hit for 'content="en-us"').

I do believe that you can apply a similar technique to spam in Spanish, 
but since the characterset is the same as English, you would be 
searching for those 'content=' markers in combination with special 
characters (a short list in this case).  We hardly see any Spanish spam, 
or at least held Spanish spam so I'm doing nothing about it.  Spanish is 
of course a lot more common in US E-mail.  It may be that some Spanish 
spam isn't identified as Spanish since that's not necessary for proper 
display in most E-mail clients, but I have seen no proof of that.

Matt



Scott Fisher wrote:

  
  
Interesting. I generally just punish people if GB2312, BIG5 or such are in the headers. This is overwhelmingly SPAM, but like you said there is English in some of those messages.

It looks like the GB2312 Chinese characters will have B0 to F7 as the high byte 
and A0 to FF as the low byte. 
If the GB2312 Chinese is present, I would think most every character should be one of these:
°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷

Checking some of my e-mails confirms that.

The bad news is that requires another body filter. It's too bad there wasn't a BODY256 filter type where only the first 256 bytes would be checked. That would certainly be enough to score up these, and wouldn't be a CPU hog. I'm not certain that I'd want to throw another body filter at my few Chinese spams.

How often 

Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

2004-05-21 Thread Scott Fisher
I used this reference on character set names:
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp

Would you really encounter a =3d ?
Should it be charset3dgb2312?

Enjoyed your thoughts, glad you didn't post this on a Monday...

Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 04:28PM >>>
I think you might have possibly identified the group of required 
characters.  I'll give that a try.  I'm not sure if any Cyrillic stuff 
has been passing through but this bears watching as well and I might 
have to change my list there as well.

I am also tagging BIG5, however almost all spam comes in GB2312.  Here's 
what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT  25
MAXWEIGHT  10

TESTSFAILED  END  NOTCONTAINS  HIGHBIT

SUBJECT  END  CONTAINS  charset=gb2312
SUBJECT  END  CONTAINS  charset="gb2312"
SUBJECT  END  CONTAINS  charset=big5
SUBJECT  END  CONTAINS  charset="big5"

HEADERS  10  CONTAINS  =?gb2312?b?
HEADERS  10  CONTAINS  =?big5?b?
HEADERS  10  CONTAINS  charset=gb2312
HEADERS  10  CONTAINS  charset="gb2312"
HEADERS  10  CONTAINS  charset=big5
HEADERS  10  CONTAINS  charset="big5"

BODY  10  CONTAINS  charset=gb2312"
BODY  10  CONTAINS  charset=3dgb2312"
BODY  10  CONTAINS  charset=big5"
BODY  10  CONTAINS  charset=3dbig5"
BODY  10  CONTAINS  content=zh-cn"
BODY  10  CONTAINS  content=3dzh-cn"


The END statements for the subject are meant as a precaution, although 
it's probably not necessary with the HIGHBIT filter ending on US-ASCII 
and ISO-8859-1 (plus a language definition hit for 'content="en-us"').

I do believe that you can apply a similar technique to spam in Spanish, 
but since the characterset is the same as English, you would be 
searching for those 'content=' markers in combination with special 
characters (a short list in this case).  We hardly see any Spanish spam, 
or at least held Spanish spam so I'm doing nothing about it.  Spanish is 
of course a lot more common in US E-mail.  It may be that some Spanish 
spam isn't identified as Spanish since that's not necessary for proper 
display in most E-mail clients, but I have seen no proof of that.

Matt



Scott Fisher wrote:

>Interesting. I generally just punish people if GB2312, BIG5 or such are in the 
>headers. This is overwhelmingly SPAM, but like you said there is English in some of 
>those messages.
>
>It looks like the GB2312 Chinese characters will have B0 to F7 as the high byte 
>and A0 to FF as the low byte. 
>If the GB2312 Chinese is present, I would think most every character should be one of 
>these:
>°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷
>
>Checking some of my e-mails confirms that.
>
>The bad news is that requires another body filter. It's too bad there wasn't a 
>BODY256 filter type where only the first 256 bytes would be checked. That would 
>certainly be enough to score up these, and wouldn't be a CPU hog. I'm not certain 
>that I'd want to throw another body filter at my few Chinese spams.
>
>How often do you get a body indication of GB2312 / Cyrillic charactersets with no 
>header indication?
>
>It's an interesting subject because those few Chinese spams that get through to 
>three of my accounts frustrate me.
>Got any tips for Spanish spam?
>
>Scott Fisher
>Director of IT
>Farm Progress Companies
>
>  
>
[EMAIL PROTECTED] 05/21/04 03:17PM >>>


>No, just one, but it won't score unless there is a header or body 
>indication of the GB2312 or Windows-1251 charactersets.  I'm using a 
>combo filter in Declude where the HIGHBIT filter is non-scoring, and the 
>CHINESE and CYRILLIC filters contain a line that says:
>
>TESTSFAILED  END  NOTCONTAINS  HIGHBIT
>
>I'm pretty sure that the CHINESE and CYRILLIC filters will always hit 
>where appropriate unless the HIGHBIT test doesn't hit.  I have about 65 
>different high bit characters in that filter presently, all copied from 
>spam.  If Scott was around, I would ask him how the NONENGLISH test is 
>tripped because that might accomplish the same goals, however I'm not 
>sure if it also scores the definition of a characterset, in which case 
>it would have false positives in this scenario.
>
>Matt
>
>
>
>Scott Fisher wrote:
>
>  
>
>>Interesting.
>>
>>Are you searching for 2 character pairs with GB2312?
>>
>>Scott Fisher
>>Director of IT
>>Farm Progress Companies
>>
>> 
>>
>>
>>
>[EMAIL PROTECTED] 05/21/04 01:46PM >>>
>   
>
>  
>
>>Scott,
>>
>>Regarding my Cyrillic and Chinese filters, I did a review of a full 
>>week's held spam, looking for foreign languages and patterns to tag.  I 
>>found from other re

Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

2004-05-21 Thread Scott Fisher
Wouldn't it be better to reverse the order?

Run the subject and header tests on the majority of the mail.
Then run the body with a TESTSFAILED END NOTCONTAINS CHINESE. 
You should end up with fewer body searches this way.

Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 04:28PM >>>
I think you might have possibly identified the group of required 
characters.  I'll give that a try.  I'm not sure if any Cyrillic stuff 
has been passing through but this bears watching as well and I might 
have to change my list there as well.

I am also tagging BIG5, however almost all spam comes in GB2312.  Here's 
what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT  25
MAXWEIGHT  10

TESTSFAILED  END  NOTCONTAINS  HIGHBIT

SUBJECT  END  CONTAINS  charset=gb2312
SUBJECT  END  CONTAINS  charset="gb2312"
SUBJECT  END  CONTAINS  charset=big5
SUBJECT  END  CONTAINS  charset="big5"

HEADERS  10  CONTAINS  =?gb2312?b?
HEADERS  10  CONTAINS  =?big5?b?
HEADERS  10  CONTAINS  charset=gb2312
HEADERS  10  CONTAINS  charset="gb2312"
HEADERS  10  CONTAINS  charset=big5
HEADERS  10  CONTAINS  charset="big5"

BODY  10  CONTAINS  charset=gb2312"
BODY  10  CONTAINS  charset=3dgb2312"
BODY  10  CONTAINS  charset=big5"
BODY  10  CONTAINS  charset=3dbig5"
BODY  10  CONTAINS  content=zh-cn"
BODY  10  CONTAINS  content=3dzh-cn"


The END statements for the subject are meant as a precaution, although 
it's probably not necessary with the HIGHBIT filter ending on US-ASCII 
and ISO-8859-1 (plus a language definition hit for 'content="en-us"').

I do believe that you can apply a similar technique to spam in Spanish, 
but since the characterset is the same as English, you would be 
searching for those 'content=' markers in combination with special 
characters (a short list in this case).  We hardly see any Spanish spam, 
or at least held Spanish spam so I'm doing nothing about it.  Spanish is 
of course a lot more common in US E-mail.  It may be that some Spanish 
spam isn't identified as Spanish since that's not necessary for proper 
display in most E-mail clients, but I have seen no proof of that.

Matt



Scott Fisher wrote:

>Interesting. I generally just punish people if GB2312, BIG5 or such are in the 
>headers. This is overwhelmingly SPAM, but like you said there is English in some of 
>those messages.
>
>It looks like the GB2312 Chinese characters will have B0 to F7 as the high byte 
>and A0 to FF as the low byte. 
>If the GB2312 Chinese is present, I would think most every character should be one of 
>these:
>°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷
>
>Checking some of my e-mails confirms that.
>
>The bad news is that requires another body filter. It's too bad there wasn't a 
>BODY256 filter type where only the first 256 bytes would be checked. That would 
>certainly be enough to score up these, and wouldn't be a CPU hog. I'm not certain 
>that I'd want to throw another body filter at my few Chinese spams.
>
>How often do you get a body indication of GB2312 / Cyrillic charactersets with no 
>header indication?
>
>It's an interesting subject because those few Chinese spams that get through to 
>three of my accounts frustrate me.
>Got any tips for Spanish spam?
>
>Scott Fisher
>Director of IT
>Farm Progress Companies
>
>  
>
[EMAIL PROTECTED] 05/21/04 03:17PM >>>


>No, just one, but it won't score unless there is a header or body 
>indication of the GB2312 or Windows-1251 charactersets.  I'm using a 
>combo filter in Declude where the HIGHBIT filter is non-scoring, and the 
>CHINESE and CYRILLIC filters contain a line that says:
>
>TESTSFAILED  END  NOTCONTAINS  HIGHBIT
>
>I'm pretty sure that the CHINESE and CYRILLIC filters will always hit 
>where appropriate unless the HIGHBIT test doesn't hit.  I have about 65 
>different high bit characters in that filter presently, all copied from 
>spam.  If Scott was around, I would ask him how the NONENGLISH test is 
>tripped because that might accomplish the same goals, however I'm not 
>sure if it also scores the definition of a characterset, in which case 
>it would have false positives in this scenario.
>
>Matt
>
>
>
>Scott Fisher wrote:
>
>  
>
>>Interesting.
>>
>>Are you searching for 2 character pairs with GB2312?
>>
>>Scott Fisher
>>Director of IT
>>Farm Progress Companies
>>
>> 
>>
>>
>>
>[EMAIL PROTECTED] 05/21/04 01:46PM >>>
>   
>
>  
>
>>Scott,
>>
>>Regarding my Cyrillic and Chinese filters, I did a review of a full 
>>week's held spam, looking for foreign languages and patterns to tag.  I 
>>found from other research that the primary Chinese characterset, GB2312, 
>>cont