Re: [sniffer] OT: Language filtering in Declude, was Possible blip?
Scott,

All of those patterns came directly from the source code of E-mails. The charset=3dgb2312" pattern is quite common in the body of E-mail; in fact, I think it is FrontPage or another Microsoft product that encodes it this way when it appears in a META tag.

Share and share alike :)

Matt

Scott Fisher wrote:

I used this reference on names character sets: http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp

Would you really encounter a =3d? Should it be charset3dgb2312?

Enjoyed your thoughts, glad you didn't post this on a Monday...

Scott Fisher
Director of IT
Farm Progress Companies

[EMAIL PROTECTED] 05/21/04 04:28PM >>>

I think you might have identified the group of required characters. I'll give that a try. I'm not sure if any Cyrillic stuff has been passing through, but this bears watching and I might have to change my list there as well. I am also tagging BIG5; however, almost all spam comes in GB2312. Here's what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT    25
MAXWEIGHT       10

TESTSFAILED     END     NOTCONTAINS     HIGHBIT

SUBJECT         END     CONTAINS        charset=gb2312
SUBJECT         END     CONTAINS        charset="gb2312"
SUBJECT         END     CONTAINS        charset=big5
SUBJECT         END     CONTAINS        charset="big5"

HEADERS         10      CONTAINS        =?gb2312?b?
HEADERS         10      CONTAINS        =?big5?b?
HEADERS         10      CONTAINS        charset=gb2312
HEADERS         10      CONTAINS        charset="gb2312"
HEADERS         10      CONTAINS        charset=big5
HEADERS         10      CONTAINS        charset="big5"

BODY            10      CONTAINS        charset=gb2312"
BODY            10      CONTAINS        charset=3dgb2312"
BODY            10      CONTAINS        charset=big5"
BODY            10      CONTAINS        charset=3dbig5"
BODY            10      CONTAINS        content=zh-cn"
BODY            10      CONTAINS        content=3dzh-cn"

The END statements for the subject are meant as a precaution, although they're probably not necessary with the HIGHBIT filter ending on US-ASCII and ISO-8859-1 (plus a language definition hit for 'content="en-us"').
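As a rough illustration of how the rules above combine, here is a minimal Python sketch. It is not Declude code, and the evaluation semantics are assumptions read off the directive names: SKIPIFWEIGHT skips the filter once the message already weighs 25, an END rule stops processing without scoring, MAXWEIGHT caps the total at 10, and matching is treated as case-insensitive. The function name and its parameters are hypothetical.

```python
# Hypothetical emulation of the CHINESE filter; not Declude's implementation.
SKIP_IF_WEIGHT = 25  # assumed meaning: skip the filter at or above this weight
MAX_WEIGHT = 10      # assumed meaning: cap on the weight this filter can add

SUBJECT_END = ['charset=gb2312', 'charset="gb2312"',
               'charset=big5', 'charset="big5"']
HEADER_HITS = ['=?gb2312?b?', '=?big5?b?',
               'charset=gb2312', 'charset="gb2312"',
               'charset=big5', 'charset="big5"']
BODY_HITS = ['charset=gb2312"', 'charset=3dgb2312"',
             'charset=big5"', 'charset=3dbig5"',
             'content=zh-cn"', 'content=3dzh-cn"']


def chinese_filter_weight(subject, headers, body, current_weight, tests_failed):
    """Return the weight this filter would add under the assumed semantics."""
    # SKIPIFWEIGHT: the message is already heavy enough, so do no work.
    if current_weight >= SKIP_IF_WEIGHT:
        return 0
    # TESTSFAILED END NOTCONTAINS HIGHBIT: stop unless HIGHBIT pre-qualified it.
    if "HIGHBIT" not in tests_failed:
        return 0
    # SUBJECT END CONTAINS ...: precautionary exits, no score.
    subject_l = subject.lower()
    if any(pattern in subject_l for pattern in SUBJECT_END):
        return 0
    # HEADERS / BODY rules: 10 points per matching pattern, then capped.
    headers_l, body_l = headers.lower(), body.lower()
    weight = 10 * sum(pattern in headers_l for pattern in HEADER_HITS)
    weight += 10 * sum(pattern in body_l for pattern in BODY_HITS)
    return min(weight, MAX_WEIGHT)


headers = 'Content-Type: text/html; charset="GB2312"'
body = '<meta content=3D"text/html; charset=3Dgb2312">'
print(chinese_filter_weight("hello", headers, body, 0, {"HIGHBIT"}))
```

Because the cheap checks come first, the body scan only runs for messages that the HIGHBIT test has already flagged and that are not yet over the weight threshold.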
I do believe that you can apply a similar technique to spam in Spanish, but since the characterset is the same as English, you would be searching for those 'content=' markers in combination with special characters (a short list in this case). We hardly see any Spanish spam, or at least held Spanish spam, so I'm doing nothing about it. Spanish is of course a lot more common in US E-mail. It may be that some Spanish spam isn't identified as Spanish since that's not necessary for proper display in most E-mail clients, but I have seen no proof of that.

Matt

Scott Fisher wrote:

Interesting. I generally just punish people if GB2312 ?BIG5 or such are in the headers. This is overwhelmingly spam, but like you said there is English in some of those messages.

It looks like the GB2312 Chinese characters will have a B0 to F7 as the high byte and an A0 to FF as the low byte. If GB2312 Chinese is present, I would think almost every character should be one of these:

°±²³ µ¶· *º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷

Checking some of my e-mails confirms that.

The bad news is that this requires another body filter. It's too bad there isn't a BODY256 filter type where only the first 256 bytes would be checked. That would certainly be enough to score these up, and wouldn't be a CPU hog. I'm not certain that I'd want to throw another body filter at my few Chinese spams.

How often do you get a body indication of GB2312 / Cyrillic charactersets with no header indication?

It's an interesting subject because those few Chinese spams that get through to three of my accounts frustrate me.

Got any tips for Spanish spam?

Scott Fisher
Director of IT
Farm Progress Companies

[EMAIL PROTECTED] 05/21/04 03:17PM >>>

No, just one, but it won't score unless there is a header or body indication of the GB2312 or Windows-1251 charactersets. I'm using a combo filter in Declude where the HIGHBIT filter is non-scoring, and the CHINESE and CYRILLIC filters contain a line that says:

TESTSFAILED END NOTCONTAINS HIGHBIT

I'm pretty sure that the CHINESE and CYRILLIC filters will always hit where appropriate unless the HIGHBIT test doesn't hit. I have about 65 different high bit characters in that filter presently, all copied from spam.

If Scott was around, I would ask him how the NONENGLISH test is tripped because that might accomplish the same goals; however, I'm not sure if it also scores the definition of a characterset, in which case it would have false positives in this scenario.

Matt

Scott Fisher wrote:

Interesting. Are you searching for 2 character pairs with GB2312?

S
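To make the byte-range observation concrete, here is a small Python sketch that counts GB2312 hanzi byte pairs in raw body bytes. It uses the high-byte range B0 to F7 quoted above; for the low byte it uses A1 to FE, which is what the EUC-CN encoding of GB2312 defines and is slightly narrower than the A0 to FF mentioned above. The function name and the 256-byte window (borrowed from the BODY256 wish) are illustrative assumptions, not Declude features.

```python
def count_gb2312_pairs(raw: bytes, window: int = 256) -> int:
    """Count byte pairs in the first `window` bytes that look like GB2312 hanzi."""
    data = raw[:window]
    count = 0
    i = 0
    while i < len(data) - 1:
        high, low = data[i], data[i + 1]
        if 0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE:
            count += 1
            i += 2  # consume both bytes of the matched pair
        else:
            i += 1  # not a pair: slide forward one byte
    return count


chinese = "你好世界".encode("gb2312")          # four hanzi, two bytes each
latin = "naïve café".encode("latin-1")         # isolated accented letters

print(count_gb2312_pairs(chinese))
print(count_gb2312_pairs(latin))
```

Two adjacent accented Latin-1 letters can also fall inside these ranges, so requiring several pairs before scoring is likely safer than reacting to a single one.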
Re: [sniffer] OT: Language filtering in Declude, was Possible blip?
Not if you want to exclude certain domains from hitting individual languages. The test for HIGHBIT pre-qualifies the message, and then there are separate tests for CHINESE and CYRILLIC that can each be independently excluded. I would have to exclude both languages/charactersets if I reversed the order.

There are currently 35 characters to be searched after updating for your list and trimming somewhat randomly for size. Considering the END statements for US-ASCII and ISO-8859-1, and the SKIPIFWEIGHT settings, the full filter should rarely ever run.

You could also potentially do a TESTSFAILED END NOTCONTAINS NONENGLISH, but I'm still not positive how that hits. Like I said before, I'm not sure if NONENGLISH hits on characters or the definition of charactersets, or both, and whether or not it has any limitations (like hitting on common extended English characters). If it only hits on what we consider high bit characters, the HIGHBIT test could be abandoned.

BTW, I just reviewed in detail why some of this stuff was still appearing in my Hold file, and the reason is two-fold. For one, I needed to add more points, as all of the Chinese and Cyrillic charactersets that I was tagging did in fact hit these filters and scored 10 points. I was being somewhat cautious since this was a new filter and I did find that one false positive, so I'll wait a bit longer to increase the score. The second reason why I was still seeing some of this funky stuff was that some of the Cyrillic messages were encoded in KOI8-R and I wasn't filtering for that... but I am now :) I also added KOI8-U (the Ukrainian version) just for good measure.

Matt

Scott Fisher wrote:

Wouldn't it be better to reverse the order? Run the subject and header tests on the majority of the mail. Then run the body with a TESTSFAILED END NOTCONTAINS CHINESE. You should end up with less body searches this way.
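The KOI8-R observation above is easy to reproduce: the same Cyrillic text produces entirely different bytes under Windows-1251 and KOI8-R, and the declared charset name differs too, so a filter keyed to one encoding does not see the other. A minimal Python sketch using the standard codecs (the sample word is only an illustration):

```python
word = "спам"  # "spam" in Russian, used only as sample text

cp1251_bytes = word.encode("cp1251")   # Windows-1251
koi8r_bytes = word.encode("koi8_r")    # KOI8-R

print("windows-1251:", cp1251_bytes.hex(" "))
print("koi8-r:      ", koi8r_bytes.hex(" "))

# Both encodings put every Cyrillic letter in the high-bit range,
# so a high-bit pre-check still fires for either one.
print(all(b >= 0x80 for b in cp1251_bytes + koi8r_bytes))

# The byte values themselves do not line up between the two encodings.
print(cp1251_bytes == koi8r_bytes)
```

This is consistent with the approach in the thread: the high-bit test stays generic, while each charset name gets its own pattern.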
Re: [sniffer] OT: Language filtering in Declude, was Possible blip?
I used this reference on names character sets: http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/charsets/charset4.asp

Would you really encounter a =3d? Should it be charset3dgb2312?

Enjoyed your thoughts, glad you didn't post this on a Monday...

Scott Fisher
Director of IT
Farm Progress Companies
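On the =3d question: in a quoted-printable body the equals sign itself is escaped as =3D, so the pattern keeps its leading "=" and gains "3d" after it. A minimal sketch with Python's standard quopri module; the META content string is a made-up example, and lowercasing is done here only to compare against the pattern as it is written in the filter:

```python
import quopri

# Hypothetical fragment of an HTML META tag as it might appear in a message.
raw = b'content="text/html; charset=gb2312"'

# Quoted-printable encoding escapes each "=" as "=3D".
encoded = quopri.encodestring(raw)
print(encoded)

# The filter pattern from the thread is found in the encoded form.
print(b'charset=3dgb2312"' in encoded.lower())

# Decoding restores the original bytes.
print(quopri.decodestring(encoded) == raw)
```

A pattern written as charset3dgb2312 would not match, because the "=" that introduces the escape is still present in the encoded text.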
Re: [sniffer] OT: Language filtering in Declude, was Possible blip?
Wouldn't it be better to reverse the order? Run the subject and header tests on the majority of the mail. Then run the body with a TESTSFAILED END NOTCONTAINS CHINESE. You should end up with less body searches this way.

Scott Fisher
Director of IT
Farm Progress Companies

>[EMAIL PROTECTED] 05/21/04 01:46PM >>>
>
>>Scott,
>>
>>Regarding my Cyrillic and Chinese filters, I did a review of a full
>>week's held spam, looking for foreign languages and patterns to tag. I
>>found from other research that the primary Chinese characterset, GB2312,
>>cont