Re: [sniffer] OT: Language filtering in Declude, was Possible blip?

Matt Fri, 21 May 2004 14:29:22 -0700

I think you may have identified the group of required characters. I'll give that a try. I'm not sure whether any Cyrillic stuff has been passing through, but this bears watching and I might have to change my list there as well.

I am also tagging BIG5; however, almost all of the spam comes in GB2312. Here's what I'm searching for in the CHINESE filter:

# CHINESE v1.0.0

SKIPIFWEIGHT    25
MAXWEIGHT    10

TESTSFAILED    END    NOTCONTAINS    HIGHBIT

SUBJECT        END    CONTAINS    charset=gb2312
SUBJECT        END    CONTAINS    charset="gb2312"
SUBJECT        END    CONTAINS    charset=big5
SUBJECT        END    CONTAINS    charset="big5"

HEADERS        10    CONTAINS    =?gb2312?b?
HEADERS        10    CONTAINS    =?big5?b?
HEADERS        10    CONTAINS    charset=gb2312
HEADERS        10    CONTAINS    charset="gb2312"
HEADERS        10    CONTAINS    charset=big5
HEADERS        10    CONTAINS    charset="big5"

BODY        10    CONTAINS    charset=gb2312"
BODY        10    CONTAINS    charset=3dgb2312"
BODY        10    CONTAINS    charset=big5"
BODY        10    CONTAINS    charset=3dbig5"
BODY        10    CONTAINS    content=zh-cn"
BODY        10    CONTAINS    content=3dzh-cn"

The END statements for the subject are meant as a precaution, although they're probably not necessary given that the HIGHBIT filter ends on US-ASCII and ISO-8859-1 (plus a language definition hit for 'content="en-us"').
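To make the flow of the CHINESE filter easier to follow, here is a rough Python model of it. This is only a sketch of how the lines above read, not Declude's implementation: the `Message` class and `chinese_filter_score` function are hypothetical names, and the sketch assumes that CONTAINS matching is case-insensitive, that END stops the filter with no score, and that MAXWEIGHT caps the filter's total contribution.

```python
from dataclasses import dataclass, field

CHARSETS = ("gb2312", "big5")
BODY_MARKERS = (
    'charset=gb2312"', 'charset=3dgb2312"',
    'charset=big5"', 'charset=3dbig5"',
    'content=zh-cn"', 'content=3dzh-cn"',
)


@dataclass
class Message:
    """Hypothetical stand-in for a scanned message."""
    subject: str
    headers: str
    body: str
    tests_failed: set = field(default_factory=set)  # tests already failed
    weight: int = 0                                 # weight before this filter


def chinese_filter_score(msg, skip_if_weight=25, max_weight=10):
    """Weight the filter would add, under the assumptions stated above."""
    # SKIPIFWEIGHT: the message already scores high enough, so do nothing.
    if msg.weight >= skip_if_weight:
        return 0
    # TESTSFAILED END NOTCONTAINS HIGHBIT: stop unless HIGHBIT was failed.
    if "HIGHBIT" not in msg.tests_failed:
        return 0
    # SUBJECT END CONTAINS: a message that merely discusses these strings.
    subject = msg.subject.lower()
    for cs in CHARSETS:
        if f"charset={cs}" in subject or f'charset="{cs}"' in subject:
            return 0
    score = 0
    headers = msg.headers.lower()
    for cs in CHARSETS:
        for needle in (f"=?{cs}?b?", f"charset={cs}", f'charset="{cs}"'):
            if needle in headers:
                score += 10
    body = msg.body.lower()
    for needle in BODY_MARKERS:
        if needle in body:
            score += 10
    # MAXWEIGHT: cap whatever was accumulated.
    return min(score, max_weight)
```

The point of the gate is that a charset declaration alone never scores; the message must also have tripped the non-scoring HIGHBIT test.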

I do believe that you can apply a similar technique to spam in Spanish, but since the characterset is the same as English, you would be searching for those 'content=' markers in combination with special characters (a short list in this case). We hardly see any Spanish spam, or at least any held Spanish spam, so I'm doing nothing about it. Spanish is of course a lot more common in US E-mail. It may be that some Spanish spam isn't identified as Spanish, since that's not necessary for proper display in most E-mail clients, but I have seen no proof of that.

Matt

Scott Fisher wrote:

Interesting. I generally just punish people if GB2312, BIG5, or such are in the headers. This is overwhelmingly spam, but like you said, there is English in some of those messages.


It looks like GB2312 Chinese characters have a high byte from B0 to F7
and a low byte from A1 to FE.
If GB2312 Chinese is present, I would think nearly every character should be one of these:
°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷

Checking some of my e-mails confirms that.

The bad news is that this requires another body filter. It's too bad there isn't a BODY256 filter type where only the first 256 bytes would be checked. That would certainly be enough to score these up, and it wouldn't be a CPU hog. I'm not certain that I'd want to throw another body filter at my few Chinese spams.
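A rough sketch of that check, in Python rather than as a Declude filter: it counts byte pairs with a lead byte of 0xB0 to 0xF7 and a trail byte of 0xA1 to 0xFE (the EUC-CN bounds for GB2312 hanzi as I understand them). The function name and the optional `limit` argument are made up for illustration; `limit=256` mimics the BODY256 idea of looking only at the start of the body.

```python
def gb2312_pair_count(data: bytes, limit=None) -> int:
    """Count byte pairs that fall in the GB2312 Chinese character range."""
    if limit is not None:
        data = data[:limit]  # only inspect the first `limit` bytes
    count = 0
    i = 0
    while i + 1 < len(data):
        lead, trail = data[i], data[i + 1]
        if 0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE:
            count += 1
            i += 2  # consume both bytes of the pair
        else:
            i += 1  # ASCII or something else; move on one byte
    return count
```

For example, `gb2312_pair_count(raw_body, limit=256)` on a message body would give a small integer that could be compared against a threshold; plain ASCII text yields zero.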

How often do you get a body indication of GB2312 / Cyrillic charactersets with no header indication?

It's an interesting subject because those few Chinese spams that get through to three of my accounts frustrate me.
Got any tips for Spanish spam?

Scott Fisher
Director of IT
Farm Progress Companies

[EMAIL PROTECTED] 05/21/04 03:17PM >>>

No, just one, but it won't score unless there is a header or body 
indication of the GB2312 or Windows-1251 charactersets.  I'm using a 
combo filter in Declude where the HIGHBIT filter is non-scoring, and the 
CHINESE and CYRILLIC filters contain a line that says:

    TESTSFAILED      END      NOTCONTAINS      HIGHBIT

I'm pretty sure that the CHINESE and CYRILLIC filters will always hit 
where appropriate unless the HIGHBIT test doesn't hit.  I have about 65 
different high bit characters in that filter presently, all copied from 
spam.  If Scott was around, I would ask him how the NONENGLISH test is 
tripped because that might accomplish the same goals, however I'm not 
sure if it also scores the definition of a characterset, in which case 
it would have false positives in this scenario.

Matt



Scott Fisher wrote:

Interesting.

Are you searching for 2 character pairs with GB2312?

Scott Fisher
Director of IT
Farm Progress Companies

[EMAIL PROTECTED] 05/21/04 01:46PM >>>

Scott,

Regarding my Cyrillic and Chinese filters, I did a review of a full 
week's held spam, looking for foreign languages and patterns to tag.  I 
found from other research that the primary Chinese characterset, GB2312, 
contains the Western Latin characterset, and so someone could send an 
E-mail with this characterset defined and still have English as the 
message.  Because of this I do more than just look for the offending 
characterset, I've built a combo filter that looks for both high bit 
characters such as ¥ as well as body or header hits for encoding of 
GB2312 (Chinese) or Windows-1251 (Cyrillic).  I also have Declude 
END statements for appearances of US-ASCII and ISO-8859-1, so messages 
like this one that are referencing such patterns won't trip the filter.  
It seems to be stopping about 80% to 90% of the stuff, but I'm guessing 
that the stuff that is getting through didn't hit one of the high bit 
characters in my filter and I might need to simply expand my list a 
bit.  Unfortunately I have no idea what characters are most common, so 
I'm just eyeballing it from sources.

I had one false positive on a Yahoo Groups posting that referenced 
163.com, a Chinese free Web mail provider that inserts Chinese language 
footers.  The message was in English, but encoded in GB2312 and didn't 
indicate any sign of English besides the actual text.  Because of this, 
I might throw in an exception for the word "the " (followed by a space) 
just as a test to see if text in English is present, but I have to 
review that.  This message was also BASE64 encoded and that might be an 
appropriate exception???  The last pattern that I might look at is using 
the new MailPolice test for identifying Web-mail providers, and 
excepting them from the filter because they have issues with encoding 
languages I've found.
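A minimal sketch of the "the " exception for a BASE64 body, written in Python only to show the idea. The function name is hypothetical, and it assumes the body part has already been isolated and that its declared charset is known; a real filter would have to work on whatever the scanner exposes.

```python
import base64


def looks_english(body_b64: str, charset: str = "gb2312") -> bool:
    """Decode a BASE64 body and report whether it contains 'the '."""
    raw = base64.b64decode(body_b64)
    # errors="replace" keeps going if some bytes are not valid in the charset
    text = raw.decode(charset, errors="replace")
    return "the " in text.lower()
```

Because GB2312 includes the ASCII range, English text encoded this way decodes unchanged, so a hit on "the " suggests an English message that just happens to declare a Chinese charset.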

Hope this helps.

Matt



Scott Fisher wrote:

2 thoughts from me:

1. Right on about the Nigerian scams and possibly keeping these rules longer. As I was forwarding a Nigerian scam to the spam mailbox, I too wondered how long the Nigerian rules were kept in play. I might add that Nigeria's twin sister, the International Lottery spam, and stock spams might also be kept longer. I noticed an increase in the stock spams this week.

2. I've been tracking different character sets for a couple of weeks; the Chinese, Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese headers.

Scott Fisher
Director of IT
Farm Progress Companies

[EMAIL PROTECTED] 05/21/04 12:42PM >>>

Pete,

Our Hold range has returned to more normal territory on Thursday.  
Here's the stats from the week as a whole on what has been very 
consistent traffic.  Out of all E-mail processed, both good and bad, the 
%Hold represents what scored between 10-24 points on our system and 
needed review, the %Sniffer represents all Sniffer hits except for Gray, 
the %Spam is what we scanned and didn't deliver (generally about 99.8% 
of spam is caught at a score of 10 which this is based on), and the 
Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
scoring 10 or more.

  Day      %Hold    %Sniffer    %Spam    Sniffer/Spam
  Mon:     1.86%     77.27%     80.37%     96.14%
  Tue:     2.83%     74.53%     79.37%     93.39%
  Wed:     2.13%     77.60%     79.66%     97.41%
  Thur:    1.95%     76.50%     80.66%     94.84%
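
As a quick check on the last column, it appears to be %Sniffer divided by %Spam. The short Python sketch below recomputes it from the posted figures; the dictionary and function names are made up for illustration. Three of the four days reproduce exactly, while Tuesday comes out near 93.90% rather than the posted 93.39%, and the thread does not say why.

```python
# (%Sniffer, %Spam) per day, copied from the table above
stats = {
    "Mon": (77.27, 80.37),
    "Tue": (74.53, 79.37),
    "Wed": (77.60, 79.66),
    "Thur": (76.50, 80.66),
}


def sniffer_share(pct_sniffer: float, pct_spam: float) -> float:
    """Sniffer hits as a percentage of messages scoring 10 or more."""
    return 100.0 * pct_sniffer / pct_spam


for day, (pct_sniffer, pct_spam) in stats.items():
    print(f"{day}: {sniffer_share(pct_sniffer, pct_spam):.2f}%")
```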

The only changes that we made to our system were adding two smaller 
domains later in the week and introducing filters for Cyrillic and 
Chinese languages on Wednesday morning.  The filters have cut our hold 
file down by 0.38 percentage points on Thursday, which explains how our 
%Hold is lower on Thursday than on Wednesday despite a lower Sniffer hit 
rate on spam.

I did note two high volume untagged static spammers on Tuesday that we 
blacklisted locally, and that combined with the increase in Sniffer 
change rates (spam storm) might account for the changes that I saw.  I 
am wondering though about the recommendations that you have made for 
possibly fine tuning our rule base.  Again though, please keep in mind 
that I still feel that performance is overall very, very good.

One of my thoughts regarding minimum rule strengths and grace periods is 
that all groups aren't necessarily the same.  For instance Nigerian 
scams are low volume and sporadic, and my system performs the worst on 
these things.  Maybe lower rule strengths and longer grace periods make 
much more sense for the Phishing category than they do for many other 
categories.  Is that possible?

I also looked up the rule strengths on your site and found that about 
50%, or maybe more, have a strength below 1, and maybe lowering that is 
worth testing out so long as I don't massively increase the number of 
records.  I do think though that I would like to test out extending the 
grace period.  Most of my false positives are not on things that this 
would affect, and that might give niche sources a little extra coverage 
if I understand things correctly.

I'll follow your directions and contact you directly regarding any 
affirmative changes, but I thought it might be beneficial to keep this 
discussion public since some other stats hounds might find this 
information to be of use :)

If you can glean anything from the numbers that I gave you, please add 
your thoughts.

Thanks,

Matt





Pete McNeil wrote:

At 05:00 PM 5/19/2004, you wrote:

<snip/>

I haven't yet upgraded to the most recent release, I'm still on the 
prior beta.  I'll probably do that this evening.  I tend to wait on 
upgrades until there has been enough time for bugs to surface unless 
I am already looking for a fix.  I'm sure that the extra verification 
of the rulebase will help prevent the potential of problems, and I 
guess this has the possibility of being caused by a bit of corrupted 
data, though that's probably reaching.

There were no substantive changes from the beta to the production 
version. Largely just a removal of monitoring code.

Again, regardless if there was a blip, Sniffer still does a wonderful 
job of tagging lots and lots of E-mail, just not quite as much as the 
day before.

Last night I was able to adjust the rule strength analysis window back 
to its original settings. About 5 days of data were lost - but those 
days will be recovered quickly. Please let me know if this adjustment 
improved your conditions.

I've noted that on a number of other lists there seem to be posts 
about a sudden increase in spam over the past few days. We are 
definitely seeing this also - approximately a 25% or more increase in 
new rule additions in the past 4 days:

http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp 

Specifically note from about 4 days ago...


Days Ago Adjustments
-------- -----------

0        356
1        508
2        391
3        410
4        410
5        326
6        309
7        371
8        292
9        347
10       309



( 5-10 : 1954/6 -> 325.67, 0-4 : 2075/5 -> 415, 325.67/415 -> 78.47% )
Note that day 0 is not complete. So applying a "fudge factor" 78.4 
_looks like_ 75%.
Besides, 92% of statistics are made up on the spot anyway %^b
I think a number of things are combined here... I just want to get a 
good handle on them and make sure we are doing the best we can.
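
The arithmetic can be reproduced from the table with a few lines of Python. This is just a sketch for checking the numbers; the variable names are made up, and the split assumes the recent window is days 0 through 4 (five days) and the earlier window is days 5 through 10 (six days).

```python
# Rule adjustments per day; the list index is "days ago"
adjustments = [356, 508, 391, 410, 410, 326, 309, 371, 292, 347, 309]

recent = adjustments[0:5]    # days 0 through 4
earlier = adjustments[5:11]  # days 5 through 10

recent_avg = sum(recent) / len(recent)     # 2075 / 5
earlier_avg = sum(earlier) / len(earlier)  # 1954 / 6

ratio = earlier_avg / recent_avg         # earlier rate relative to recent rate
increase = recent_avg / earlier_avg - 1  # recent rate relative to earlier rate

print(f"recent avg:  {recent_avg:.2f}")
print(f"earlier avg: {earlier_avg:.2f}")
print(f"earlier/recent: {ratio:.2%}")
print(f"increase: {increase:.1%}")
```

Read the other way around, the same figures put the recent daily average roughly 27% above the earlier one, in line with the "25% or more" estimate.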

I've noted, Matt, that your rulebase tuning parameters are set at the 
defaults. If you would like to adjust these to be more aggressive then 
please let me know off list (support@). More aggressive settings will 
keep more rules active in your rulebase at lower strengths and will 
also allow new rules more time to gain strength before being 
evaluated. Respectively the current defaults are:

Minimum Rule Strength: 1.0
Grace Period: 5 days.

Adjusting these settings can significantly increase the size of your 
rulebase file.

Best,
_M

-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================
