Interesting.

Are you searching for two-character pairs with GB2312?

Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 01:46PM >>>
Scott,

Regarding my Cyrillic and Chinese filters, I did a review of a full 
week's held spam, looking for foreign languages and patterns to tag.  I 
found from other research that the primary Chinese character set, GB2312, 
contains the Western Latin character set, so someone could send an 
E-mail with this character set declared and still have English as the 
message.  Because of this I do more than just look for the offending 
character set: I've built a combo filter that looks for both high-bit 
characters such as ¥ and body or header hits for an encoding of 
GB2312 (Simplified Chinese) or Windows-1251 (Cyrillic).  I also have 
Declude END statements for appearances of US-ASCII and ISO-8859-1, so 
messages like this one that reference such patterns won't trip the 
filter.  It seems to be stopping about 80% to 90% of the stuff, but I'm 
guessing that the stuff that is getting through didn't hit one of the 
high-bit characters in my filter, and I might simply need to expand my 
list a bit.  Unfortunately I have no idea which characters are most 
common, so I'm just eyeballing it from sources.
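
For anyone who wants to experiment with the same idea outside of Declude, 
here is a rough Python sketch of the combo logic described above.  It is 
not Declude filter syntax and not the actual filter; the function name 
combo_filter_hit and the two label lists are made up for illustration, 
and it treats any non-ASCII character as "high bit" rather than using a 
hand-picked list.

```python
import re

# Charset labels that arm the filter, and labels that end it early
# (mirroring the END statements described above). Both lists are
# illustrative assumptions, not the real filter contents.
SUSPECT_CHARSETS = ("gb2312", "windows-1251")
SAFE_CHARSETS = ("us-ascii", "iso-8859-1")

# Assumption: any character outside 7-bit ASCII counts as "high bit".
HIGH_BIT = re.compile(r"[^\x00-\x7f]")


def combo_filter_hit(headers: str, body: str) -> bool:
    """Return True only when a suspect charset label AND a high-bit
    character are both present, and no Western charset label appears."""
    text = (headers + "\n" + body).lower()
    # Stop early if a Western charset label shows up anywhere.
    if any(label in text for label in SAFE_CHARSETS):
        return False
    has_suspect_label = any(label in text for label in SUSPECT_CHARSETS)
    has_high_bit = HIGH_BIT.search(body) is not None
    return has_suspect_label and has_high_bit
```

Requiring both conditions is what keeps a plain English message that is 
merely labeled GB2312 from being tagged.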

I had one false positive on a Yahoo Groups posting that referenced 
163.com, a Chinese free Web mail provider that inserts Chinese-language 
footers.  The message was in English, but it was encoded in GB2312 and 
gave no sign of being English besides the actual text.  Because of this, 
I might throw in an exception for the word "the " (followed by a space) 
just as a test to see if English text is present, but I have to 
review that.  This message was also BASE64 encoded, and that might be an 
appropriate exception?  The last pattern that I might look at is using 
the new MailPolice test for identifying Web-mail providers and 
excepting them from the filter, because I've found that they have 
issues with encoding languages.
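
A hedged Python sketch of the "the " test follows.  Note that it takes a 
slightly different route from the one floated above: instead of exempting 
BASE64 messages outright, it assumes the body can be decoded first and 
then checked for English.  The function name looks_english and its 
base64_encoded flag are hypothetical.

```python
import base64
import binascii


def looks_english(body: str, base64_encoded: bool = False) -> bool:
    """Return True if the (decoded) body contains 'the ' with a trailing space."""
    if base64_encoded:
        try:
            raw = base64.b64decode(body)
        except (binascii.Error, ValueError):
            # Undecodable payload: treat as "no English found".
            return False
        # GB2312 keeps ASCII bytes as-is, so English text survives decoding;
        # bytes that are not valid GB2312 are replaced rather than raising.
        text = raw.decode("gb2312", errors="replace")
    else:
        text = body
    return "the " in text.lower()
```

A single common word is a weak signal, so this would only make sense as 
an exception inside the combo filter, not as a test on its own.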

Hope this helps.

Matt



Scott Fisher wrote:

>2 thoughts from me:
>
>1. Right on about the Nigerian scams, and possibly keeping these rules longer. As I was 
>forwarding a Nigerian scam to the spam mailbox, I too wondered how long the 
>Nigerian rules were kept in play. I might add that Nigeria's twin sisters, the 
>International Lottery spam and Stock spams, might also be kept longer. I noticed an 
>increase in the Stock spams this week. 
>
>2. I've been tracking different character sets for a couple of weeks; the Chinese, 
>Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese 
>headers.
>
>Scott Fisher
>Director of IT
>Farm Progress Companies
>
>  
>
>>>>[EMAIL PROTECTED] 05/21/04 12:42PM >>>
>>>>        
>>>>
>Pete,
>
>Our Hold range returned to more normal territory on Thursday.  
>Here are the stats for the week as a whole on what has been very 
>consistent traffic.  Out of all E-mail processed, both good and bad, 
>%Hold is what scored between 10 and 24 points on our system and 
>needed review, %Sniffer is all Sniffer hits except for Gray, 
>%Spam is what we scanned and didn't deliver (this is based on a score 
>of 10, which generally catches about 99.8% of spam), and 
>Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
>scoring 10 or more.
>
>    Day      %Hold    %Sniffer    %Spam    Sniffer/Spam
>    Mon:     1.86%     77.27%     80.37%     96.14%
>    Tue:     2.83%     74.53%     79.37%     93.39%
>    Wed:     2.13%     77.60%     79.66%     97.41%
>    Thur:    1.95%     76.50%     80.66%     94.84%
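
The Sniffer/Spam column appears to be %Sniffer divided by %Spam.  A quick 
Python check under that assumption, using only the rounded figures from 
the table (the name sniffer_share is just for illustration):

```python
# (day, %Sniffer, %Spam) copied from the table above.
ROWS = [
    ("Mon", 77.27, 80.37),
    ("Tue", 74.53, 79.37),
    ("Wed", 77.60, 79.66),
    ("Thur", 76.50, 80.66),
]


def sniffer_share(pct_sniffer: float, pct_spam: float) -> float:
    """Sniffer hits as a percentage of messages scoring 10 or more."""
    return round(100.0 * pct_sniffer / pct_spam, 2)


for day, pct_sniffer, pct_spam in ROWS:
    print(f"{day}: {sniffer_share(pct_sniffer, pct_spam):.2f}%")
```

Monday, Wednesday and Thursday reproduce the listed values.  Tuesday 
comes out near 93.90% from the rounded columns rather than the listed 
93.39%, which may simply mean the original was computed from raw counts.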
>
>The only changes that we made to our system were to add two smaller 
>domains later in the week and to introduce filters for Cyrillic and 
>Chinese languages on Wednesday morning.  The filters cut our hold file 
>by 0.38 percentage points on Thursday, which explains how our %Hold is 
>lower than on Wednesday even with a lower Sniffer hit rate on spam.
>
>I did note two high volume untagged static spammers on Tuesday that we 
>blacklisted locally, and that combined with the increase in Sniffer 
>change rates (spam storm) might account for the changes that I saw.  I 
>am wondering though about the recommendations that you have made for 
>possibly fine tuning our rule base.  Again though, please keep in mind 
>that I still feel that performance is overall very, very good.
>
>One of my thoughts regarding minimum rule strengths and grace periods is 
>that all groups aren't necessarily the same.  For instance Nigerian 
>scams are low volume and sporadic, and my system performs the worst on 
>these things.  Maybe lower rule strengths and longer grace periods make 
>much more sense for the Phishing category than they do for many other 
>categories, for instance.  Is that possible?
>
>I also looked up the rule strengths on your site and found that about 
>50%, or maybe more, have a strength below 1, and maybe lowering that is 
>worth testing out so long as I don't massively increase the number of 
>records.  I do think though that I would like to test out extending the 
>grace period.  Most of my false positives are not on things that this 
>would affect, and that might give niche sources a little extra coverage 
>if I understand things correctly.
>
>I'll follow your directions and contact you directly regarding any 
>affirmative changes, but I thought it might be beneficial to keep this 
>discussion public since some other stats hounds might find this 
>information to be of use :)
>
>If you can glean anything from the numbers that I gave you, please add 
>your thoughts.
>
>Thanks,
>
>Matt
>
>
>
>
>
>Pete McNeil wrote:
>
>  
>
>>At 05:00 PM 5/19/2004, you wrote:
>>
>><snip/>
>>
>>    
>>
>>>I haven't yet upgraded to the most recent release, I'm still on the 
>>>prior beta.  I'll probably do that this evening.  I tend to wait on 
>>>upgrades until there has been enough time for bugs to surface unless 
>>>I am already looking for a fix.  I'm sure that the extra verification 
>>>of the rulebase will help prevent the potential of problems, and I 
>>>guess this has the possibility of being caused by a bit of corrupted 
>>>data, though that's probably reaching.
>>>      
>>>
>>There were no substantive changes from the beta to the production 
>>version. Largely just a removal of monitoring code.
>>
>>    
>>
>>>Again, regardless if there was a blip, Sniffer still does a wonderful 
>>>job of tagging lots and lots of E-mail, just not quite as much as the 
>>>day before.
>>>      
>>>
>>Last night I was able to adjust the rule strength analysis window back 
>>to its original settings. About 5 days of data were lost - but those 
>>days will be recovered quickly. Please let me know if this adjustment 
>>improved your conditions.
>>
>>I've noted that on a number of other lists there seem to be posts 
>>about a sudden increase in spam over the past few days. We are 
>>definitely seeing this also - approximately a 25% or more increase in 
>>new rule additions in the past 4 days:
>>
>>http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp 
>>
>>Specifically note from about 4 days ago...
>>
>>
>>Days Ago Adjustments
>>-------- -----------
>>
>>0        356
>>1        508
>>2        391
>>3        410
>>4        410
>>5        326
>>6        309
>>7        371
>>8        292
>>9        347
>>10       309
>>
>> 
>>
>>( 5-10 : 1954/6 -> 325.67, 0-4 : 2075/5 -> 415, 325.67/415 -> 78.47 )
>>Note that day 0 is not complete. So applying a "fudge factor" 78.4 
>>_looks like_ 75%.
>>Besides, 92% of statistics are made up on the spot anyway %^b
>>I think a number of things are combined here... I just want to get a 
>>good handle on them and make sure we are doing the best we can.
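
The arithmetic in the parenthetical can be reproduced from the table 
above with a few lines of Python.  This is only a sketch; the names 
ADJUSTMENTS and mean_over are invented for the example, and the recent 
window is taken to be the five days 0 through 4.

```python
# Rule adjustments per day, keyed by "days ago", copied from the table.
ADJUSTMENTS = {
    0: 356, 1: 508, 2: 391, 3: 410, 4: 410, 5: 326,
    6: 309, 7: 371, 8: 292, 9: 347, 10: 309,
}


def mean_over(days):
    """Average number of adjustments over the given days."""
    values = [ADJUSTMENTS[d] for d in days]
    return sum(values) / len(values)


older = mean_over(range(5, 11))   # days 5-10: 1954 / 6
recent = mean_over(range(0, 5))   # days 0-4:  2075 / 5
ratio = older / recent            # older average as a fraction of recent
increase = recent / older - 1     # relative increase of recent over older

print(f"older={older:.2f} recent={recent:.2f} "
      f"ratio={ratio:.2%} increase={increase:.1%}")
```

The older average is about 78.5% of the recent one, which is the same as 
saying the recent average is roughly 27% higher, in line with the 
"25% or more" estimate even though day 0 is incomplete.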
>>
>>I've noted, Matt, that your rulebase tuning parameters are set at the 
>>defaults. If you would like to adjust these to be more aggressive then 
>>please let me know off list (support@). More aggressive settings will 
>>keep more rules active in your rulebase at lower strengths and will 
>>also allow new rules more time to gain strength before being 
>>evaluated. Respectively the current defaults are:
>>
>>Minimum Rule Strength: 1.0
>>Grace Period: 5 days.
>>
>>Adjusting these settings can significantly increase the size of your 
>>rulebase file.
>>
>>Best,
>>_M
>>
>>    
>>
>
>  
>

-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/ 
=====================================================


This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html
