Re: [sniffer] Possible blip?

2004-05-21 Thread Pete McNeil


At 01:42 PM 5/21/2004, you wrote:
Pete,

Our Hold range has returned to more normal territory on Thursday. 
Here's the stats from 

One of my thoughts regarding
minimum rule strengths and grace periods is that all groups aren't
necessarily the same.  For instance Nigerian scams are low volume
and sporadic, and my system performs the worst on these things. 
Maybe lower rule strengths and longer grace periods makes much more sense
for the Phishing category than it does for many other categories for
instance.  Is that possible?
These are definitely some things to look at - great food for new research
projects.
There is a great diversity - luckily the scanning engine has a huge
amount of headroom so most of the time we don't need to tune things very
precisely. In any of the categories you mention we see some rules die
immediately, and others seem to live on forever - often without a great
deal of reason for either case.
The fact that your hold range returned after we adjusted the rule
strength calculation window is a good indication that the relevant tuning
parameter is minimum rule strength. I noted that the previous adjustment
(changing the window from 45 to 35 days) happened precisely one month
ago. This strongly suggested that we were seeing a "wave front"
of sorts pass through the tuning system - so on a hunch I put it back to
45. Your report helps to support this conjecture. 
The grace period value has the greatest effect early on in a rule's life
cycle and probably shouldn't be extended beyond about 10 days. The grace
period is designed to give a new rule time for its
rule strength to rise to the minimum threshold. After that it's all about
the performance of the rule. This sets up a competitive environment in
the system. Reaching a threshold of 1.0 currently requires that at least
19 messages fail on that rule within the analysis window and on one of
the systems that are providing logs for analysis. With about 110 logs
being consistently reported there are plenty of chances for 19 hits to
happen. 
[ an "ordinary" reporting system processes about 1300
messages per hour with sniffer spending about 190ms of computing time per
message (or about 7% of the available computing time). In 5 days, across
about 110 reporting systems, a rule has about 17,160,000 opportunities
(1300 x 24 x 5 x 110) to "kill" a message. To stay
alive, a rule need only achieve a kill about .00011655% (roughly one ten
thousandth of a percent) of the time. Of course, these numbers are a lot
like the average US family having 2.3 kids - ever seen .3 of a kid? ---
but the scale of the numbers seems right. ]
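For anyone who wants to check the scale of those numbers, here is a rough sketch of the arithmetic. It assumes the opportunities are counted across all 110 reporting systems over 5 days, and that the quoted .00011655% corresponds to 20 hits in that window (the hit count is my assumption; it is the value that reproduces the quoted percentage).

```python
# Back-of-the-envelope check of the figures quoted in the post.
MESSAGES_PER_HOUR = 1300   # per "ordinary" reporting system (from the post)
SCAN_SECONDS = 0.190       # sniffer computing time per message (from the post)
REPORTING_SYSTEMS = 110    # logs consistently reported (from the post)
WINDOW_DAYS = 5            # the 5-day period used in the post
HITS_NEEDED = 20           # assumption: reproduces the quoted percentage

# Share of each hour a single system spends inside sniffer.
cpu_fraction = MESSAGES_PER_HOUR * SCAN_SECONDS / 3600

# Messages scanned by all reporting systems during the window.
opportunities = MESSAGES_PER_HOUR * 24 * WINDOW_DAYS * REPORTING_SYSTEMS

# Kill rate, in percent, that yields HITS_NEEDED hits in the window.
required_rate_percent = HITS_NEEDED / opportunities * 100

print(f"CPU share per system:    {cpu_fraction:.1%}")
print(f"Opportunities in window: {opportunities:,}")
print(f"Required kill rate:      {required_rate_percent:.8f}%")
```

With 19 hits instead of 20 the rate comes out slightly lower, so the conclusion about scale does not change.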
It could be argued that if a rule can't account for at least that
many hits across 110 systems in 5 days then it's not going to be
missed... The counter to this argument is that the spammers are driving
toward diversity to make filtering systems of all types difficult to
train and maintain -- as you noted, half of the active rules in the
default configuration are in this very low strength range.
I also looked up the rule
strengths on your site and found that about 50%, or maybe more, have a
strength below 1, and maybe lowering that is worth testing out so long as
I don't massively increase the number of records.  I do think though
that I would like to test out extending the grace period.  Most of
my false positives are not on things that this would affect, and that
might give niche sources a little extra coverage if I understand things
correctly.
Possibly - but I think an adjustment in the minimum rule strength will
probably suffice given the sensitivity at that range. For example, if you
adjust your minimum rule strength to 0.8 then only 10 credited kills would
be required over a period of 5 days on 110 systems in order to push the
rule above the strength threshold. Thereafter it would remain in place
for at least 45 days (with the current settings) --- each of those days
providing another opportunity to increase or maintain its
strength...
There is also another mechanism at work here --- our core system scans
every presumed ham message one more time with every rule in the system
(min rule strength 0). The log from this scan is injected into the normal
analysis so that if a message matching a deactivated rule reaches our
system through any path the strength for that rule will be raised above
0.
The second stage of the reactivation process then kicks in because our
system normally scans messages with a minimum rule strength of 0.1 - so
any messages that were being missed will continue to rise in strength if
they are seen in any volume in our spam traps or submitted 
spam.
Once we see 20 instances every system will begin using the reactivated
rule... Some systems will begin even before that because they are using
more sensitive settings in their rulebases - this fact helps to
accelerate the process.
Anyway, a long story short - I think the first thing to try is adjusting
the Minimum Rule Strength. This is by far the most sensitive setting -
though the two do interact dynamically - es

Re: [sniffer] Possible blip?

2004-05-21 Thread Scott Fisher
Interesting.

Are you searching for 2 character pairs with GB2312?

Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 01:46PM >>>
Scott,

Regarding my Cyrillic and Chinese filters, I did a review of a full 
week's held spam, looking for foreign languages and patterns to tag.  I 
found from other research that the primary Chinese characterset, GB2312, 
contains the Western Latin characterset, and so someone could send an 
E-mail with this characterset defined and still have English as the 
message.  Because of this I do more than just look for the offending 
characterset, I've built a combo filter that looks for both high bit 
characters such as ¥ as well as body or header hits for encoding of 
GB2312 (Chinese/Korean) or Windows-1251 (Cyrillic).  I also have Declude 
END statements for appearances of US-ASCII and ISO-8859-1, so messages 
like this one that are referencing such patterns won't trip the filter.  
It seems to be stopping about 80% to 90% of the stuff, but I'm guessing 
that the stuff that is getting through didn't hit one of the high bit 
characters in my filter and I might need to simply expand my list a 
bit.  Unfortunately I have no idea what characters are most common, so 
I'm just eyeballing it from sources.

I had one false positive on a Yahoo Groups posting that referenced 
163.com, a Chinese free Web mail provider that inserts Chinese language 
footers.  The message was in English, but encoded in GB2312 and didn't 
indicate any sign of English besides the actual text.  Because of this, 
I might throw in an exception for the word "the " (followed by a space) 
just as a test to see if text in English is present, but I have to 
review that.  This message was also BASE64 encoded and that might be an 
appropriate exception???  The last pattern that I might look at is using 
the new MailPolice test for identifying Web-mail providers, and 
excepting them from the filter because they have issues with encoding 
languages I've found.

Hope this helps.

Matt



Scott Fisher wrote:

>2 thoughts from me:
>
>1. Right on on the Nigerian scams, possibly keeping these rules longer. As I was 
>forwarding out a Nigerian scam to the spam mailbox, I too wondered how long the 
>Nigerian rules were kept in play. I might also add Nigeria's twin sister the 
>International Lottery spam and Stock Spams might also be kept longer. I noticed an 
>increase in the Stock spams this week. 
>
>2. I've been tracking different character sets for a couple of weeks, the Chinese, 
>Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese 
>Headers.
>
>Scott Fisher
>Director of IT
>Farm Progress Companies
>
>  
>
[EMAIL PROTECTED] 05/21/04 12:42PM >>>


>Pete,
>
>Our Hold range has returned to more normal territory on Thursday.  
>Here's the stats from the week as a whole on what has been very 
>consistent traffic.  Out of all E-mail processed, both good and bad, the 
>%Hold represents what scored between 10-24 points on our system and 
>needed review, the %Sniffer represents all Sniffer hits except for Gray, 
>the %Spam is what we scanned and didn't deliver (generally about 99.8% 
>of spam is caught at a score of 10 which this is based on), and the 
>Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
>scoring 10 or more.
>
>Day    %Hold   %Sniffer   %Spam    Sniffer/Spam
>Mon:   1.86%   77.27%     80.37%   96.14%
>Tue:   2.83%   74.53%     79.37%   93.39%
>Wed:   2.13%   77.60%     79.66%   97.41%
>Thur:  1.95%   76.50%     80.66%   94.84%
>
>The only change that we made to our system was to add two smaller 
>domains later in the week, and we introduced filters for Cyrillic and 
>Chinese languages on Wednesday morning which have cut our hold file down 
>by 0.38 percentage points on Thursday, which explains how our %Hold is 
>lower than on Wednesday with a lower Sniffer hit rate on spam.
>
>I did note two high volume untagged static spammers on Tuesday that we 
>blacklisted locally, and that combined with the increase in Sniffer 
>change rates (spam storm) might account for the changes that I saw.  I 
>am wondering though about the recommendations that you have made for 
>possibly fine tuning our rule base.  Again though, please keep in mind 
>that I still feel that performance is overall very, very good.
>
>One of my thoughts regarding minimum rule strengths and grace periods is 
>that all groups aren't necessarily the same.  For instance Nigerian 
>scams are low volume and sporadic, and my system performs the worst on 
>these things.  Maybe lower rule strengths and longer grace periods makes 
>much more sense for the Phishing category than it does for many other 
>categories for instance.  Is that possible?
>
>I also looked up the rule strengths on your site and found that about 
>50%, or maybe more, have a strength below 1, and maybe lowering that is 
>worth testing out so

Re: [sniffer] Possible blip?

2004-05-21 Thread Matt




Scott,

Regarding my Cyrillic and Chinese filters, I did a review of a full
week's held spam, looking for foreign languages and patterns to tag.  I
found from other research that the primary Chinese characterset,
GB2312, contains the Western Latin characterset, and so someone could
send an E-mail with this characterset defined and still have English as
the message.  Because of this I do more than just look for the
offending characterset, I've built a combo filter that looks for both
high bit characters such as ¥ as well as body or header hits for
encoding of GB2312 (Chinese/Korean) or Windows-1251 (Cyrillic).  I also
have Declude END statements for appearances of US-ASCII and ISO-8859-1,
so messages like this one that are referencing such patterns won't trip
the filter.  It seems to be stopping about 80% to 90% of the stuff, but
I'm guessing that the stuff that is getting through didn't hit one of
the high bit characters in my filter and I might need to simply expand
my list a bit.  Unfortunately I have no idea what characters are most
common, so I'm just eyeballing it from sources.
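
For anyone who wants to experiment with the same idea outside of Declude, here is a minimal sketch of the combo test in Python. It is not my actual filter and not Declude syntax. The function names are made up for illustration, and it treats any byte of 0x80 or above as a "high bit character", which is broader than the hand-picked list described above.

```python
import re

# Charset labels named in the post, matched case-insensitively.
SUSPECT_CHARSETS = re.compile(r"gb2312|windows-1251", re.IGNORECASE)
# Labels that end the test early, mirroring the role of the END statements.
EXEMPT_CHARSETS = re.compile(r"us-ascii|iso-8859-1", re.IGNORECASE)


def has_high_bit(raw: bytes) -> bool:
    """True if any byte has its high bit set (simplification: any byte >= 0x80)."""
    return any(b >= 0x80 for b in raw)


def looks_foreign(raw: bytes) -> bool:
    """Combo test: a suspect charset label AND at least one high-bit byte."""
    text = raw.decode("latin-1")  # maps every byte to one character, never fails
    if EXEMPT_CHARSETS.search(text):
        return False
    return bool(SUSPECT_CHARSETS.search(text)) and has_high_bit(raw)


sample = b"Content-Type: text/plain; charset=GB2312\r\n\r\n\xc4\xe3\xba\xc3"
print(looks_foreign(sample))
```

Note that a BASE64 encoded body contains no high-bit bytes until it is decoded, so this sketch would not fire on such a message unless the body were decoded first.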

I had one false positive on a Yahoo Groups posting that referenced
163.com, a Chinese free Web mail provider that inserts Chinese language
footers.  The message was in English, but encoded in GB2312 and didn't
indicate any sign of English besides the actual text.  Because of this,
I might throw in an exception for the word "the " (followed by a space)
just as a test to see if text in English is present, but I have to
review that.  This message was also BASE64 encoded and that might be an
appropriate exception???  The last pattern that I might look at is
using the new MailPolice test for identifying Web-mail providers, and
excepting them from the filter because they have issues with encoding
languages I've found.

Hope this helps.

Matt



Scott Fisher wrote:

  2 thoughts from me:

1. Right on on the Nigerian scams, possibly keeping these rules longer. As I was forwarding out a Nigerian scam to the spam mailbox, I too wondered how long the Nigerian rules were kept in play. I might also add Nigeria's twin sister the International Lottery spam and Stock Spams might also be kept longer. I noticed an increase in the Stock spams this week. 

2. I've been tracking different character sets for a couple of weeks, the Chinese, Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese Headers.

Scott Fisher
Director of IT
Farm Progress Companies

  
  

  
[EMAIL PROTECTED] 05/21/04 12:42PM >>>

  

  
  Pete,

Our Hold range has returned to more normal territory on Thursday.  
Here's the stats from the week as a whole on what has been very 
consistent traffic.  Out of all E-mail processed, both good and bad, the 
%Hold represents what scored between 10-24 points on our system and 
needed review, the %Sniffer represents all Sniffer hits except for Gray, 
the %Spam is what we scanned and didn't deliver (generally about 99.8% 
of spam is caught at a score of 10 which this is based on), and the 
Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
scoring 10 or more.

Day    %Hold   %Sniffer   %Spam    Sniffer/Spam
Mon:   1.86%   77.27%     80.37%   96.14%
Tue:   2.83%   74.53%     79.37%   93.39%
Wed:   2.13%   77.60%     79.66%   97.41%
Thur:  1.95%   76.50%     80.66%   94.84%

The only change that we made to our system was to add two smaller 
domains later in the week, and we introduced filters for Cyrillic and 
Chinese languages on Wednesday morning which have cut our hold file down 
by 0.38 percentage points on Thursday, which explains how our %Hold is 
lower than on Wednesday with a lower Sniffer hit rate on spam.

I did note two high volume untagged static spammers on Tuesday that we 
blacklisted locally, and that combined with the increase in Sniffer 
change rates (spam storm) might account for the changes that I saw.  I 
am wondering though about the recommendations that you have made for 
possibly fine tuning our rule base.  Again though, please keep in mind 
that I still feel that performance is overall very, very good.

One of my thoughts regarding minimum rule strengths and grace periods is 
that all groups aren't necessarily the same.  For instance Nigerian 
scams are low volume and sporadic, and my system performs the worst on 
these things.  Maybe lower rule strengths and longer grace periods makes 
much more sense for the Phishing category than it does for many other 
categories for instance.  Is that possible?

I also looked up the rule strengths on your site and found that about 
50%, or maybe more, have a strength below 1, and maybe lowering that is 
worth testing out so long as I don't massively increase the number of 
records.  I do think though that I would like to test out extending the 
grace period.  Most of my false positives are not on things that this 
would affect, and that migh

Re: [sniffer] Possible blip?

2004-05-21 Thread Scott Fisher
2 thoughts from me:

1. Right on on the Nigerian scams, possibly keeping these rules longer. As I was 
forwarding out a Nigerian scam to the spam mailbox, I too wondered how long the 
Nigerian rules were kept in play. I might also add Nigeria's twin sister the 
International Lottery spam and Stock Spams might also be kept longer. I noticed an 
increase in the Stock spams this week. 

2. I've been tracking different character sets for a couple of weeks, the Chinese, 
Cyrillic and Korean look promising. I get false hits on Greek, Thai, and Vietnamese 
Headers.

Scott Fisher
Director of IT
Farm Progress Companies

>>> [EMAIL PROTECTED] 05/21/04 12:42PM >>>
Pete,

Our Hold range has returned to more normal territory on Thursday.  
Here's the stats from the week as a whole on what has been very 
consistent traffic.  Out of all E-mail processed, both good and bad, the 
%Hold represents what scored between 10-24 points on our system and 
needed review, the %Sniffer represents all Sniffer hits except for Gray, 
the %Spam is what we scanned and didn't deliver (generally about 99.8% 
of spam is caught at a score of 10 which this is based on), and the 
Sniffer/Spam is the percentage of Sniffer hits as a portion of messages 
scoring 10 or more.

Day    %Hold   %Sniffer   %Spam    Sniffer/Spam
Mon:   1.86%   77.27%     80.37%   96.14%
Tue:   2.83%   74.53%     79.37%   93.39%
Wed:   2.13%   77.60%     79.66%   97.41%
Thur:  1.95%   76.50%     80.66%   94.84%

The only change that we made to our system was to add two smaller 
domains later in the week, and we introduced filters for Cyrillic and 
Chinese languages on Wednesday morning which have cut our hold file down 
by 0.38 percentage points on Thursday, which explains how our %Hold is 
lower than on Wednesday with a lower Sniffer hit rate on spam.

I did note two high volume untagged static spammers on Tuesday that we 
blacklisted locally, and that combined with the increase in Sniffer 
change rates (spam storm) might account for the changes that I saw.  I 
am wondering though about the recommendations that you have made for 
possibly fine tuning our rule base.  Again though, please keep in mind 
that I still feel that performance is overall very, very good.

One of my thoughts regarding minimum rule strengths and grace periods is 
that all groups aren't necessarily the same.  For instance Nigerian 
scams are low volume and sporadic, and my system performs the worst on 
these things.  Maybe lower rule strengths and longer grace periods makes 
much more sense for the Phishing category than it does for many other 
categories for instance.  Is that possible?

I also looked up the rule strengths on your site and found that about 
50%, or maybe more, have a strength below 1, and maybe lowering that is 
worth testing out so long as I don't massively increase the number of 
records.  I do think though that I would like to test out extending the 
grace period.  Most of my false positives are not on things that this 
would affect, and that might give niche sources a little extra coverage 
if I understand things correctly.

I'll follow your directions and contact you directly regarding any 
affirmative changes, but I thought it might be beneficial to keep this 
discussion public since some other stats hounds might find this 
information to be of use :)

If you can glean anything from the numbers that I gave you, please add 
your thoughts.

Thanks,

Matt





Pete McNeil wrote:

> At 05:00 PM 5/19/2004, you wrote:
>
> 
>
>> I haven't yet upgraded to the most recent release, I'm still on the 
>> prior beta.  I'll probably do that this evening.  I tend to wait on 
>> upgrades until there has been enough time for bugs to surface unless 
>> I am already looking for a fix.  I'm sure that the extra verification 
>> of the rulebase will help prevent the potential of problems, and I 
>> guess this has the possibility of being caused by a bit of corrupted 
>> data, though that's probably reaching.
>
>
> There were no substantive changes from the beta to the production 
> version. Largely just a removal of monitoring code.
>
>> Again, regardless if there was a blip, Sniffer still does a wonderful 
>> job of tagging lots and lots of E-mail, just not quite as much as the 
>> day before.
>
>
> Last night I was able to adjust the rule strength analysis window back 
> to its original settings. About 5 days of data were lost - but those 
> days will be recovered quickly. Please let me know if this adjustment 
> improved your conditions.
>
> I've noted that on a number of other lists there seem to be posts 
> about a sudden increase in spam over the past few days. We are 
> definitely seeing this also - approximately a 25% or more increase in 
> new rule additions in the past 4 days:
>
> http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp 
>
> Specifically note from about 4 days ago...
>
>
>Days Ago Adjustments
> 

Re: [sniffer] Possible blip?

2004-05-21 Thread Matt




Pete,

Our Hold range has returned to more normal territory on Thursday. 
Here's the stats from the week as a whole on what has been very
consistent traffic.  Out of all E-mail processed, both good and bad,
the %Hold represents what scored between 10-24 points on our system and
needed review, the %Sniffer represents all Sniffer hits except for
Gray, the %Spam is what we scanned and didn't deliver (generally about
99.8% of spam is caught at a score of 10 which this is based on), and
the Sniffer/Spam is the percentage of Sniffer hits as a portion of
messages scoring 10 or more.

    Day      %Hold    %Sniffer    %Spam     Sniffer/Spam
    Mon:     1.86%    77.27%      80.37%    96.14%
    Tue:     2.83%    74.53%      79.37%    93.39%
    Wed:     2.13%    77.60%      79.66%    97.41%
    Thur:    1.95%    76.50%      80.66%    94.84%

The only change that we made to our system was to add two smaller
domains later in the week, and we introduced filters for Cyrillic and
Chinese languages on Wednesday morning which have cut our hold file
down by 0.38 percentage points on Thursday, which explains how our
%Hold is lower than on Wednesday with a lower Sniffer hit rate on
spam.

I did note two high volume untagged static spammers on Tuesday that we
blacklisted locally, and that combined with the increase in Sniffer
change rates (spam storm) might account for the changes that I saw.  I
am wondering though about the recommendations that you have made for
possibly fine tuning our rule base.  Again though, please keep in mind
that I still feel that performance is overall very, very good.

One of my thoughts regarding minimum rule strengths and grace periods
is that all groups aren't necessarily the same.  For instance Nigerian
scams are low volume and sporadic, and my system performs the worst on
these things.  Maybe lower rule strengths and longer grace periods
makes much more sense for the Phishing category than it does for many
other categories for instance.  Is that possible?

I also looked up the rule strengths on your site and found that about
50%, or maybe more, have a strength below 1, and maybe lowering that is
worth testing out so long as I don't massively increase the number of
records.  I do think though that I would like to test out extending the
grace period.  Most of my false positives are not on things that this
would affect, and that might give niche sources a little extra coverage
if I understand things correctly.

I'll follow your directions and contact you directly regarding any
affirmative changes, but I thought it might be beneficial to keep this
discussion public since some other stats hounds might find this
information to be of use :)

If you can glean anything from the numbers that I gave you, please add
your thoughts.

Thanks,

Matt





Pete McNeil wrote:
At 05:00 PM 5/19/2004, you wrote:

  I haven't yet upgraded to the most recent release, I'm still on the
  prior beta.  I'll probably do that this evening.  I tend to wait on
  upgrades until there has been enough time for bugs to surface unless
  I am already looking for a fix.  I'm sure that the extra verification
  of the rulebase will help prevent the potential of problems, and I
  guess this has the possibility of being caused by a bit of corrupted
  data, though that's probably reaching.

There were no substantive changes from the beta to the production
version. Largely just a removal of monitoring code.

  Again, regardless if there was a blip, Sniffer still does a wonderful
  job of tagging lots and lots of E-mail, just not quite as much as the
  day before.

Last night I was able to adjust the rule strength analysis window back
to its original settings. About 5 days of data were lost - but those days
will be recovered quickly. Please let me know if this adjustment
improved your conditions.

I've noted that on a number of other lists there seem to be posts about
a sudden increase in spam over the past few days. We are definitely
seeing this also - approximately a 25% or more increase in new rule
additions in the past 4 days:
  
  http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp
  
Specifically note from about 4 days ago...
  
  
  Days Ago Adjustments
 ---

0    356
1    508
2    391
3    410
4    410
5    326
6    309
7    371
8    292
9    347
10   309

  
  ( 5-10 : 1954/6 -> 325.67, 0-4 : 2075/5 -> 415, 325.67/415
-> 78.47 ) 
Note that day 0 is not complete. So applying a "fudge factor"
78.4 _looks like_ 75%.
Besides, 92% of statistics are made up on the spot anyway %^b
  I think a number of things are combined here... I just want to
get a
good handle on them and make sure we are doing the best we can.
  
I've noted, Matt, that your rulebase tuning parameters are set at the
defaults. If you would like to adjust these to be more aggressive then
please let me know off list (support@). More aggressive settings will
keep more rules active in yo

RE: [sniffer] Possible blip?

2004-05-20 Thread Pete McNeil



At 06:38 PM 5/20/2004, you wrote:
Crew,
 
I reported this speed issue
before, but despite very intensive debugging and testing, we have not
found an external cause (meaning: not sniffer) for the following:


 
But, now comes the big mystery:
when persistent mode is ON, it takes a lot more time to execute (while
max polling is only 50ms!)
 
0,"2004-05-20 23:48:41",md5845373.msg,827,812,15,0,0,0,3607,1
0,"2004-05-20 23:48:52",md5845374.msg,842,812,0,0,0,0,3833,1
0,"2004-05-20 23:51:15",md5845375.msg,936,874,0,0,0,0,9560,1
0,"2004-05-20 23:51:35",md5845376.msg,889,859,15,0,0,0,26387,0
0,"2004-05-20 23:53:21",md5845377.msg,937,922,0,15,0,15,1922,0
 
Which averages at 850 ms! While
I expected 45 + 25 ms (to compensate for average waiting time) = 70
ms!
 
Pete, could you please check
why this is happening (particularly in code OUTSIDE what's measured and
logged)? If you can't find anything, I'll ask my colleague to come up with
a timing program, which I would like to release on this list so other ppl
can check how long it really takes to execute sniffer (measured from 'the
outside').
As I recall when this last came up the solution turned out to be an
on-access virus scanner that was introducing the extra delays. Turning
off and/or adjusting the on-access virus scanner solved the timing
problem.
The theory goes that the MDaemon CF is single threaded so when Sniffer
runs normally there will only be one instance at once, and as a result
each instance loads its own rulebase and scans its own message... this
results in two file reads and no write operations.
With the persistent sniffer instance running as a server, there are
several additional file creation, write, and access events per message.
Each causes the on-access scanner to intervene and thereby introduce the
additional timing delays. The "transparent" way on-access virus
scanners interfere with file operations accounts for the odd placement of
the additional time.
... as I said, in theory ...
Hope this helps,
_M




RE: [sniffer] Possible blip?

2004-05-20 Thread Michiel Prins



Crew,
 
I reported this speed issue before, but despite very 
intensive debugging and testing, we have not found an external cause (meaning: 
not sniffer) for the following:
 
When I use sniffer without the persistent flag, I get 
this log:
 
h0t861s4 20040520214718 md5845369.msg 125 16 Clean 0 0 0 2844 40
h0t861s4 20040520214718 md5845370.msg 110 15 Clean 0 0 0 2747 36
h0t861s4 20040520214804 md5845371.msg 109 16 Match 109406 62 43 93 43
h0t861s4 20040520214804 md5845371.msg 109 16 Match 115560 58 2286 2307 43
h0t861s4 20040520214804 md5845371.msg 109 16 Final 115560 58 0 3580 43
h0t861s4 20040520214825 md5845372.msg 110 15 Match 29048 52 2757 2788 46
h0t861s4 20040520214825 md5845372.msg 110 15 Match 122523 52 2930 2942 46
h0t861s4 20040520214825 md5845372.msg 110 15 Match 122017 52 2968 2977 46
h0t861s4 20040520214825 md5845372.msg 110 15 Match 122016 52 3346 3355 46
h0t861s4 20040520214825 md5845372.msg 110 15 Final 29048 52 0 5504 46
 
which 
looks good (total execution time about 125ms)
 
When I 
have a persistent version running (max 50 ms polling time), I 
get:
 
h0t861s4 20040520214841 md5845373.msg 0 16 Clean 0 0 0 3597 53
h0t861s4 20040520214852 md5845374.msg 16 31 Match 119377 62 684 741 38
h0t861s4 20040520214852 md5845374.msg 16 31 Final 119377 62 0 3810 38
h0t861s4 20040520215115 md5845375.msg 0 31 Match 29081 63 2413 2432 44
h0t861s4 20040520215115 md5845375.msg 0 31 Final 29081 63 0 9458 44
h0t861s4 20040520215134 md5845376.msg 0 94 Clean 0 0 0 24370 42
h0t861s4 20040520215320 md5845377.msg 47 15 Clean 0 0 0 1945 35
Which 
are very good exec times (average 45 ms). 
 
We 
have created our own program that does lots of spam checking for messages. At 
some point, it fires Sniffer. We log the time it takes for Sniffer to run, for 
statistical purposes. When sniffer is NOT persistent, I get the following log 
snippet (same messages as 1st sniffer log above, the second number after the 
.msg is the time it takes for sniffer to run):
 
0,"2004-05-20 23:47:18",md5845369.msg,172,157,0,15,15,0,43406,2
0,"2004-05-20 23:47:18",md5845370.msg,172,156,16,0,0,0,43309,2
0,"2004-05-20 23:48:04",md5845371.msg,188,172,0,15,0,15,3578,1
0,"2004-05-20 23:48:25",md5845372.msg,186,156,14,0,0,0,5572,1
Average time to run sniffer is 160 ms (sniffer said 125 ms). That means, 
sniffer can't report about 35 ms which is normal for application 
startup and shutdown (also the log is written _after_ the exec time calculation 
has been made, file operations also take time).
 
But, 
now comes the big mystery: when persistent mode is ON, it takes a lot more time 
to execute (while max polling is only 50ms!)
 
0,"2004-05-20 23:48:41",md5845373.msg,827,812,15,0,0,0,3607,1
0,"2004-05-20 23:48:52",md5845374.msg,842,812,0,0,0,0,3833,1
0,"2004-05-20 23:51:15",md5845375.msg,936,874,0,0,0,0,9560,1
0,"2004-05-20 23:51:35",md5845376.msg,889,859,15,0,0,0,26387,0
0,"2004-05-20 23:53:21",md5845377.msg,937,922,0,15,0,15,1922,0
 
Which 
averages at 850 ms! While I expected 45 + 25 ms (to compensate for average 
waiting time) = 70 ms!
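 
To make the comparison reproducible, here is a small sketch that recomputes both averages from the two log snippets in this message. It assumes only what is stated here: that the second number after the .msg name is the time our program measured for the sniffer call. The meaning of the other columns is not needed, and the function name is just illustrative.

```python
import csv
from statistics import mean

# Log lines copied from the two snippets in this message.
NON_PERSISTENT = '''
0,"2004-05-20 23:47:18",md5845369.msg,172,157,0,15,15,0,43406,2
0,"2004-05-20 23:47:18",md5845370.msg,172,156,16,0,0,0,43309,2
0,"2004-05-20 23:48:04",md5845371.msg,188,172,0,15,0,15,3578,1
0,"2004-05-20 23:48:25",md5845372.msg,186,156,14,0,0,0,5572,1
'''

PERSISTENT = '''
0,"2004-05-20 23:48:41",md5845373.msg,827,812,15,0,0,0,3607,1
0,"2004-05-20 23:48:52",md5845374.msg,842,812,0,0,0,0,3833,1
0,"2004-05-20 23:51:15",md5845375.msg,936,874,0,0,0,0,9560,1
0,"2004-05-20 23:51:35",md5845376.msg,889,859,15,0,0,0,26387,0
0,"2004-05-20 23:53:21",md5845377.msg,937,922,0,15,0,15,1922,0
'''


def sniffer_times(log_text):
    """Return the second number after the .msg field of every log line (ms)."""
    rows = csv.reader(log_text.strip().splitlines())
    # Columns: 0 = leading flag, 1 = timestamp, 2 = file, 3 = first number,
    # 4 = second number (the sniffer run time according to this message).
    return [int(row[4]) for row in rows]


non_persistent_avg = mean(sniffer_times(NON_PERSISTENT))
persistent_avg = mean(sniffer_times(PERSISTENT))

print(f"non-persistent: {non_persistent_avg:.1f} ms")
print(f"persistent:     {persistent_avg:.1f} ms")
```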
 
Pete, 
could you please check why this is happening (particularly in code OUTSIDE 
what's measured and logged)? If you can't find anything, I'll ask my colleague to 
come up with a timing program, which I would like to release on this list so 
other ppl can check how long it really takes to execute sniffer (measured from 
'the outside').
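 
As a rough idea of what such a timing program could look like, here is a minimal sketch. It is not the program we use. It simply runs whatever command line it is given a few times and reports the wall-clock time of each run, so startup and shutdown overhead are included. The function name is made up, and the actual sniffer command line (executable name, arguments) is left for the user to supply.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv several times and return each wall-clock duration in ms."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time is of interest.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


if __name__ == "__main__":
    # Pass the scanner command line as arguments. With no arguments the
    # script times a do-nothing Python process as a baseline instead.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    results = time_command(command)
    for ms in results:
        print(f"{ms:.0f} ms")
    print(f"average: {sum(results) / len(results):.0f} ms")
```

Comparing the baseline run with a real scanner run gives an idea of how much of the measured time is plain process startup.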
 
Regards,

 
ing. Michiel Prins
SOS Small Office Solutions / REJECT
Wannepad 27
1066 HW  Amsterdam
tel. 020-4082627
fax. 020-4082628

[EMAIL PROTECTED]

Spam-free business e-mail? reject.nl!

Consultancy - Installation - Maintenance
Network Security - Project Management
Software Development - Internet - E-mail


Re: [sniffer] Possible blip?

2004-05-20 Thread Pete McNeil


At 05:00 PM 5/19/2004, you wrote:

I haven't yet upgraded to the
most recent release, I'm still on the prior beta.  I'll probably do
that this evening.  I tend to wait on upgrades until there has been
enough time for bugs to surface unless I am already looking for a
fix.  I'm sure that the extra verification of the rulebase will help
prevent the potential of problems, and I guess this has the possibility
of being caused by a bit of corrupted data, though that's probably
reaching.
There were no substantive changes from the beta to the production
version. Largely just a removal of monitoring code.
Again, regardless if there was a
blip, Sniffer still does a wonderful job of tagging lots and lots of
E-mail, just not quite as much as the day before.
Last night I was able to adjust the rule strength analysis window back to
its original settings. About 5 days of data were lost - but those days
will be recovered quickly. Please let me know if this adjustment improved
your conditions.
I've noted that on a number of other lists there seem to be posts about a
sudden increase in spam over the past few days. We are definitely seeing
this also - approximately a 25% or more increase in new rule additions in
the past 4 days:
http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp
Specifically note from about 4 days ago...

Days Ago  Adjustments
--------  -----------
0         356
1         508
2         391
3         410
4         410
5         326
6         309
7         371
8         292
9         347
10        309

( days 5-10 : 1954/6 -> 325.67, days 0-4 : 2075/5 -> 415, 325.67/415
-> 78.47% ) 
Note that day 0 is not complete. So applying a "fudge factor"
78.4 _looks like_ 75%.
Besides, 92% of statistics are made up on the spot anyway %^b
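
The averages quoted above can be re-derived from the table. The following is only a minimal sketch: the variable names are made up for illustration, and the grouping assumes the first figure covers days 5-10 (six values) and the second covers days 0-4 (five values), which is what the sums 1954 and 2075 imply.

```python
# Daily rule adjustment counts copied from the table above (index = days ago).
adjustments = [356, 508, 391, 410, 410, 326, 309, 371, 292, 347, 309]

recent = adjustments[0:5]    # days 0-4 (day 0 is still incomplete)
earlier = adjustments[5:11]  # days 5-10

recent_avg = sum(recent) / len(recent)
earlier_avg = sum(earlier) / len(earlier)

# Earlier average as a fraction of the recent average (the 78.47 figure).
ratio = earlier_avg / recent_avg
# The same comparison expressed as growth of the recent period.
increase = recent_avg / earlier_avg - 1

print(f"days 5-10 average: {earlier_avg:.2f}")
print(f"days 0-4 average:  {recent_avg:.2f}")
print(f"earlier/recent:    {ratio * 100:.2f}%")
print(f"increase:          {increase * 100:.1f}%")
```

Read the other way around, the recent average is roughly 27% above the earlier one, which fits "approximately a 25% or more increase".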
I think a number of things are combined here... I just want to get a
good handle on them and make sure we are doing the best we can.
I've noted, Matt, that your rulebase tuning parameters are set at the
defaults. If you would like to adjust these to be more aggressive then
please let me know off list (support@). More aggressive settings will
keep more rules active in your rulebase at lower strengths and will also
allow new rules more time to gain strength before being evaluated.
Respectively the current defaults are:
Minimum Rule Strength: 1.0
Grace Period: 5 days.
Adjusting these settings can significantly increase the size of your
rulebase file.
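
As a rough illustration of how these two parameters interact, here is a minimal sketch. It is not the actual Message Sniffer tuning code, which is not shown on this list. The Rule class, its fields, and the is_active function are hypothetical; the only things taken from the message are the two defaults and the idea that a new rule is given time to gain strength before it is evaluated against the minimum.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str        # hypothetical identifier, for illustration only
    age_days: float  # days since the rule was added
    strength: float  # current rule strength


def is_active(rule: Rule, min_strength: float = 1.0, grace_days: float = 5) -> bool:
    """Assumed retention test: keep a rule while it is inside its grace
    period; after that, keep it only if its strength meets the minimum."""
    if rule.age_days <= grace_days:
        return True
    return rule.strength >= min_strength


rules = [
    Rule("new-weak", age_days=2, strength=0.1),
    Rule("old-weak", age_days=30, strength=0.4),
    Rule("old-strong", age_days=30, strength=3.2),
]

# Default settings versus a more aggressive (lower minimum, longer grace) setting.
default_set = [r.name for r in rules if is_active(r)]
aggressive_set = [r.name for r in rules if is_active(r, min_strength=0.3, grace_days=10)]
print(default_set)
print(aggressive_set)
```

Under this assumed model, lowering the minimum strength or lengthening the grace period can only keep more rules active, which is consistent with the warning that the rulebase file may grow.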
Best,
_M




Re: [sniffer] Possible blip?

2004-05-19 Thread Matt
Pete,
I was judging based on the size of our Hold range which scores from 
10-24.  On Monday that was 1.86% of total traffic, but on Tuesday that 
was 2.83%.  Message volume was hardly different.  Other notables were 
that on Monday, Sniffer hit 77.27% of all E-mail but on Tuesday it hit 
74.53% (both exclude Gray hits).  Our overall spam percentage was about 
82% on Monday and 81% on Tuesday.  I did also see a drop in XBL hits, 
which are primarily zombies, from 38.14% to 34.93%.  I've always found 
static spammers to be much more problematic because they lack many 
spammy patterns, and it could be that there was a wave of them that came 
online yesterday which could account for the difference.

I don't want to make a huge deal out of this, but I noted the drop in 
size from one rulebase to another and thought that might be significant, 
and I like to be aware of what is going on.  In reality though the 
difference in percentages in our Hold file meant manually reviewing 50% 
more E-mails, or about 500 extra messages.  With everything else 
consistent, I figured it was worth a post just to check.
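
The "50% more" estimate follows from the two Hold percentages, assuming (as stated above) that total message volume was roughly the same on both days. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Hold range share of total traffic, as reported above.
hold_monday = 1.86   # percent of total traffic
hold_tuesday = 2.83  # percent of total traffic

# With near-constant volume, the relative growth in Hold messages
# is just the ratio of the two shares.
relative_increase = hold_tuesday / hold_monday - 1

# Sniffer hit rate change, in percentage points.
sniffer_drop_points = 77.27 - 74.53

print(f"Hold range growth: {relative_increase * 100:.1f}%")
print(f"Sniffer hit rate drop: {sniffer_drop_points:.2f} points")
```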

I do recall an old posting where you indicated that you were going to 
drop the expiration down to 5 days under a certain number of hits.  My 
thought there is that while it does present some savings in processing, 
it might make more sense to do a 7-8 day expiration in order to help 
catch spammers that are on weekly schedules, primarily lower volume 
niche spammers.  Unfortunately I can't compare my current results 
accurately to the pre-change data because the makeup of my traffic has 
changed significantly over that time frame.

Another possibility is that our Chinese language spam might have been 
extra heavy.  I've brought in much more of that recently from a couple 
different clients and it regularly scores low, probably because it's 
difficult to determine if most of it is spam.  I do know that Sniffer 
doesn't do nearly as well with this stuff.  I've noticed that these guys 
are spamming mostly during Chinese business hours, and they might have 
been extra light on Monday due to the lag in hours coming from a 
weekend.  If you are interested in getting these caught messages 
forwarded to you in an automated fashion for study or for potential 
inclusion, just let me know.  I also have a filter set up for Russian 
language E-mail, but it is not nearly as high in volume (now).

Regarding when I saw the changes in the rule base, I was pulling an 
all-nighter for server administration and noticed this around 5 a.m. 
when I ran the stats program on my Declude logs.  The renamed 'old' 
rulebase was just over 4 MB while the active one was 4.7 MB, then at 
about noon I noticed it was about 4.3 MB, and now it's back up over 4.7 
MB (1,000 KB = 1 MB in these stats if that matters).
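
For comparison with the "as much as 20%" swing in rulebase size that Pete mentions in his reply quoted further down the thread, the size changes can be put in percentage terms. This is only a sketch: 4.0 MB stands in for "just over 4 MB", so the results are approximate, and percent_change is a helper defined here for illustration.

```python
def percent_change(old_mb: float, new_mb: float) -> float:
    """Relative size change from old_mb to new_mb, in percent."""
    return (new_mb - old_mb) / old_mb * 100


# Approximate rulebase sizes reported above.
growth = percent_change(4.0, 4.7)  # archived 'old' rulebase -> active one at 5 a.m.
dip = percent_change(4.7, 4.3)     # 5 a.m. -> about noon

print(f"overnight growth: {growth:.1f}%")
print(f"midday dip:       {dip:.1f}%")
print("within a 20% swing:", abs(growth) <= 20 and abs(dip) <= 20)
```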

I haven't yet upgraded to the most recent release, I'm still on the 
prior beta.  I'll probably do that this evening.  I tend to wait on 
upgrades until there has been enough time for bugs to surface unless I 
am already looking for a fix.  I'm sure that the extra verification of 
the rulebase will help prevent the potential of problems, and I guess 
this has the possibility of being caused by a bit of corrupted data, 
though that's probably reaching.

Again, regardless if there was a blip, Sniffer still does a wonderful 
job of tagging lots and lots of E-mail, just not quite as much as the 
day before.

Thanks,
Matt


Re: [sniffer] Possible blip?

2004-05-19 Thread Pete McNeil
At 12:57 PM 5/19/2004, you wrote:
Pete,
I noted late last night that my rulebase grew by 700 KB over the size of 
the previous one that was archived on my machine, and also the hits for 
some of the tests were noticeably lower and I had a definite increase in 
the number of messages that scored in my Hold range (instead of scoring 
higher and landing in Drop).  This morning though the size of my rulebase 
again dropped by about 450 KB.

I was just wondering if this might have been a hiccup with a bad 
compilation or maybe you were testing something out?
We didn't have anything under test that would alter the rulebases. I'm 
going to dig through the logs and see if there's anything I can identify.

If the rulebase was corrupted in any way you would have been able to detect 
that with the latest snf2check utility.

It's not unusual for rulebase sizes to change by as much as 20%. The system 
is constantly activating and deactivating rules based on new log files that 
are reported. Currently a significant change might occur once per day - 
though we are working on new analysis engines that will permit more 
frequent rule strength adjustments.

For example, we might add 300-900 rules over the course of a day - then 
have that many (or more) removed when the new rule strength numbers are 
calculated.

Another factor that impacts rulebase size is the content of the rules. The 
folding process is not deterministic so it is possible for a few rule 
changes to significantly alter the way the rulebase file is folded. This is 
less likely to be the change but it is possible.

What was the date on the archive you used to compare sizes?
_M
This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html


[sniffer] Possible blip?

2004-05-19 Thread Matt
Pete,
I noted late last night that my rulebase grew by 700 KB over the size of 
the previous one that was archived on my machine, and also the hits for 
some of the tests were noticeably lower and I had a definite increase in 
the number of messages that scored in my Hold range (instead of scoring 
higher and landing in Drop).  This morning though the size of my 
rulebase again dropped by about 450 KB.

I was just wondering if this might have been a hiccup with a bad 
compilation or maybe you were testing something out?

Thanks,
Matt
--
=
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=
