[sniffer] Re: Slow processing times, errors
On 2013-06-27 20:01, Matt wrote: I'm attaching a snippet of my log. About 100 lines past the start, where you see a smattering of error messages, you then see a large block of them while the Sniffer service is restarting, and then after that no errors at all. There have in fact been no errors at all in the several hours since this restart of Sniffer.

I can promise you that the error you're reporting comes directly from a problem with the file system. Here is the code where that error is generated. Put simply, the code tries to open the file and determine its size. If that doesn't work it throws the ERROR_MSG_FILE exception in one of two forms -- that's what ends up in the log.

    try {                                                   // Try opening the message file.
        MessageFile.open(MessageFilePath.c_str(),
                         ios::in | ios::binary);            // Open the file, binary mode.
        MessageFile.seekg(0, ios::end);                     // Find the end of the file,
        MessageFileSize = MessageFile.tellg();              // read that position as the size,
        MessageFile.seekg(0, ios::beg);                     // then go back to the beginning.
        MyScanData.ScanSize = MessageFileSize;              // Capture the message file size.
    }
    catch(...) {                                            // Trouble? Throw FileError.
        MyRulebase->MyLOGmgr.logThisError(                  // Log the error.
            MyScanData, "scanMessageFile().open",
            snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() Open/Seek");
    }

    if(0 == MessageFileSize) {                              // Handle zero length files.
        MessageFile.close();                                // No need to keep this open.
        MyRulebase->MyLOGmgr.logThisError(                  // Log the error.
            MyScanData, "scanMessageFile().isFileEmpty?",
            snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() FileEmpty!");
    }

Another clue is that in the log snippet you provided, there are hints of a problem brewing when there are sporadic instances of this error. Then, when there is a large block -- virtually all requests to open the files for scan are rejected by the OS. Either something made those files unavailable, or the OS was unable to handle the request.
I find it interesting also that the time required to report the error started at about 172 milliseconds and continued to climb to 406, 578, and then 656 before the restart. SNF does not make log entries in the classic log during a restart, by the way.

Note also the timestamps associated with these events and you can see that the event was precipitated by a dramatic rise in message rates. The first part of your log seems to indicate about 7-10 messages per second. During the large block of errors, the message rate appears to have been in excess of 120 (I counted approximately 126 at timestamp 20130627183819). That's an increase of at least an order of magnitude over the rate that was causing sporadic errors.

I suspect, based on the data you have provided, that something on your system generated a very large spike of activity that your IO subsystem was unable to manage, and this caused SNF scans to fail because SNF was unable to open the files it was asked to scan. Your restart of SNF apparently coincided with the event, but since all of the SMD file names are unique during the event, and since SNF has no way to generate scan requests on its own, SNF does not appear to have been the cause of the event in any way. It was able to record the event, nonetheless.

So the questions in my mind now are:

* Is there a way to improve your IO subsystem so that it can gain some headroom above 10 msg/sec?
* What caused the sudden dramatic spike that led to this episode?

On most tiny systems I monitor, scan times are generally 100 ms. On your system they are frequently in excess of 400 ms, which leads me to believe your system is a bit underpowered for its current load.

Hope this helps,

_M

--
Pete McNeil
Chief Scientist
ARM Research Labs, LLC
www.armresearch.com
866-770-1044 x7010
twitter/codedweller

# This message is sent to you because you are subscribed to the mailing list sniffer@sortmonster.com.
This list is for discussing Message Sniffer, Anti-spam, Anti-Malware, and related email topics.
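[Editorial aside: the per-second rate estimate Pete describes -- counting log entries that share a one-second timestamp such as 20130627183819 -- can be computed mechanically. This sketch assumes the 14-digit YYYYMMDDHHMMSS timestamp format seen in the log excerpts; the function names are illustrative.]

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// SNF log timestamps have one-second resolution (YYYYMMDDHHMMSS), so
// counting entries that share a timestamp gives messages per second.
std::map<std::string, int> perSecondCounts(const std::vector<std::string>& stamps) {
    std::map<std::string, int> counts;
    for (const auto& s : stamps) ++counts[s];   // one tally bucket per distinct second
    return counts;
}

// Peak rate = the busiest single second in the sample.
int peakRate(const std::vector<std::string>& stamps) {
    int peak = 0;
    for (const auto& kv : perSecondCounts(stamps)) peak = std::max(peak, kv.second);
    return peak;
}
```

Feeding the timestamps from the error block through peakRate would reproduce the "approximately 126 at 20130627183819" observation directly.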
[sniffer] Re: Slow processing times, errors
Pete,

I'm near positive that it's not system resources that are causing Sniffer to not be able to _access the files_. I believe these errors are a symptom and not the cause. You have to keep in mind that the messages that didn't throw errors were taking 30-90 seconds to scan, but immediately after a restart it was under 1 second. The system stayed the same; it was just the state of the service that was off in a bad way.

I did add a larger client about a month ago, around the time that this started, which did inch up load by between 1% and 5% I figure, but I can't say for sure that the two things are connected. I've seen much bigger changes, however, in spam volumes from single spammers.

I have looked at my SNFclient.exe.err log and found that the previous slowdowns were all represented in this file, and nothing much else since a smattering of other stuff in 2012. I believe that I/O could be the trigger, or general system load, but the error in the service that misses opening some files, and is otherwise slower than normal by 100 times, will persist when everything else is fine again. I figure that this is all triggered by a short-term lack of resources or a killer-message type of issue that does something like run away with memory. Certainly there were no recent changes on the server prior to this starting to happen, including Sniffer itself, which has been perfectly solid up until 5/22.

Regarding the ERROR_MSG_FILE batch that I sent you in that log: it did happen exactly when I restarted Sniffer, and in fact the SNFclient.exe.err log showed a different error while this was happening, so maybe this will point you to something else? That log says Could Not Connect! while the regular Sniffer log shows ERROR_MSG_FILE about 1/8th of the time while in a bad state.
When I restarted the Sniffer service, the regular log showed a bunch of ERROR_MSG_FILE in a row, but the SNFclient.exe.err log below shows XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek. You can match the message IDs with the other log that I provided. I believe that block of messages had already been handed to SNFclient.exe, but the Sniffer service hadn't yet responded, and so they were dumped as a batch into both logs during shutdown of the service.

20130627183807, arg1=F:\\proc\work\D862600e64269.smd : Could Not Connect!
20130627183808, arg1=F:\\proc\work\D86440177431f.smd : Could Not Connect!
20130627183808, arg1=F:\\proc\work\D861200ce41ce.smd : Could Not Connect!
20130627183809, arg1=F:\\proc\work\D864401734321.smd : Could Not Connect!
20130627183809, arg1=F:\\proc\work\D861400da41e3.smd : Could Not Connect!
20130627183810, arg1=F:\\proc\work\D862600d7425f.smd : Could Not Connect!
20130627183811, arg1=F:\\proc\work\D864a00e94346.smd : Could Not Connect!
20130627183811, arg1=F:\\proc\work\D8615019b41f4.smd : Could Not Connect!
20130627183813, arg1=F:\\proc\work\D862900e94282.smd : Could Not Connect!
20130627183815, arg1=F:\\proc\work\D863d01584306.smd : Could Not Connect!
20130627183817, arg1=F:\\proc\work\D86030158416f.smd : Could Not Connect!
20130627183818, arg1=F:\\proc\work\D862300e94255.smd : Could Not Connect!
20130627183819, arg1=F:\\proc\work\D862900e64281.smd : Could Not Connect!
20130627183819, arg1=F:\\proc\work\D864b00d74357.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D864800d7433c.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861901734205.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861d01774230.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8641016d4310.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D865000e64363.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D865000e14361.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D85fe00e64152.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8610016d41c5.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861f00ce4231.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8644018b4320.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8615017341ef.smd : XCI Error!: FileError
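[Editorial aside: correlating the two logs by message ID, as Matt suggests, amounts to extracting the unique SMD base name (e.g. D864b00d74357) from each line and using it as a join key. A sketch, assuming the line layout shown above; the function name is illustrative.]

```cpp
#include <cstddef>
#include <string>

// Pull the message ID out of an err-log line such as
//   20130627183819, arg1=F:\\proc\work\D864b00d74357.smd : Could Not Connect!
// by taking the path component between the last separator and ".smd".
// Returns "" if the line does not contain an SMD path.
std::string extractMessageId(const std::string& line) {
    std::size_t dot = line.rfind(".smd");
    if (dot == std::string::npos) return "";
    std::size_t slash = line.find_last_of("\\/", dot);  // last path separator before ".smd"
    if (slash == std::string::npos) return "";
    return line.substr(slash + 1, dot - slash - 1);
}
```

Running both logs through this and intersecting the resulting ID sets would confirm Matt's claim that the same batch of messages appears in each.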
[sniffer] Re: Slow processing times, errors
Matt:

I mentioned in a previous post that we had experienced something similar at about that time and resolved it a day or so later by re-installing Sniffer, when service restarts, reboots, and some basic troubleshooting did not give us the results we needed. At this point that still seems to have been effective (about 5 days now).

At the time, we did move things around to see whether it was related to the number of items in the queue or anywhere else within the structure of the mail system, and found it made no difference. A single item arriving in an empty queue was still not processed. CPU utilization was modest (single digit across 4 cores) and disk I/O was lighter than usual as it took place over a weekend. Memory utilization was a little higher than I'd like to see; we are addressing that now.

Following a suggestion from another ISP, we moved the spool folders onto a RAM drive a couple of months ago. That has worked well for us. We did rule it out as the source of the problem by moving back onto the conventional hard disk during the last part of the troubleshooting and for the first hour or two following the reload. We are processing on the RAM disk now and have been for over 4 days again.

For what it's worth . . .

Eric

-----Original Message-----
From: Message Sniffer Community [mailto:sniffer@sortmonster.com] On Behalf Of Matt
Sent: Friday, June 28, 2013 10:32 AM
To: Message Sniffer Community
Subject: [sniffer] Re: Slow processing times, errors

Pete,

Just after the restart of the Sniffer service, times dropped back down into the ms from 30+ seconds before, so what I am saying is that if I/O was the issue, it was merely the trigger for something that put the service in a bad state when it started. I/O issues are not persistent, but could happen from time to time, I'm sure.
Restarting Sniffer with a backlog of 2,500 messages and normal peak traffic will not re-trigger the condition, and I press Declude to run up to 300 messages at a time in situations like that, and the CPUs are pegged until the backlog clears. In the past, I restarted the whole system, not knowing why it worked. During normal peak times (without bursts), Declude is processing about 125 messages at a time, which take an average of 6 seconds to fully process, and therefore Sniffer is probably handling only about 10 messages at a time (at peak). Since 5/22 I have seen 4 or 5 different events like this, and I confirmed that they are all present in the SNFclient.exe.err log.

Matt

On 6/28/2013 12:41 PM, Pete McNeil wrote:

On 2013-06-28 12:10, Matt wrote: I am looking to retool presently just because it's time. So if you are convinced that this is due to low resources, don't concern yourself with it.

Ok. It makes sense that the ~200 messages all at once could have happened at the restart. SNFClient will keep trying for 30-90 seconds before it gives up and spits out its error file. That's where your delays are coming from. SNF itself was clocking only about 100-800 ms for all of the scans. The error result you report is exactly the one sent by SNF -- that it was unable to open the file.

I am very sure this is resource related -- your scans should not be taking the amount of time they are, and I suspect most of that time is eaten up trying to get to the files. The occasional errors of the same type are a good hint that IO is to blame. The new spam that we've seen often includes large messages -- so that's going to put a higher load on IO resources -- I'll bet that the increased volume and large message sizes are pushing IO over the edge or at least very close to it.

Best,

_M
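[Editorial aside: the SNFClient behavior Pete describes -- retrying for a fixed window before giving up and writing the error file -- has the general shape below. This is an illustrative sketch, not SNFClient's actual implementation; the timeout window and polling interval are assumptions.]

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Keep re-attempting an operation (e.g. connecting to the scanning
// service) until it succeeds or a fixed time window is exhausted.
// Returns true on success; false means the caller should record the
// failure (the "Could Not Connect!" entries in SNFclient.exe.err).
bool retryUntil(const std::function<bool()>& attempt,
                std::chrono::milliseconds window,
                std::chrono::milliseconds interval) {
    auto deadline = std::chrono::steady_clock::now() + window;
    while (std::chrono::steady_clock::now() < deadline) {
        if (attempt()) return true;             // Connected: scan proceeds.
        std::this_thread::sleep_for(interval);  // Wait before the next try.
    }
    return false;  // Window exhausted: caller logs the error.
}
```

This shape also explains the symptom Matt observed: while the service is wedged, every caller blocks inside the retry window, so per-message times balloon to 30-90 seconds even though the scan engine itself is fast once reachable.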
[sniffer] Re: Slow processing times, errors
I'll certainly look more closely next time. Hopefully I'll be migrated before this happens again :)

Matt

On 6/28/2013 1:44 PM, Darin Cox wrote:

How about running performance monitor to watch disk I/O, mem, cpu, page file, etc. over time in the hopes of catching one of the events?

Darin.
[sniffer] Re: Slow processing times, errors
Eric,

I'm guessing, based on what you were seeing, that it was unrelated to what I was seeing. Sniffer never actually died; it just got over 100 times slower, and 1/8th of the time it timed out. This never happened before 5/22, and this same server has been there for years, with the same installation of Sniffer for 2 years or so. I would think that if the issue was I/O (under normal conditions), it would have happened before 5/22, as there were clearly bursty periods often enough, and my own traffic didn't change dramatically enough to explain it happening 4 to 5 times in one month. The server itself could have some issues that could be causing this. Maybe the file system is screwy, or Windows itself, or memory errors, or whatever.

Matt
[sniffer] Re: Slow processing times, errors
Matt:

Coincidentally (I hope) this happened to us on the 22nd also. It did not stop working completely, although we didn't get the throughput you did. We also saw the messages indicating it was not able to open the file. Pretty much the same message as in your first post, and not one I've seen before.

Eric

Sent using SmarterSync Over-The-Air sync for iPad, iPhone, BlackBerry and other SmartPhones. May use speech to text. If something seems odd please don't hesitate to ask for clarification. E.O.E.
[sniffer] Re: Slow processing times, errors
I should add that Sniffer has been pretty much trouble free for us. We have been using it since before the ARM Research days (10+ years as a guess).

One of the specialized clients we host for goes through a cycle every few years where they are very publicly visible and there are a number of attempts to infect them and do other things to take mail and other services down, as well as huge volumes of spam directed at a couple of hundred additional published and very visible email addresses. For several weeks the mail volume and bandwidth utilization go up dramatically (10X). Sniffer is in-line with other processes and has handled the load without a wrinkle.

Whatever we've seen just recently, and as I reflect on it possibly once before, is definitely out of the ordinary for us.
[sniffer] Re: Slow processing times, errors
I just looked through my history of the SNFclient.exe.err log and I was a bit off: there were two small occurrences before 5/22, and both were so small they were without a doubt unnoticeable. Here's my history since installation on 4/10/2012, with the number of occurrences of "Could Not Connect!" noted:

11/19/2012 (949 occurrences) - no backup in excess of 2 minutes occurred
4/26/2012 (181 occurrences) - no backup in excess of 2 minutes occurred
5/22 (3,568 occurrences) - no backup in excess of 2 minutes occurred
5/31 (13,745 occurrences)
6/3 (2,630 occurrences) - no backup in excess of 2 minutes occurred
6/11 (26,194 occurrences)
6/13 (28,633 occurrences)
6/14 (83,342 occurrences)
6/17 (5,952 occurrences) - no backup in excess of 2 minutes occurred
6/21 (10,894 occurrences)
6/27 (34,959 occurrences)

At peak times we are probably doing between 30,000 and 40,000 messages per hour. When this hit at peak hour yesterday, the server was backed up 2 minutes within 10 minutes of this starting. If it happens outside of business hours, it's likely not seen at all. The backup was resolved within 1 hour using different Declude settings, but Sniffer wasn't restarted until 2 hours and 45 minutes after the problem started, so those ~35,000 occurrences came from 165 minutes of email.

After looking through my logs, it doesn't seem that I rebooted on hardly any of these occasions; I tend not to want to take the server down outside of rush hour. I do tweak Declude for rush processing when backed up, so Declude is restarted, which stops it sending files to Sniffer for a period of time. It seems that Sniffer is mostly healing itself without a restart of its service, though maybe the updates cause a service restart/reset?

I also checked and found that we added that larger client on 6/1; outside of that, customer counts have been fairly consistent.

Matt

On 6/28/2013 2:56 PM, e...@protologic.com wrote:

Matt: Coincidentally (I hope) this happened to us on the 22nd also.
It did not stop working completely, although we didn't get the throughput you did. We also saw the messages indicating it was not able to open the file, pretty much the same message as in your first post and not one I've seen before.

Eric

On 2013-06-28, at 11:39 AM, Matt wrote:

Eric,

I'm guessing, based on what you were seeing, that it was unrelated to what I was seeing. Sniffer never actually died; it just got over 100 times slower, and 1/8th of the time it timed out. This never happened before 5/22, and this same server has been there for years, with the same installation of Sniffer for 2 years or so. I would think that if the issue were I/O (under normal conditions), it would have happened before 5/22, since there were clearly bursty periods often enough, and my own traffic didn't change dramatically enough for this to happen 4 to 5 times in one month. The server itself could have some issues that could be causing this. Maybe the file system is screwy, or Windows itself, or memory errors, or whatever.

Matt

On 6/28/2013 2:12 PM, E. H. (Eric) Fletcher wrote:

Matt:

I mentioned in a previous post that we had experienced something similar at about that time and resolved it a day or so later by re-installing Sniffer, after service restarts, reboots and some basic troubleshooting did not give us the results we needed. At this point that still seems to have been effective (about 5 days now). At the time, we did move things around to see whether it was related to the number of items in the queue or anywhere else within the structure of the mail system, and found it made no difference. A single item arriving in an empty queue was still not processed. CPU utilization was modest (single digit across 4 cores) and disk I/O was lighter than usual, as it took place over a weekend.
Memory utilization was a little higher than I'd like to see; we are addressing that now.

Following a suggestion from another ISP, we moved the spool folders onto a RAM drive a couple of months ago. That has worked well for us, and we did rule it out as the source of this problem by moving back onto the conventional hard disk during the last part of the troubleshooting and for the first hour or two following the reload. We are processing on the RAM disk now and have been for over 4 days again.

For what it's worth . . .

Eric

-----Original Message-----
From: Message Sniffer Community [mailto:sniffer@sortmonster.com] On Behalf Of Matt
Sent: Friday, June 28, 2013 10:32 AM
To: Message Sniffer Community
Subject: [sniffer] Re: Slow processing times, errors

Pete,

Just after the restart of the Sniffer service, times dropped back down into the ms from 30+
[sniffer] Re: Slow processing times, errors
On 2013-06-28 16:49, Matt wrote: maybe the updates will cause a service restart/reset?

Rulebase updates (all updates, in fact) happen without restarting anything. SNF simply loads the new configuration, validates it, uses it for new scans, and when all of the old scans are complete it drops the old data. All of this happens without impacting scan operations.

_M

--
Pete McNeil
Chief Scientist
ARM Research Labs, LLC
www.armresearch.com
866-770-1044 x7010
twitter/codedweller

# This message is sent to you because you are subscribed to the mailing list sniffer@sortmonster.com.
This list is for discussing Message Sniffer, Anti-spam, Anti-Malware, and related email topics. For more information see http://www.armresearch.com
To unsubscribe, E-mail to: sniffer-...@sortmonster.com
To switch to the DIGEST mode, E-mail to sniffer-dig...@sortmonster.com
To switch to the INDEX mode, E-mail to sniffer-in...@sortmonster.com
Send administrative queries to sniffer-requ...@sortmonster.com
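[Editor's note] The restart-free update scheme Pete describes (load, validate, use for new scans, drop old data when the last old scan finishes) is a classic hot-swap pattern. Below is a minimal C++ sketch of one way such a scheme can be built; the names (Rulebase, RulebaseManager, update, acquire) are illustrative assumptions, not SNF's actual API.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a loaded rulebase snapshot.
struct Rulebase {
    std::string version;
    bool valid = false;     // set true only after validation succeeds
};

class RulebaseManager {
    std::shared_ptr<const Rulebase> current_;
public:
    // Called by the updater. New scans see the new rulebase immediately;
    // scans already in flight keep their own shared_ptr, so the old data
    // is freed only after the last old scan releases it.
    bool update(std::shared_ptr<Rulebase> fresh) {
        if (!fresh || !fresh->valid) return false;   // validate before swapping
        std::atomic_store(&current_,
                          std::shared_ptr<const Rulebase>(fresh));
        return true;
    }

    // Called at the start of each scan: pins the current rulebase for the
    // duration of that scan, regardless of concurrent updates.
    std::shared_ptr<const Rulebase> acquire() const {
        return std::atomic_load(&current_);
    }
};
```

Because each scan holds a reference for its lifetime, no scan is ever interrupted by an update, which matches the "without impacting scan operations" behavior described above. (C++20 replaces the free `std::atomic_load`/`std::atomic_store` overloads with `std::atomic<std::shared_ptr<T>>`.)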