[sniffer] Re: Slow processing times, errors
On 2013-06-27 20:01, Matt wrote: I'm attaching a snippet of my log. About 100 lines past the start, where you see a smattering of error messages, you then see a large block of them while the Sniffer service is restarting, and then after that no errors at all. There have in fact been no errors at all in the several hours since this restart of Sniffer.

I can promise you that the error you're reporting comes directly from a problem with the file system. Here is the code where that error is generated. Put simply, the code tries to open the file and determine its size. If that doesn't work it throws the ERROR_MSG_FILE exception in one of two forms -- that's what ends up in the log.

    try {                                                   // Try opening the message file.
        MessageFile.open(MessageFilePath.c_str(),
                         ios::in | ios::binary);            // Open the file, binary mode.
        MessageFile.seekg(0, ios::end);                     // Find the end of the file,
        MessageFileSize = MessageFile.tellg();              // read that position as the size,
        MessageFile.seekg(0, ios::beg);                     // then go back to the beginning.
        MyScanData.ScanSize = MessageFileSize;              // Capture the message file size.
    }
    catch(...) {                                            // Trouble? Throw FileError.
        MyRulebase->MyLOGmgr.logThisError(                  // Log the error.
            MyScanData, "scanMessageFile().open",
            snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() Open/Seek");
    }

    if(0 == MessageFileSize) {                              // Handle zero length files.
        MessageFile.close();                                // No need to keep this open.
        MyRulebase->MyLOGmgr.logThisError(                  // Log the error.
            MyScanData, "scanMessageFile().isFileEmpty?",
            snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() FileEmpty!");
    }

Another clue is that in the log snippet you provided, there are hints of a problem brewing when there are sporadic instances of this error. Then, when there is a large block -- virtually all requests to open the files for scan are rejected by the OS. Either something made those files unavailable, or the OS was unable to handle the request.
I find it interesting also that the time required to report the error started at about 172 milliseconds and continued to climb to 406, 578, and then 656 before the restart. SNF does not make log entries in the classic log during a restart, by the way.

Note also the timestamps associated with these events and you can see that the event was precipitated by a dramatic rise in message rates. The first part of your log seems to indicate about 7-10 messages per second. During the large block of errors, the message rate appears to have been in excess of 120 (I counted approximately 126 at timestamp 20130627183819). That's an increase of at least an order of magnitude over the rate that was causing sporadic errors.

I suspect, based on the data you have provided, that something on your system generated a very large spike of activity that your IO subsystem was unable to manage, and this caused SNF scans to fail because SNF was unable to open the files it was asked to scan. Your restart of SNF apparently coincided with the event, but since all of the SMD file names are unique during the event, and since SNF has no way to generate scan requests on its own, SNF does not appear to have been the cause of the event in any way. It was able to record the event, nonetheless.

So the questions in my mind now are:

* Is there a way to improve your IO subsystem so that it can gain some headroom above 10 msg/sec?
* What caused the sudden dramatic spike that led to this episode?

On most tiny systems I monitor, scan times are generally 100 ms. On your system they are frequently in excess of 400 ms, which leads me to believe your system is a bit underpowered for its current load.

Hope this helps,

_M

--
Pete McNeil
Chief Scientist
ARM Research Labs, LLC
www.armresearch.com
866-770-1044 x7010
twitter/codedweller

# This message is sent to you because you are subscribed to the mailing list sniffer@sortmonster.com.
This list is for discussing Message Sniffer, Anti-spam, Anti-Malware, and related email topics.
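[Editorial aside: the per-second rate estimate Pete describes -- counting log entries that share a one-second timestamp such as 20130627183819 -- can be computed mechanically. This sketch assumes the 14-digit YYYYMMDDHHMMSS timestamp format seen in the log excerpts; the function names are illustrative.]

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// SNF log timestamps have one-second resolution (YYYYMMDDHHMMSS), so
// counting entries that share a timestamp gives messages per second.
std::map<std::string, int> perSecondCounts(const std::vector<std::string>& stamps) {
    std::map<std::string, int> counts;
    for (const auto& s : stamps) ++counts[s];   // one tally bucket per distinct second
    return counts;
}

// Peak rate = the busiest single second in the sample.
int peakRate(const std::vector<std::string>& stamps) {
    int peak = 0;
    for (const auto& kv : perSecondCounts(stamps)) peak = std::max(peak, kv.second);
    return peak;
}
```

Feeding the timestamps from the error block through peakRate would reproduce the "approximately 126 at 20130627183819" observation directly.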
[sniffer] Re: Slow processing times, errors
Pete,

I'm near positive that it's not system resources that are causing Sniffer to not be able to _access the files_. I believe these errors are a symptom and not the cause. You have to keep in mind that the messages that didn't throw errors were taking 30-90 seconds to scan, but immediately after a restart it was under 1 second. The system stayed the same; it was just the state of the service that was off in a bad way.

I did add a larger client about a month ago, around the time that this started, which did inch up load by between 1% and 5% I figure, but I can't say for sure that the two things are connected. I've seen much bigger changes, however, in spam volumes from single spammers.

I have looked at my SNFclient.exe.err log and found that the previous slowdowns were all represented in this file, and nothing much else since a smattering of other stuff in 2012. I believe that I/O could be the trigger, or general system load, but the error in the service that misses opening some files, and is otherwise slower than normal by 100 times, will persist when everything else is fine again. I figure that this is all triggered by a short-term lack of resources or a killer-message type of issue that does something like run away with memory. Certainly there were no recent changes on the server prior to this starting to happen, including Sniffer itself, which has been perfectly solid up until 5/22.

Regarding the ERROR_MSG_FILE batch that I sent you in that log: it did happen exactly when I restarted Sniffer, and in fact the SNFclient.exe.err log showed a different error while this was happening, so maybe this will point you to something else? That log says Could Not Connect! while the regular Sniffer log shows ERROR_MSG_FILE about 1/8th of the time while in a bad state.
When I restarted the Sniffer service, the regular log showed a bunch of ERROR_MSG_FILE in a row, but the SNFclient.exe.err log below shows XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek. You can match the message IDs with the other log that I provided. I believe that block of messages had already been handed to SNFclient.exe, but the Sniffer service hadn't yet responded, and so they were dumped as a batch into both logs during shutdown of the service.

20130627183807, arg1=F:\\proc\work\D862600e64269.smd : Could Not Connect!
20130627183808, arg1=F:\\proc\work\D86440177431f.smd : Could Not Connect!
20130627183808, arg1=F:\\proc\work\D861200ce41ce.smd : Could Not Connect!
20130627183809, arg1=F:\\proc\work\D864401734321.smd : Could Not Connect!
20130627183809, arg1=F:\\proc\work\D861400da41e3.smd : Could Not Connect!
20130627183810, arg1=F:\\proc\work\D862600d7425f.smd : Could Not Connect!
20130627183811, arg1=F:\\proc\work\D864a00e94346.smd : Could Not Connect!
20130627183811, arg1=F:\\proc\work\D8615019b41f4.smd : Could Not Connect!
20130627183813, arg1=F:\\proc\work\D862900e94282.smd : Could Not Connect!
20130627183815, arg1=F:\\proc\work\D863d01584306.smd : Could Not Connect!
20130627183817, arg1=F:\\proc\work\D86030158416f.smd : Could Not Connect!
20130627183818, arg1=F:\\proc\work\D862300e94255.smd : Could Not Connect!
20130627183819, arg1=F:\\proc\work\D862900e64281.smd : Could Not Connect!
20130627183819, arg1=F:\\proc\work\D864b00d74357.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D864800d7433c.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861901734205.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861d01774230.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8641016d4310.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D865000e64363.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D865000e14361.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D85fe00e64152.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8610016d41c5.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D861f00ce4231.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8644018b4320.smd : XCI Error!: FileError snf_EngineHandler::scanMessageFile() Open/Seek
20130627183819, arg1=F:\\proc\work\D8615017341ef.smd : XCI Error!: FileError
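[Editorial aside: correlating the two logs by message ID, as Matt suggests, amounts to extracting the unique SMD base name (e.g. D864b00d74357) from each line and using it as a join key. A sketch, assuming the line layout shown above; the function name is illustrative.]

```cpp
#include <cstddef>
#include <string>

// Pull the message ID out of an err-log line such as
//   20130627183819, arg1=F:\\proc\work\D864b00d74357.smd : Could Not Connect!
// by taking the path component between the last separator and ".smd".
// Returns "" if the line does not contain an SMD path.
std::string extractMessageId(const std::string& line) {
    std::size_t dot = line.rfind(".smd");
    if (dot == std::string::npos) return "";
    std::size_t slash = line.find_last_of("\\/", dot);  // last path separator before ".smd"
    if (slash == std::string::npos) return "";
    return line.substr(slash + 1, dot - slash - 1);
}
```

Running both logs through this and intersecting the resulting ID sets would confirm Matt's claim that the same batch of messages appears in each.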
[sniffer] Re: Slow processing times, errors
Matt:

I mentioned in a previous post that we had experienced something similar at about that time and resolved it a day or so later by re-installing Sniffer, when service restarts, reboots, and some basic troubleshooting did not give us the results we needed. At this point that still seems to have been effective (about 5 days now).

At the time, we did move things around to see whether it was related to the number of items in the queue or anywhere else within the structure of the mail system, and found it made no difference. A single item arriving in an empty queue was still not processed. CPU utilization was modest (single digit across 4 cores) and disk I/O was lighter than usual as it took place over a weekend. Memory utilization was a little higher than I'd like to see; we are addressing that now.

Following a suggestion from another ISP, we moved the spool folders onto a RAM drive a couple of months ago. That has worked well for us. We did rule it out as the source of the problem by moving back onto the conventional hard disk during the last part of the troubleshooting and for the first hour or two following the reload. We are processing on the RAM disk now and have been for over 4 days again.

For what it's worth . . .

Eric

-----Original Message-----
From: Message Sniffer Community [mailto:sniffer@sortmonster.com] On Behalf Of Matt
Sent: Friday, June 28, 2013 10:32 AM
To: Message Sniffer Community
Subject: [sniffer] Re: Slow processing times, errors

Pete,

Just after the restart of the Sniffer service, times dropped back down into the ms from 30+ seconds before, so what I am saying is that if I/O was the issue, it was merely the trigger for something that put the service in a bad state when it started. I/O issues are not persistent, but could happen from time to time, I'm sure.
Restarting Sniffer with a backlog of 2,500 messages and normal peak traffic will not re-trigger the condition, and I press Declude to run up to 300 messages at a time in situations like that, and the CPUs are pegged until the backlog clears. In the past, I restarted the whole system, not knowing why it worked. During normal peak times (without bursts), Declude is processing about 125 messages at a time, which take an average of 6 seconds to fully process, and therefore Sniffer is probably handling only about 10 messages at a time (at peak). Since 5/22 I have seen 4 or 5 different events like this, and I confirmed that they are all present in the SNFclient.exe.err log.

Matt

On 6/28/2013 12:41 PM, Pete McNeil wrote:

On 2013-06-28 12:10, Matt wrote: I am looking to retool presently just because it's time. So if you are convinced that this is due to low resources, don't concern yourself with it.

Ok. It makes sense that the ~200 messages all at once could have happened at the restart. SNFClient will keep trying for 30-90 seconds before it gives up and spits out its error file. That's where your delays are coming from. SNF itself was clocking only about 100-800 ms for all of the scans. The error result you report is exactly the one sent by SNF -- that it was unable to open the file.

I am very sure this is resource related -- your scans should not be taking the amount of time they are, and I suspect most of that time is eaten up trying to get to the files. The occasional errors of the same type are a good hint that IO is to blame. The new spam that we've seen often includes large messages -- so that's going to put a higher load on IO resources -- I'll bet that the increased volume and large message sizes are pushing IO over the edge or at least very close to it.

Best,

_M
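[Editorial aside: the SNFClient behavior Pete describes -- retrying for a fixed window before giving up and writing the error file -- has the general shape below. This is an illustrative sketch, not SNFClient's actual implementation; the timeout window and polling interval are assumptions.]

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Keep re-attempting an operation (e.g. connecting to the scanning
// service) until it succeeds or a fixed time window is exhausted.
// Returns true on success; false means the caller should record the
// failure (the "Could Not Connect!" entries in SNFclient.exe.err).
bool retryUntil(const std::function<bool()>& attempt,
                std::chrono::milliseconds window,
                std::chrono::milliseconds interval) {
    auto deadline = std::chrono::steady_clock::now() + window;
    while (std::chrono::steady_clock::now() < deadline) {
        if (attempt()) return true;             // Connected: scan proceeds.
        std::this_thread::sleep_for(interval);  // Wait before the next try.
    }
    return false;  // Window exhausted: caller logs the error.
}
```

This shape also explains the symptom Matt observed: while the service is wedged, every caller blocks inside the retry window, so per-message times balloon to 30-90 seconds even though the scan engine itself is fast once reachable.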
[sniffer] Re: Slow processing times, errors
I'll certainly look more closely next time. Hopefully I'll be migrated before this happens again :)

Matt

On 6/28/2013 1:44 PM, Darin Cox wrote:

How about running performance monitor to watch disk I/O, mem, cpu, page file, etc. over time in the hopes of catching one of the events?

Darin.
[sniffer] Re: Slow processing times, errors
Eric,

I'm guessing, based on what you were seeing, that it was unrelated to what I was seeing. Sniffer never actually died; it just got over 100 times slower, and 1/8th of the time it timed out. This never happened before 5/22, and this same server has been there for years, with the same installation of Sniffer for 2 years or so. I would think that if the issue was I/O (under normal conditions), it would have happened before 5/22, as there were clearly bursty periods often enough, and my own traffic didn't change dramatically enough to explain it happening 4 to 5 times in one month. The server itself could have some issues that could be causing this. Maybe the file system is screwy, or Windows itself, or memory errors, or whatever.

Matt
[sniffer] Re: Slow processing times, errors
Matt:

Coincidentally (I hope) this happened to us on the 22nd also. It did not stop working completely, although we didn't get the throughput you did. We also saw the messages indicating it was not able to open the file. Pretty much the same message as in your first post, and not one I've seen before.

Eric

Sent using SmarterSync Over-The-Air sync for iPad, iPhone, BlackBerry and other SmartPhones. May use speech to text. If something seems odd please don't hesitate to ask for clarification. E.O.E.
[sniffer] Re: Slow processing times, errors
I should add that Sniffer has been pretty much trouble free for us. We have been using it since before the ARM Research days (10+ years as a guess).

One of the specialized clients we host for goes through a cycle every few years where they are very publicly visible and there are a number of attempts to infect them and do other things to take mail and other services down, as well as huge volumes of spam directed at a couple of hundred additional published and very visible email addresses. For several weeks the mail volume and bandwidth utilization go up dramatically (10X). Sniffer is in-line with other processes and has handled the load without a wrinkle.

Whatever we've seen just recently, and as I reflect on it possibly once before, is definitely out of the ordinary for us.
[sniffer] Re: Slow processing times, errors
I just looked through my history of the SNFclient.exe.err log and I was a bit off: there were two small occurrences before 5/22, and both were so small they were without a doubt unnoticeable. Here's my history since installation on 4/10/2012, with the number of occurrences of "Could Not Connect!" noted:

11/19/2012 (949 occurrences) - no backup in excess of 2 minutes occurred
4/26/2012 (181 occurrences) - no backup in excess of 2 minutes occurred
5/22 (3,568 occurrences) - no backup in excess of 2 minutes occurred
5/31 (13,745 occurrences)
6/3 (2,630 occurrences) - no backup in excess of 2 minutes occurred
6/11 (26,194 occurrences)
6/13 (28,633 occurrences)
6/14 (83,342 occurrences)
6/17 (5,952 occurrences) - no backup in excess of 2 minutes occurred
6/21 (10,894 occurrences)
6/27 (34,959 occurrences)

At peak times we are probably doing between 30,000 and 40,000 messages per hour. When this hit at peak hour yesterday, the server was backed up 2 minutes within 10 minutes of this starting. If it happens outside of business hours, it's likely not seen at all. The backup was resolved within 1 hour using different Declude settings, but Sniffer wasn't restarted until 2 hours and 45 minutes after the problem started, so those ~35,000 occurrences came from 165 minutes of email.

After looking through my logs, it doesn't seem that I rebooted on hardly any of these occasions; I tend not to want to take the server down outside of rush hour. I do tweak Declude for rush processing when backed up, so Declude is restarted, which stops it sending files to Sniffer for a period of time. It seems that Sniffer is mostly healing itself without a restart of its service, though maybe the updates cause a service restart/reset?

I also checked and found that we added that larger client on 6/1; outside of that, customer counts have been fairly consistent.

Matt

On 6/28/2013 2:56 PM, e...@protologic.com wrote:

Matt: Coincidentally (I hope) this happened to us on the 22nd also.
It did not stop working completely, although we didn't get the throughput you did. We also saw the messages indicating it was not able to open the file, pretty much the same message as in your first post and not one I've seen before.

Eric

On 2013-06-28, at 11:39 AM, Matt wrote:

Eric,

I'm guessing, based on what you were seeing, that it was unrelated to what I was seeing. Sniffer never actually died; it just got over 100 times slower, and 1/8th of the time it timed out. This never happened before 5/22, and this same server has been there for years, with the same installation of Sniffer for 2 years or so. I would think that if the issue were I/O (under normal conditions), it would have happened before 5/22, since there were clearly bursty periods often enough, and my own traffic didn't change dramatically enough for this to happen 4 to 5 times in one month. The server itself could have some issues that could be causing this. Maybe the file system is screwy, or Windows itself, or memory errors, or whatever.

Matt

On 6/28/2013 2:12 PM, E. H. (Eric) Fletcher wrote:

Matt:

I mentioned in a previous post that we had experienced something similar at about that time and resolved it a day or so later by re-installing Sniffer, after service restarts, reboots and some basic troubleshooting did not give us the results we needed. At this point that still seems to have been effective (about 5 days now). At the time, we did move things around to see whether it was related to the number of items in the queue or anywhere else within the structure of the mail system, and found it made no difference. A single item arriving in an empty queue was still not processed. CPU utilization was modest (single digit across 4 cores) and disk I/O was lighter than usual, as it took place over a weekend.
Memory utilization was a little higher than I'd like to see; we are addressing that now.

Following a suggestion from another ISP, we moved the spool folders onto a RAM drive a couple of months ago. That has worked well for us, and we did rule it out as the source of this problem by moving back onto the conventional hard disk during the last part of the troubleshooting and for the first hour or two following the reload. We are processing on the RAM disk now and have been for over 4 days again.

For what it's worth . . .

Eric

-----Original Message-----
From: Message Sniffer Community [mailto:sniffer@sortmonster.com] On Behalf Of Matt
Sent: Friday, June 28, 2013 10:32 AM
To: Message Sniffer Community
Subject: [sniffer] Re: Slow processing times, errors

Pete,

Just after the restart of the Sniffer service, times dropped back down into the ms from 30+
[sniffer] Re: Slow processing times, errors
On 2013-06-28 16:49, Matt wrote: maybe the updates will cause a service restart/reset?

Rulebase updates (all updates, in fact) happen without restarting anything. SNF simply loads the new configuration, validates it, uses it for new scans, and when all of the old scans are complete it drops the old data. All of this happens without impacting scan operations.

_M

--
Pete McNeil
Chief Scientist
ARM Research Labs, LLC
www.armresearch.com
866-770-1044 x7010
twitter/codedweller

# This message is sent to you because you are subscribed to the mailing list sniffer@sortmonster.com.
This list is for discussing Message Sniffer, Anti-spam, Anti-Malware, and related email topics. For more information see http://www.armresearch.com
To unsubscribe, E-mail to: sniffer-...@sortmonster.com
To switch to the DIGEST mode, E-mail to sniffer-dig...@sortmonster.com
To switch to the INDEX mode, E-mail to sniffer-in...@sortmonster.com
Send administrative queries to sniffer-requ...@sortmonster.com
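[Editor's note] The restart-free update scheme Pete describes (load, validate, use for new scans, drop old data when the last old scan finishes) is a classic hot-swap pattern. Below is a minimal C++ sketch of one way such a scheme can be built; the names (Rulebase, RulebaseManager, update, acquire) are illustrative assumptions, not SNF's actual API.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a loaded rulebase snapshot.
struct Rulebase {
    std::string version;
    bool valid = false;     // set true only after validation succeeds
};

class RulebaseManager {
    std::shared_ptr<const Rulebase> current_;
public:
    // Called by the updater. New scans see the new rulebase immediately;
    // scans already in flight keep their own shared_ptr, so the old data
    // is freed only after the last old scan releases it.
    bool update(std::shared_ptr<Rulebase> fresh) {
        if (!fresh || !fresh->valid) return false;   // validate before swapping
        std::atomic_store(&current_,
                          std::shared_ptr<const Rulebase>(fresh));
        return true;
    }

    // Called at the start of each scan: pins the current rulebase for the
    // duration of that scan, regardless of concurrent updates.
    std::shared_ptr<const Rulebase> acquire() const {
        return std::atomic_load(&current_);
    }
};
```

Because each scan holds a reference for its lifetime, no scan is ever interrupted by an update, which matches the "without impacting scan operations" behavior described above. (C++20 replaces the free `std::atomic_load`/`std::atomic_store` overloads with `std::atomic<std::shared_ptr<T>>`.)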