Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
On 04/04/2013 13:23, Ludovic Marcotte wrote:
> On 04/04/13 05:00, Thibault Le Meur wrote:
>> I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
>> Rebooting the server was the only way to get back to a working SOGo.
>
> You just lose evidence by doing this.

Yes, I know, but sometimes getting back to a working state quickly is more important :( It's the balance between resolving problems and resolving incidents ;-)

> There are two possibilities for SOGo using 100% CPU:
>
> 1. the *parent* process is trying to find a free child and all of them are busy because of slow subsystems (LDAP, database, IMAP server, SMTP server or even remote HTTP servers for remote ICS subscriptions). If all children are busy, the parent process will spin so quickly that it consumes 100% CPU, appearing stuck when it isn't;
>
> 2. a *child* process has gone awry because of a broken subsystem or a bug in the code. Most of the time, it's due to unhandled IMAP "traffic" (abrupt connection close due to server bugs, broken server responses, broken mails not passed correctly by the server, etc.). The IMAP code should be more resilient to this, but sope-mime is just horrible, and should eventually be replaced by the much cleaner Pantomime framework.
>
> 1. can be tuned quite easily, by carefully increasing the workers limit.

Thanks for the advice. I've seen the following proposals on this list:

    sudo -u sogo defaults write sogod SxVMemLimit 1024
    sudo -u sogo defaults write sogod WOWorkersCount 32

I just don't feel that this kind of setup is really safe... couldn't it result in very high memory consumption?

> 2. is a bug. When it happens, simply attach to the *child* process and produce a stack trace. Then, file a bug report with all the relevant data, including the culprit email message (which can be found in the sogo.log file). All of this is documented here:
>
> http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html

Okay, I'll try this next time.

Thanks again for your useful tips,
Thibault
-- 
users@sogo.nu
https://inverse.ca/sogo/lists
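A rough way to put a number on the memory question raised above. This is only a sketch: it assumes SxVMemLimit is a per-child ceiling expressed in megabytes, and that every worker could in principle grow to that ceiling at the same time. The function name `worst_case_mem_mb` is hypothetical and is not part of SOGo.

```shell
#!/bin/sh
# Hypothetical helper, not a SOGo tool.
# Assumption: SxVMemLimit is a per-child limit in megabytes, so the
# theoretical ceiling is simply workers * limit.
worst_case_mem_mb() {
    workers=$1      # value given to WOWorkersCount
    limit_mb=$2     # value given to SxVMemLimit
    echo $(( workers * limit_mb ))
}
```

With the values proposed on the list, `worst_case_mem_mb 32 1024` gives 32768, i.e. about 32 GB as a theoretical upper bound. Real usage should normally sit well below that, since it would need all 32 children to reach the limit at once.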
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Hi Ludovic, list,

> 1. can be tuned quite easily, by carefully increasing the workers limit.

Following suggestions from the list, I have increased the workers to 32, so this is likely not the issue anymore, I guess.

> 2. is a bug. When it happens, simply attach to the *child* process and produce a stack trace. Then, file a bug report with all the relevant data, including the culprit email message (which can be found in the sogo.log file). All of this is documented here:
>
> http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html

I'm now running sogo the 'normal' way again, not under gdb. I guess I need to run it under gdb again. I did that earlier today and provided (what I think were) two backtraces. But I guess now that those were not what is needed...? The FAQ article talks about a back trace, not a stack trace?

Anyway, thanks for the response!

MJ
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
On 04/04/13 05:00, Thibault Le Meur wrote:
> I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
> Rebooting the server was the only way to get back to a working SOGo.

You just lose evidence by doing this.

There are two possibilities for SOGo using 100% CPU:

1. the *parent* process is trying to find a free child and all of them are busy because of slow subsystems (LDAP, database, IMAP server, SMTP server or even remote HTTP servers for remote ICS subscriptions). If all children are busy, the parent process will spin so quickly that it consumes 100% CPU, appearing stuck when it isn't;

2. a *child* process has gone awry because of a broken subsystem or a bug in the code. Most of the time, it's due to unhandled IMAP "traffic" (abrupt connection close due to server bugs, broken server responses, broken mails not passed correctly by the server, etc.). The IMAP code should be more resilient to this, but sope-mime is just horrible, and should eventually be replaced by the much cleaner Pantomime framework.

1. can be tuned quite easily, by carefully increasing the workers limit.

2. is a bug. When it happens, simply attach to the *child* process and produce a stack trace. Then, file a bug report with all the relevant data, including the culprit email message (which can be found in the sogo.log file). All of this is documented here:

http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html

Thanks,

-- 
Ludovic Marcotte
+1.514.755.3630 :: www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)
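The two cases described above can usually be told apart from the process table before attaching a debugger. The following is a minimal sketch, not an official procedure: it assumes, as the watchdog script elsewhere in this thread does, that the parent sogod is the process whose PPID is 1. The function name `classify_busy` and the 90% threshold are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "PID PPID %CPU" lines on stdin and labels
# every process at or above the threshold as "parent" or "child".
# Assumption: the parent sogod is the one whose PPID is 1.
classify_busy() {
    threshold=${1:-90}
    while read -r pid ppid pcpu; do
        cpu=${pcpu%%.*}                     # drop the fractional part
        # Skip idle processes (and any line whose CPU field is not a number).
        [ "$cpu" -ge "$threshold" ] 2>/dev/null || continue
        if [ "$ppid" = "1" ]; then
            echo "parent $pid"              # case 1: parent spinning
        else
            echo "child $pid"               # case 2: child gone awry
        fi
    done
}
```

A possible way to use it, with CHILD_PID standing for a PID reported as "child":

    ps -u sogo -o pid=,ppid=,pcpu= | classify_busy 90
    gdb -p CHILD_PID -batch -ex bt

Note that the %CPU column of ps is averaged over the lifetime of the process, so a child that only recently started spinning may need figures taken from top instead.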
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Hi Jan-Frode and the others that have replied,

Your suggestions have helped us enormously.

I have increased SxVMemLimit and WOWorkersCount (PREFORK), and this made sogo much more responsive in general. So far there have been no lockups, but it has only been running for half an hour or so.

I have also increased max_connections on our mysql database server, and will shortly put your script on the server as well.

For now, life seems better again... But I don't want to cheer too early...

Thanks for your valuable input so far!

MJ
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
On Thu, Apr 04, 2013 at 11:48:46AM +0200, Jan-Frode Myklebust wrote:
> Probably also good to enable some debugging with:
>
> sudo -u sogo defaults write sogod SOGoDebugRequests YES
>
> and see if the sogod.log tell you something..

We've often seen problems with sogod processes getting stuck and eating CPU, so we've implemented a watchdog that kills sogod processes that have been using too much CPU time. Every 5 minutes we run the following script:

8<-8<--8<---8<---8<8<---8<--8<-8<--
#! /bin/sh -
#
# Kill sogo processes that have been running too long.

too_long=15 # 00-59 minutes

# Only the minutes field of cputime (hh:mm:ss) is compared below.
ps -u sogo -opid,ppid,cputime | grep -v PPID | while read -r pid ppid time
do
    # Don't kill main daemon.
    if test "x$ppid" != "x1"
    then
        minutes=$(echo "$time" | cut -d: -f2)
        if test "$minutes" -gt "$too_long"; then
            echo "Killing $pid"
            ps -fp "$pid"
            kill -9 "$pid"
        fi
    fi
done
8<-8<--8<---8<---8<8<---8<--8<-8<--

This hasn't been triggering often with sogo v2, but we've had situations earlier where sogod would get stuck on unexpected data from the IMAP server. For example, sogod didn't like dovecot telling it the progress during IMAP searches and got stuck using 100% CPU whenever that happened.

-jf
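The watchdog above looks only at the minutes field of cputime, which is why its threshold is limited to 00-59. If the hours (and days) fields should count as well, the whole value can be converted to minutes first. This is a sketch under the assumption that ps prints cputime as [DD-]hh:mm:ss, as procps on Linux does; `cputime_minutes` is a hypothetical helper and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: convert a ps cputime value to whole minutes.
# Assumption: the format is [DD-]hh:mm:ss; seconds are ignored.
cputime_minutes() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) {
            m = ($1 * 24 + $2) * 60 + $3    # DD-hh:mm:ss
        } else {
            m = $1 * 60 + $2                # hh:mm:ss
        }
        print m
    }'
}
```

In the watchdog, the line that sets `minutes` could then call `cputime_minutes "$time"` instead of `cut`, so that a process with 01:02:00 of CPU time counts as 62 minutes rather than 2.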
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
On Thu, Apr 04, 2013 at 11:40:43AM +0200, mayak-cq wrote:
>
> sudo -u sogo defaults write sogod WOWorkersCount 32

Please remember to also increase the number of connections to your postgres database when changing the number of workers.

    postgresql max_connections > 3x WOWorkersCount

-jf
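As a worked example of the rule of thumb above: because max_connections must be strictly greater than three times the worker count, the smallest value that satisfies it is 3 x workers + 1. The helper name `min_db_connections` below is hypothetical, and the sketch leaves no headroom for other clients of the same database.

```shell
#!/bin/sh
# Hypothetical helper: smallest max_connections that satisfies
# "max_connections > 3 x WOWorkersCount".
min_db_connections() {
    workers=$1
    echo $(( workers * 3 + 1 ))
}
```

For the 32 workers suggested in this thread, `min_db_connections 32` gives 97. Any other application sharing the database server would need its own connections on top of that.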
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
It is probably also good to enable some debugging with:

    sudo -u sogo defaults write sogod SOGoDebugRequests YES

and see if the sogod.log tells you something.

-jf
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Hi mayak-cq,

I have just done that. Let's hope it will help... Thank you!

MJ

> hi jan,
>
> i'm not an expert, but i think that you could try the following, and see if it helps:
>
> sudo -u sogo defaults write sogod SxVMemLimit 1024
> sudo -u sogo defaults write sogod WOWorkersCount 32
>
> and restart the sogod afterwards.
>
> thanks
> m
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
On Thu, 2013-04-04 at 11:22 +0200, mourik jan heupink wrote:
> It just happened again, so rebooting the server does not solve the
> issues here. :-(
>
> I have no idea where to start troubleshooting this. I hope someone can
> help, fortunately our old system is still online, so we can go back
> quite easily. :-(
>

hi jan,

i'm not an expert, but i think that you could try the following, and see if it helps:

    sudo -u sogo defaults write sogod SxVMemLimit 1024
    sudo -u sogo defaults write sogod WOWorkersCount 32

and restart the sogod afterwards.

thanks
m
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
It just happened again, so rebooting the server does not solve the issues here. :-(

I have no idea where to start troubleshooting this. I hope someone can help. Fortunately our old system is still online, so we can go back quite easily. :-(

On 4/4/2013 11:00 AM, Thibault Le Meur wrote:
> I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
> Rebooting the server was the only way to get back to a working SOGo.
>
> However I'm not able to find out what is occurring to reach such a state.
>
> Regards,
> Thibault
>
> On 04/04/2013 10:29, mourik jan heupink wrote:
>> Perhaps some more info: debian wheezy, x64, sogo 2.0.4b from the repository, enough diskspace, enough cpu, enough memory.
>> [...]
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Hi Thibault, list,

I have just rebooted my server; let's hope that solves things more permanently. Did rebooting your server solve your issues permanently?

I hope someone here knows what the provided backtraces mean...

On 4/4/2013 11:00 AM, Thibault Le Meur wrote:
> I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
> Rebooting the server was the only way to get back to a working SOGo.
>
> However I'm not able to find out what is occurring to reach such a state.
>
> Regards,
> Thibault
Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b. Rebooting the server was the only way to get back to a working SOGo.

However, I have not been able to find out what happens to bring it into such a state.

Regards,
Thibault

On 04/04/2013 10:29, mourik jan heupink wrote:
> Perhaps some more info: debian wheezy, x64, sogo 2.0.4b from the repository, enough diskspace, enough cpu, enough memory.
> [...]
[SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Another backtrace while SOGo no longer responds:

#0  0x7f8090fe07fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x004aecde in ?? ()
#2  0x004af6f5 in ?? ()
#3  0x004b780a in ?? ()
#4  0x005e05b6 in target_wait ()
#5  0x005a1594 in wait_for_inferior ()
#6  0x005a0979 in proceed ()
#7  0x005991c8 in ?? ()
#8  0x00599203 in ?? ()
#9  0x004e4755 in ?? ()
#10 0x004e7794 in cmd_func ()
#11 0x0045ecbd in execute_command ()
#12 0x005bee23 in ?? ()
#13 0x005bf40e in ?? ()
#14 0x7f80929957fb in rl_callback_read_char () from /lib/x86_64-linux-gnu/li
#15 0x005be959 in ?? ()
#16 0x005bed35 in stdin_event_handler ()
#17 0x005bd8f3 in ?? ()
#18 0x005bcd9c in ?? ()
#19 0x005bce63 in gdb_do_one_event ()
#20 0x005bceb4 in start_event_loop ()
#21 0x005be983 in cli_command_loop ()
#22 0x005b770c in current_interp_command_loop ()
#23 0x00453ddb in ?? ()
#24 0x005b6f09 in catch_errors ()
#25 0x00454ea2 in ?? ()
#26 0x005b6f09 in catch_errors ()
#27 0x00454ed8 in gdb_main ()
#28 0x00453aea in main ()
(gdb) ^CQuit
[SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(
Perhaps some more info: debian wheezy, x64, sogo 2.0.4b from the repository, enough disk space, enough CPU, enough memory.

Actually, during the backtrace below we DID see

> The proxy server received an invalid response from an upstream server.
> The proxy server could not handle the request GET /SOGo.

however, CPU usage was not at 100% at that time.

I hope someone here can help us out...

On 4/4/2013 10:24 AM, mourik jan heupink wrote:
> Hi all,
>
> During testing sogo behaved perfectly, and yesterday evening we went live, and since my users are logging on, we are seeing 100% cpu usage, and I need to manually kill sogod etc, and the clients see:
>
>     The proxy server received an invalid response from an upstream server.
>     The proxy server could not handle the request GET /SOGo.
>
> I have attached to the running sogod process inside gdb, according to the instructions from the faq, and this is what I see:
>
> (gdb) bt
> #0  0x7f1cd38cc7fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x004aecde in ?? ()
> #2  0x004af6f5 in ?? ()
> #3  0x004b780a in ?? ()
> #4  0x005e05b6 in target_wait ()
> #5  0x005a1594 in wait_for_inferior ()
> #6  0x005a0979 in proceed ()
> #7  0x00599438 in continue_1 ()
> #8  0x005996af in continue_command ()
> #9  0x004e4755 in ?? ()
> #10 0x004e7794 in cmd_func ()
> #11 0x0045ecbd in execute_command ()
> #12 0x005bee23 in ?? ()
> #13 0x005bf40e in ?? ()
> #14 0x7f1cd52817fb in rl_callback_read_char () from /lib/x86_64-linux-gnu/libreadline.so.6
> #15 0x005be959 in ?? ()
> #16 0x005bed35 in stdin_event_handler ()
> #17 0x005bd8f3 in ?? ()
> #18 0x005bcd9c in ?? ()
> #19 0x005bce63 in gdb_do_one_event ()
> #20 0x005bceb4 in start_event_loop ()
> #21 0x005be983 in cli_command_loop ()
> #22 0x005b770c in current_interp_command_loop ()
> #23 0x00453ddb in ?? ()
> #24 0x005b6f09 in catch_errors ()
> #25 0x00454ea2 in ?? ()
> #26 0x005b6f09 in catch_errors ()
> #27 0x00454ed8 in gdb_main ()
> #28 0x00453aea in main ()
>
> Can anyone help me?? This situation is not so nice... :-)
>
> Kind regards,
> Mourik Jan