Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Thibault Le Meur


On 04/04/2013 13:23, Ludovic Marcotte wrote:

On 04/04/13 05:00, Thibault Le Meur wrote:

I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
Rebooting the server was the only way to get back to a working SOGo. 

You just lose evidence by doing this.


Yes, I know, but sometimes getting back to a working state quickly is more 
important :( It's a balance between resolving problems and resolving incidents ;-)



There are two possibilities for SOGo using 100% CPU:

 1. the *parent* process is trying to find a free child and all of
them are busy because of slow subsystems (LDAP, database, IMAP
server, SMTP server or even remote HTTP servers for remote ICS
subscriptions). If all children are busy, the parent process will
spin so quickly it'll consume 100% CPU, appearing stuck while it
isn't;
 2. a *child* process has gone awry because of a broken subsystem or a
bug in the code. Most of the time, it's due to unhandled IMAP
"traffic" (abrupt connection close due to server bugs, broken
server responses, broken mails not passed correctly by the
server, etc.). The IMAP code should be more resilient to this, but
sope-mime is just horrible, and should eventually be replaced by
the much cleaner Pantomime framework.

1. can be tuned quite easily, by carefully increasing the workers limit.


Thanks for the advice. I've seen in this list the following proposals:
sudo -u sogo defaults write sogod SxVMemLimit 1024
sudo -u sogo defaults write sogod WOWorkersCount 32

I just don't feel like this kind of setup is really safe... couldn't it 
result in very high memory consumption?
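
As a rough way to reason about that worry, the sketch below multiplies the two settings. It assumes that SxVMemLimit is a per-child ceiling expressed in megabytes and that every worker reaches the ceiling at the same moment, so it gives a pessimistic upper bound rather than typical usage. The function name worst_case_mb is made up for illustration.

```shell
#!/bin/sh
# Upper bound on sogod memory: every worker at its SxVMemLimit at once.
# Assumption: SxVMemLimit is a per-child limit expressed in megabytes.
worst_case_mb() {
    workers=$1      # value of WOWorkersCount
    limit_mb=$2     # value of SxVMemLimit
    echo $((workers * limit_mb))
}

# Example with the values proposed on the list:
worst_case_mb 32 1024    # prints 32768 (MB, i.e. 32 GB)
```

If the result is far above the RAM of the machine, lowering one of the two values (or watching real per-child memory with ps before raising them) is probably the safer route.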




2. is a bug. When it happens, simply attach to the *child* process and 
produce a stack trace. Then, file a bug report with all the relevant 
data, including the culprit email message (which can be found in the 
sogo.log file). All of this is documented here: 
http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html




Okay, I'll try this next time.

Thanks again for your useful tips,
Thibault

--
users@sogo.nu
https://inverse.ca/sogo/lists

Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Hi Ludovic, list,


1. can be tuned quite easily, by carefully increasing the workers limit.
Following suggestions from the list, I have increased workers to 32, so 
this is likely not the issue anymore, I guess.



2. is a bug. When it happens, simply attach to the *child* process and
produce a stack trace. Then, file a bug report with all the relevant
data, including the culprit email message (which can be found in the
sogo.log file). All of this is documented here:
http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html
I'm running sogo now again the 'normal' way, not under gdb. I guess I 
need to run it again under gdb.


I did that, and provided (what I think were) two backtraces earlier 
today. But I guess now that those were not what is needed...?

The FAQ article talks about a backtrace, not a stack trace?

Anyway, thanks for the response!

MJ


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Ludovic Marcotte

On 04/04/13 05:00, Thibault Le Meur wrote:

I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
Rebooting the server was the only way to get back to a working SOGo. 
You just lose evidence by doing this. There are two possibilities for 
SOGo using 100% CPU:


1. the *parent* process is trying to find a free child and all of them
   are busy because of slow subsystems (LDAP, database, IMAP server,
   SMTP server or even remote HTTP servers for remote ICS
   subscriptions). If all children are busy, the parent process will
   spin so quickly it'll consume 100% CPU, appearing stuck while it
   isn't;
2. a *child* process has gone awry because of a broken subsystem or a
   bug in the code. Most of the time, it's due to unhandled IMAP
   "traffic" (abrupt connection close due to server bugs, broken server
   responses, broken mails not passed correctly by the server,
   etc.). The IMAP code should be more resilient to this, but sope-mime
   is just horrible, and should eventually be replaced by the much
   cleaner Pantomime framework.

1. can be tuned quite easily, by carefully increasing the workers limit.

2. is a bug. When it happens, simply attach to the *child* process and 
produce a stack trace. Then, file a bug report with all the relevant 
data, including the culprit email message (which can be found in the 
sogo.log file). All of this is documented here: 
http://www.sogo.nu/english/nc/support/faq/article/how-do-i-debug-sogo.html
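
A minimal sketch of that procedure, assuming procps `ps` and a gdb that accepts `-p`, `-batch` and `-ex`. The helper name busiest_child is hypothetical, and the parent is recognised by PPID 1, an assumption borrowed from the watchdog script posted in this thread. The helper only picks a PID; the actual attach is left commented out.

```shell
#!/bin/sh
# Read "PID PPID %CPU" lines on stdin and print the PID of the busiest
# process whose parent is not init (PPID 1), i.e. a sogod child.
busiest_child() {
    awk 'BEGIN { max = -1 }
         $2 != 1 && ($3 + 0) > max { max = $3 + 0; pid = $1 }
         END { if (pid != "") print pid }'
}

# Usage on a live system (commented out so the sketch has no side effects):
#   pid=$(ps -u sogo -o pid=,ppid=,pcpu= | busiest_child)
#   gdb -p "$pid" -batch -ex "thread apply all bt" > /tmp/sogod-bt.txt
```

A backtrace whose frames are named gdb_main, cli_command_loop or wait_for_inferior comes from gdb itself rather than from sogod, so it is worth double-checking which PID was attached before filing the report.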


Thanks,

--
Ludovic Marcotte
+1.514.755.3630  ::  www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence 
(www.packetfence.org)


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Hi Jan-Frode and the others that have replied,

Your suggestions have helped us enormously. I have increased SxVMemLimit 
and WOWorkersCount (PREFORK), and this made SOGo much more responsive 
generally. So far: no lockups, but this has only been running for half an 
hour or so.


I have also increased the max_connections to our mysql database server, 
and will shortly also put your script on the server.


For now, life seems better again... But I don't want to cheer too early...

Thanks for your valuable input so far!

MJ


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Jan-Frode Myklebust
On Thu, Apr 04, 2013 at 11:48:46AM +0200, Jan-Frode Myklebust wrote:
> Probably also good to enable some debugging with:
> 
>   sudo -u sogo defaults write sogod SOGoDebugRequests YES
> 
> and see if the sogod.log tells you something..

We've often seen problems with sogod processes getting stuck, eating
CPU, so we've implemented a watchdog that kills sogod processes
that have been using too much CPU time.

Every 5 minutes we run the following script:


8<-8<--8<---8<---8<8<---8<--8<-8<--
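#! /bin/sh -
#
# Kill sogod processes that have been using CPU for too long.

too_long=15 # minutes of CPU time (only the MM field, 00-59, is checked)

# Skip the ps header line, then read one process per line.
ps -u sogo -opid,ppid,cputime | grep -v PPID | while read pid ppid time
do
    # Don't kill the main daemon (its parent is init, PID 1).
    if test "x$ppid" != "x1"
    then
        # cputime is [DD-]HH:MM:SS; field 2 is the minutes part.
        minutes=$(echo "$time" | cut -d: -f2)
        if test "$minutes" -gt "$too_long"
        then
            echo "Killing $pid"
            ps -fp "$pid"
            kill -9 "$pid"
        fi
    fi
done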
8<-8<--8<---8<---8<8<---8<--8<-8<--

This hasn't been triggering often with SOGo v2, but we've had situations
earlier where sogod would get stuck on unexpected data from the IMAP
server. For example, sogod didn't like Dovecot reporting progress during
IMAP searches and got stuck using 100% CPU whenever that happened.
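
The script above looks only at the minutes field of cputime, so a process that has accumulated 1:05:00 of CPU time would read as 5 minutes. With a 5-minute cron interval that is unlikely to matter, but a hedged variant could convert the whole value to minutes first. The sketch assumes the procps format [DD-]HH:MM:SS; the function name cputime_minutes is made up for illustration.

```shell
#!/bin/sh
# Convert a ps cputime value ([DD-]HH:MM:SS) to whole minutes.
# awk reads the zero-padded fields as decimal numbers, so "08" is 8.
cputime_minutes() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) { d = $1; h = $2; m = $3 }   # DD-HH:MM:SS
        else         { d = 0;  h = $1; m = $2 }   # HH:MM:SS
        print d * 1440 + h * 60 + m
    }'
}

cputime_minutes 01:05:00    # prints 65
```

In the watchdog loop, `minutes=$(cputime_minutes "$time")` would then replace the `cut` call, leaving the rest of the script unchanged.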


  -jf


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Jan-Frode Myklebust
On Thu, Apr 04, 2013 at 11:40:43AM +0200, mayak-cq wrote:
> 
> sudo -u sogo defaults write sogod WOWorkersCount 32

Please remember to also increase the number of connections to your
postgres database when changing the number of workers.

postgresql max_connections > 3x WOWorkersCount
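
As a quick worked example of that rule of thumb, the sketch below prints the smallest integer that satisfies it for a given worker count. The function name min_db_connections is hypothetical, and the result is only a floor: any other clients of the same database need headroom on top of it.

```shell
#!/bin/sh
# Smallest integer max_connections satisfying
#   max_connections > 3 x WOWorkersCount
min_db_connections() {
    workers=$1
    echo $((3 * workers + 1))
}

min_db_connections 32    # prints 97
```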

  -jf


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Jan-Frode Myklebust
Probably also good to enable some debugging with:

sudo -u sogo defaults write sogod SOGoDebugRequests YES

and see if the sogod.log tells you something..



   -jf


Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Hi mayak-cq,

I have just done that. Let's hope it will help...

Thank you!

MJ



hi jan,

i'm not an expert, but i think that you could try the following, and see
if it helps:

sudo -u sogo defaults write sogod SxVMemLimit 1024
sudo -u sogo defaults write sogod WOWorkersCount 32

and restart the sogod afterwards.

thanks

m



Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mayak-cq
On Thu, 2013-04-04 at 11:22 +0200, mourik jan heupink wrote:

> It just happened again, so rebooting the server does not solve the 
> issues here. :-(
> 
> I have no idea where to start troubleshooting this. I hope someone can 
> help, fortunately our old system is still online, so we can go back 
> quite easily. :-(
> 




hi jan,

i'm not an expert, but i think that you could try the following, and see
if it helps:

sudo -u sogo defaults write sogod SxVMemLimit 1024
sudo -u sogo defaults write sogod WOWorkersCount 32

and restart the sogod afterwards.

thanks

m

Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink
It just happened again, so rebooting the server does not solve the 
issues here. :-(


I have no idea where to start troubleshooting this. I hope someone can 
help. Fortunately our old system is still online, so we can go back 
quite easily. :-(


On 4/4/2013 11:00 AM, Thibault Le Meur wrote:

I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
Rebooting the server was the only way to get back to a working SOGo.

However, I'm not able to find out what is occurring to reach such a state.

Regards,
Thibault


On 04/04/2013 10:29, mourik jan heupink wrote:

Perhaps some more info:

debian wheezy, x64, sogo 2.0.4b from the repository, enough diskspace,
enough cpu, enough memory.

Actually during the backtrace from below we DID see
> The proxy server received an invalid response from an upstream server.
> The proxy server could not handle the request GET /SOGo.
however, cpu usage was not 100 at that time.

I hope someone here can help us out...

On 4/4/2013 10:24 AM, mourik jan heupink wrote:

Hi all,

During testing sogo behaved perfectly, and yesterday evening we went
live, and since my users are logging on, we are seeing 100% cpu usage,
and I need to manually kill sogod etc, and the clients see:

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /SOGo.

I have attached to the running sogod process inside gdb, according to
the instructions from the faq, and this is what I see:

(gdb) bt
#0  0x7f1cd38cc7fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x004aecde in ?? ()
#2  0x004af6f5 in ?? ()
#3  0x004b780a in ?? ()
#4  0x005e05b6 in target_wait ()
#5  0x005a1594 in wait_for_inferior ()
#6  0x005a0979 in proceed ()
#7  0x00599438 in continue_1 ()
#8  0x005996af in continue_command ()
#9  0x004e4755 in ?? ()
#10 0x004e7794 in cmd_func ()
#11 0x0045ecbd in execute_command ()
#12 0x005bee23 in ?? ()
#13 0x005bf40e in ?? ()
#14 0x7f1cd52817fb in rl_callback_read_char () from /lib/x86_64-linux-gnu/libreadline.so.6
#15 0x005be959 in ?? ()
#16 0x005bed35 in stdin_event_handler ()
#17 0x005bd8f3 in ?? ()
#18 0x005bcd9c in ?? ()
#19 0x005bce63 in gdb_do_one_event ()
#20 0x005bceb4 in start_event_loop ()
#21 0x005be983 in cli_command_loop ()
#22 0x005b770c in current_interp_command_loop ()
#23 0x00453ddb in ?? ()
#24 0x005b6f09 in catch_errors ()
#25 0x00454ea2 in ?? ()
#26 0x005b6f09 in catch_errors ()
#27 0x00454ed8 in gdb_main ()
#28 0x00453aea in main ()

Can anyone help me?? This situation is not so nice... :-)

Kind regards,
Mourik Jan





Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Hi Thibault, list,

I have just rebooted my server, let's hope that solves things more 
permanently?


Rebooting your server solved your issues permanently..?

I hope someone here knows what the provided backtraces mean...

On 4/4/2013 11:00 AM, Thibault Le Meur wrote:

I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
Rebooting the server was the only way to get back to a working SOGo.

However, I'm not able to find out what is occurring to reach such a state.

Regards,
Thibault



Re: [SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread Thibault Le Meur

I've had the same issue sometimes on squeeze, x64, SOGo 2.0.4b.
Rebooting the server was the only way to get back to a working SOGo.

However, I'm not able to find out what is occurring to reach such a state.

Regards,
Thibault


On 04/04/2013 10:29, mourik jan heupink wrote:

Perhaps some more info:

debian wheezy, x64, sogo 2.0.4b from the repository, enough diskspace, 
enough cpu, enough memory.


Actually during the backtrace from below we DID see
> The proxy server received an invalid response from an upstream server.
> The proxy server could not handle the request GET /SOGo.
however, cpu usage was not 100 at that time.

I hope someone here can help us out...

On 4/4/2013 10:24 AM, mourik jan heupink wrote:

Hi all,

During testing sogo behaved perfectly, and yesterday evening we went
live, and since my users are logging on, we are seeing 100% cpu usage,
and I need to manually kill sogod etc, and the clients see:

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /SOGo.

I have attached to the running sogod process inside gdb, according to
the instructions from the faq, and this is what I see:

(gdb) bt
#0  0x7f1cd38cc7fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x004aecde in ?? ()
#2  0x004af6f5 in ?? ()
#3  0x004b780a in ?? ()
#4  0x005e05b6 in target_wait ()
#5  0x005a1594 in wait_for_inferior ()
#6  0x005a0979 in proceed ()
#7  0x00599438 in continue_1 ()
#8  0x005996af in continue_command ()
#9  0x004e4755 in ?? ()
#10 0x004e7794 in cmd_func ()
#11 0x0045ecbd in execute_command ()
#12 0x005bee23 in ?? ()
#13 0x005bf40e in ?? ()
#14 0x7f1cd52817fb in rl_callback_read_char () from /lib/x86_64-linux-gnu/libreadline.so.6
#15 0x005be959 in ?? ()
#16 0x005bed35 in stdin_event_handler ()
#17 0x005bd8f3 in ?? ()
#18 0x005bcd9c in ?? ()
#19 0x005bce63 in gdb_do_one_event ()
#20 0x005bceb4 in start_event_loop ()
#21 0x005be983 in cli_command_loop ()
#22 0x005b770c in current_interp_command_loop ()
#23 0x00453ddb in ?? ()
#24 0x005b6f09 in catch_errors ()
#25 0x00454ea2 in ?? ()
#26 0x005b6f09 in catch_errors ()
#27 0x00454ed8 in gdb_main ()
#28 0x00453aea in main ()

Can anyone help me?? This situation is not so nice... :-)

Kind regards,
Mourik Jan




[SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Another backtrace while SOGo no longer responds:

#0  0x7f8090fe07fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x004aecde in ?? ()
#2  0x004af6f5 in ?? ()
#3  0x004b780a in ?? ()
#4  0x005e05b6 in target_wait ()
#5  0x005a1594 in wait_for_inferior ()
#6  0x005a0979 in proceed ()
#7  0x005991c8 in ?? ()
#8  0x00599203 in ?? ()
#9  0x004e4755 in ?? ()
#10 0x004e7794 in cmd_func ()
#11 0x0045ecbd in execute_command ()
#12 0x005bee23 in ?? ()
#13 0x005bf40e in ?? ()
#14 0x7f80929957fb in rl_callback_read_char () from 
/lib/x86_64-linux-gnu/li

#15 0x005be959 in ?? ()
#16 0x005bed35 in stdin_event_handler ()
#17 0x005bd8f3 in ?? ()
#18 0x005bcd9c in ?? ()
#19 0x005bce63 in gdb_do_one_event ()
#20 0x005bceb4 in start_event_loop ()
#21 0x005be983 in cli_command_loop ()
#22 0x005b770c in current_interp_command_loop ()
#23 0x00453ddb in ?? ()
#24 0x005b6f09 in catch_errors ()
#25 0x00454ea2 in ?? ()
#26 0x005b6f09 in catch_errors ()
#27 0x00454ed8 in gdb_main ()
#28 0x00453aea in main ()
(gdb) ^CQuit



[SOGo] Re: just gone live with sogo, and keep getting 100% cpu usage... :-(

2013-04-04 Thread mourik jan heupink

Perhaps some more info:

debian wheezy, x64, sogo 2.0.4b from the repository, enough diskspace, 
enough cpu, enough memory.


Actually during the backtrace from below we DID see
> The proxy server received an invalid response from an upstream server.
> The proxy server could not handle the request GET /SOGo.
however, cpu usage was not 100 at that time.

I hope someone here can help us out...

On 4/4/2013 10:24 AM, mourik jan heupink wrote:

Hi all,

During testing sogo behaved perfectly, and yesterday evening we went
live, and since my users are logging on, we are seeing 100% cpu usage,
and I need to manually kill sogod etc, and the clients see:

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /SOGo.

I have attached to the running sogod process inside gdb, according to
the instructions from the faq, and this is what I see:

(gdb) bt
#0  0x7f1cd38cc7fa in sigsuspend () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x004aecde in ?? ()
#2  0x004af6f5 in ?? ()
#3  0x004b780a in ?? ()
#4  0x005e05b6 in target_wait ()
#5  0x005a1594 in wait_for_inferior ()
#6  0x005a0979 in proceed ()
#7  0x00599438 in continue_1 ()
#8  0x005996af in continue_command ()
#9  0x004e4755 in ?? ()
#10 0x004e7794 in cmd_func ()
#11 0x0045ecbd in execute_command ()
#12 0x005bee23 in ?? ()
#13 0x005bf40e in ?? ()
#14 0x7f1cd52817fb in rl_callback_read_char () from /lib/x86_64-linux-gnu/libreadline.so.6
#15 0x005be959 in ?? ()
#16 0x005bed35 in stdin_event_handler ()
#17 0x005bd8f3 in ?? ()
#18 0x005bcd9c in ?? ()
#19 0x005bce63 in gdb_do_one_event ()
#20 0x005bceb4 in start_event_loop ()
#21 0x005be983 in cli_command_loop ()
#22 0x005b770c in current_interp_command_loop ()
#23 0x00453ddb in ?? ()
#24 0x005b6f09 in catch_errors ()
#25 0x00454ea2 in ?? ()
#26 0x005b6f09 in catch_errors ()
#27 0x00454ed8 in gdb_main ()
#28 0x00453aea in main ()

Can anyone help me?? This situation is not so nice... :-)

Kind regards,
Mourik Jan
