Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Mon, Dec 21, 2009 at 3:36 PM, Martin Langhoff wrote: > On Mon, Dec 21, 2009 at 3:32 PM, Martin Langhoff > wrote: >> I've added a big lock around the process, so from now on Moodle >> processes won't overlap in this sync. This means that your server is >> now running a lightly patched Moodle -- I will release this as a new >> rpm soon. > > Filed as http://dev.laptop.org/ticket/9922 - And fixed. There is now a fixed moodle-xs rpm, so yum --enablerepo=olpcxs-testing update moodle-xs will give you the Moodle with the relevant fix. And I have posted to ejabberd list as well (as you've probably seen). cheers, m -- martin.langh...@gmail.com mar...@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ___ Server-devel mailing list Server-devel@lists.laptop.org http://lists.laptop.org/listinfo/server-devel
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Mon, Dec 21, 2009 at 4:18 PM, crodas wrote:
> According to the ticket, the solution is a lock file which prevents
> re-execution. Well, I wrote a dumb script a while ago that might help;
> it's not innovative, but it might help:

Thanks! The lock I coded uses a Moodle-specific bit of code, so it's in PHP, using Moodle internal calls, and it relies on PostgreSQL's atomicity to achieve this in a non-racy way. This has the advantage of locking a smaller bit of code -- not the whole 'cron' run.

BTW, the lock script you're using is a bit racy: another instance can start between the if [ -f $LOCK ] check and the point where the PID is written to the lock file. If you want to do this safely in shell scripting, use dotlockfile, which is widely available, even on oldish Linuxes. Recent Linuxes all include /usr/bin/flock, which is also sanely atomic, so that's what we use on the XS for this task...

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
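For readers following along, the flock approach mentioned in this message could look roughly like the sketch below. It is only an illustration, not the code shipped on the XS: the helper name run_locked and the lock path in the usage note are made up for the example, and it assumes the flock(1) utility from util-linux is installed.

```shell
#!/bin/bash
# Minimal sketch, assuming flock(1) from util-linux is available.
# run_locked is a hypothetical helper name, not part of the XS packages.
#
# Usage: run_locked LOCKFILE COMMAND [ARGS...]
# Runs COMMAND only if no other process currently holds a lock on LOCKFILE.
# The kernel grants the lock atomically, so there is no gap between
# "check" and "create" as there is with a test on a PID file.
run_locked() {
    local lock=$1
    shift
    # -n: do not wait; fail right away (exit status 1) if the lock is held.
    flock -n "$lock" "$@"
}
```

A cron entry could then call something like `run_locked /var/lock/my-sync.lock my_sync_command` (both names are placeholders). An overlapping second run exits immediately instead of piling up behind the first one.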
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
Hello,

According to the ticket, the solution is a lock file which prevents re-execution. Well, I wrote a dumb script a while ago that might help; it's not innovative, but it might help:

    #!/bin/bash
    LOCK=/tmp/erlang.lock
    CMD=$1
    # If a lock file exists, check whether the PID stored in it is still alive.
    if [ -f $LOCK ]
    then
        PID=`cat $LOCK`
        # ps prints a header line plus one line per matching process.
        UP=`ps $PID | wc -l`
        if [ $UP -gt 1 ]
        then
            exit;
        fi
    fi
    # Record our own PID and run the command.
    echo $$ > $LOCK
    $CMD

If it is worth adding to the repository, please let me know and I will submit a patch.

Cheers,

On Mon, 21 Dec 2009 15:36:50 +0100, Martin Langhoff wrote:
> On Mon, Dec 21, 2009 at 3:32 PM, Martin Langhoff wrote:
>> I've added a big lock around the process, so from now on Moodle
>> processes won't overlap in this sync. This means that your server is
>> now running a lightly patched Moodle -- I will release this as a new
>> rpm soon.
>
> Filed as http://dev.laptop.org/ticket/9922 -
>
> cheers,
>
> m
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
Ok then. Thanks a lot for the assistance. Things seem to be back to normal. I will look closer tomorrow when the kids are here.
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Mon, Dec 21, 2009 at 3:32 PM, Martin Langhoff wrote:
> I've added a big lock around the process, so from now on Moodle
> processes won't overlap in this sync. This means that your server is
> now running a lightly patched Moodle -- I will release this as a new
> rpm soon.

Filed as http://dev.laptop.org/ticket/9922 -

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Mon, Dec 21, 2009 at 3:14 PM, Martin Langhoff wrote:
> Now it's up on a pristine state, and I am monitoring it...

Ok - the problem seems related to Moodle's control of the ejabberd presence service. The sync between Moodle and ejabberd data (in mnesia) was taking too long, and a second Moodle sync process would start... and then a 3rd... and then...

This led to errors that should be benign (an error reported in the logs, but not leading to a functional problem) -- because ejabberd's internals are all about supporting things that happen concurrently. But! Something inside ejabberd isn't liking the concurrency.

I've added a big lock around the process, so from now on Moodle processes won't overlap in this sync. This means that your server is now running a lightly patched Moodle -- I will release this as a new rpm soon.

According to ps_mem.py, beam started at 14MB and has now grown to 16MB; this is with no users connected. In normal operation (once users connect), I would expect it to grow to ~40MB.

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Sun, Dec 20, 2009 at 12:57 PM, Martin Langhoff wrote:
> Yep, I am interested in getting to the bottom of this.

I think I have an initial assessment of the situation. Clearly, the mnesia DB got corrupted somehow. Because of that...

 - the init script cannot stop ejabberd normally...
 - killall -9 beam kills the beam processes, which get restarted right away (such is the magic of erlang's engine "failsafe design") by epmd...
 - Moodle's cronjob talks to ejabberd every 5 minutes. When ejabberd is broken, you get a pileup of php scripts trying to run ejabberdctl again and again.

So your attempts to follow my instructions (stop ejabberd, remove corrupt DB, start it again) didn't succeed.

Now it's up on a pristine state, and I am monitoring it...

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Sat, Dec 19, 2009 at 7:32 PM, Devon Connolly wrote:
>> - Is there any disk anomaly? (Reboot forcing a fsck?)
>
> Not that I've noticed.

Ok, but can you try doing a reboot that forces fsck? As follows:

    touch /forcefsck
    reboot

or

    shutdown -Fr now

> Verify checked out on the ejabberd-xs package.

There might be something with the erlang binaries?

> There isn't much sense in reposting the results of the script, as the
> results are essentially the same. As ejabberd is crashing, I cannot kill
> it to reapply the domain change. I can set you up an ssh account so you
> can get a look at what is going on. Perhaps you will see something I am
> overlooking. Let me know and I will send you the info.

Yep, I am interested in getting to the bottom of this. You'll see a private email from me soon.

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
> - Is there any disk anomaly? (Reboot forcing a fsck?)

Not that I've noticed.

> - Is there any problem in the binaries? If you run rpm with the
> 'verify' options, it'll check that no binaries have been corrupted
> on-disk... It's normal to see some config files changed, but no
> binaries should be different from the rpms.

Verify checked out on the ejabberd-xs package.

There isn't much sense in reposting the results of the script, as the results are essentially the same. As ejabberd is crashing, I cannot kill it to reapply the domain change. I can set you up an ssh account so you can get a look at what is going on. Perhaps you will see something I am overlooking. Let me know and I will send you the info.
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Sat, Dec 19, 2009 at 1:31 PM, Devon Connolly wrote:
> Changing the domain, I still get the following error when it tries (and
> fails) to shut down ejabberd.

As it doesn't stop cleanly, shut down ejabberd by hand, kill -9 it if needed, and then change the domain twice to clear the DB. Then start it up by hand.

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Sat, Dec 19, 2009 at 1:31 PM, Devon Connolly wrote:
> Beam is still consuming 100% of the cpu after a few minutes. I'm going to
> leave that script running to see what it does over the next few hours.

That's really abnormal.

 - Is there any disk anomaly? (Reboot forcing a fsck?)
 - Is there any problem in the binaries? If you run rpm with the 'verify' options, it'll check that no binaries have been corrupted on-disk... It's normal to see some config files changed, but no binaries should be different from the rpms.

> I imagine I now have to re-register all XO's?

Nope. The DB gets rebuilt automagically for you, 100%, on XS-0.6.

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
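The rpm verify check suggested in this message could be run roughly as in the sketch below. It relies on the `rpm -V` output format, where changed config files carry a "c" marker in the second column. The helper names are invented for the example; the package name in the usage note is the one mentioned in the thread.

```shell
#!/bin/bash
# Hypothetical helper: reads `rpm -V` output on stdin and prints only the
# entries that are NOT marked as config files ("c" in the second column).
non_config_changes() {
    awk '$2 != "c"'
}

# Verify one package and show only the unexpected differences.
# Assumes rpm is installed; PACKAGE is whatever you want checked.
verify_package() {
    local pkg=$1
    rpm -V "$pkg" | non_config_changes
}
```

Running `verify_package ejabberd-xs` and getting no output would mean that, at most, config files differ from what the rpm shipped. Any line that does show up points at a file worth a closer look.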
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
Changing the domain, I still get the following error when it tries (and fails to shutdown ejabberd). ___ Crash dump was written to: erl_crash.dump Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}) {error_logger,{{2009,12,19},{12,19,16}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]} {error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,<0.20.0>},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,<0.8.0>]},{messages,[]},{links,[#Port<0.84>,<0.17.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,505}],[]]} {error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]} {error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]} 
{error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,<0.7.0>},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[<0.6.0>]},{messages,[{'EXIT',<0.8.0>,normal}]},{links,[<0.6.0>,<0.5.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]} {error_logger,{{2009,12,19},{12,19,16}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]} {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"} Crash dump was written to: erl_crash.dump Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}) __ Beam is still consuming 100% of the cpu after a few minutes. I'm going to leave that script running to see what it does over the next few hours. I imagine I now have to re-register all XO's? On Sat, Dec 19, 2009 at 10:59 AM, Devon Connolly wrote: > > Here is another example after it has been running all night. > > http://pastebin.com/m11537281 > > As you can see, these runaway beam processes vary greatly in there RAM > usage. Also, they are always using 100% of the cpu. > > I will try to clear the DB now and see what happens. > > > > On Fri, Dec 18, 2009 at 12:51 PM, Martin Langhoff < > martin.langh...@gmail.com> wrote: > >> On Fri, Dec 18, 2009 at 1:37 PM, Devon Connolly wrote: >> > Anyway, back on topic... Here is that script slightly modified running >> on >> > a fresh boot. I'm going to leave this looping and post the file to >> > pastebin. Here is an initial output after only like 10 minutes. It >> will >> > get more interesting over time. I'll paste another later this >> afternoon. >> >> outrageous. beam should have only ~40MB in use, total. 
>> >> if you 'clear' the mnesia db as i suggested (keep a copy for >> forensics!), does it get better? >> >> >> >> m >> -- >> martin.langh...@gmail.com >> mar...@laptop.org -- School Server Architect >> - ask interesting questions >> - don't get distracted with shiny stuff - working code first >> - http://wiki.laptop.org/go/User:Martinlanghoff >> > > > ___ Server-devel mailing list Server-devel@lists.laptop.org http://lists.laptop.org/listinfo/server-devel
[Server-devel] Ejabberd CPU/RAM Spike -> Crashes
Here is another example after it has been running all night.

http://pastebin.com/m11537281

As you can see, these runaway beam processes vary greatly in their RAM usage. Also, they are always using 100% of the cpu.

I will try to clear the DB now and see what happens.

On Fri, Dec 18, 2009 at 12:51 PM, Martin Langhoff wrote:
> On Fri, Dec 18, 2009 at 1:37 PM, Devon Connolly wrote:
> > Anyway, back on topic... Here is that script slightly modified running on
> > a fresh boot. I'm going to leave this looping and post the file to
> > pastebin. Here is an initial output after only like 10 minutes. It will
> > get more interesting over time. I'll paste another later this afternoon.
>
> outrageous. beam should have only ~40MB in use, total.
>
> if you 'clear' the mnesia db as i suggested (keep a copy for
> forensics!), does it get better?
>
> m
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Fri, Dec 18, 2009 at 1:37 PM, Devon Connolly wrote:
> Anyway, back on topic... Here is that script slightly modified running on
> a fresh boot. I'm going to leave this looping and post the file to
> pastebin. Here is an initial output after only like 10 minutes. It will
> get more interesting over time. I'll paste another later this afternoon.

outrageous. beam should have only ~40MB in use, total.

if you 'clear' the mnesia db as i suggested (keep a copy for forensics!), does it get better?

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
> Don't reinstall. If possible, let's try to debug this. If you're going
> to give up, just
>
> 1 - Backup /var/lib/ejabberd -- just tar it up
> 2 - Use the 'domain_config' script to change the domain -- this will
> re-generate the ejabberd mnesia database. What I'd do: change it to
> 'foo.com' and then back to the right domain.

I'd like to debug, but I only have about a week left here, so I need the server to be stable before I leave. I can debug for a while, but as we approach the holidays, I may need to throw in the towel.

> I assume you have the different APs in different channels, and
> generally avoid channel 1 (as that's where XOs engage in 'mesh' by
> default...)...

What we really need is an RF site survey. Unfortunately, there is nobody around who can do one. The APs are on different channels, but I am forced to use all 3 channels in such a small space. We also have some rude neighbors who decided to amplify their WiFi on channel 6, essentially blanketing the school with interference on that channel. So I have 1 AP on channel 6, 2 on channel 1, and 2 on channel 11.

Anyway, back on topic... Here is that script slightly modified running on a fresh boot. I'm going to leave this looping and post the file to pastebin. Here is an initial output after only like 10 minutes. It will get more interesting over time. I'll paste another later this afternoon.

http://pastebin.com/m3426a094
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Thu, Dec 17, 2009 at 9:32 PM, Devon Connolly wrote:
> The server had an uptime of about 50 days before this occurred. There were
> no problems and nothing has changed in the 2 or so days since this problem
> began. Like I had said previously, it seems to have occurred since reflashing
> and re-registering a student's XO, but I believe that to be a coincidence.

Hmmm, maybe something's gone wonky on the mnesia DB.

> We are using 5 wireless AP's. 4 of which are Linksys WRT54G's running
> DD-WRT and one is a D-Link modem/AP combo. DHCP is deactivated on all of
> the above.

Good.

>> - Did you also leave XOs running connected to it, or were XOs
>> completely disconnected?
>
> I believe all XO's were disconnected. It is possible some were left
> connected while in their charging cabinets, but doubtful.

Ok. Then ejabberd is getting messed up all on its own...

> Nothing non-standard really. eth0 is fixed.

good

> Although, this server came
> pre-installed from the folks involved with the Give One Get One program in
> Rwanda. I'm not sure what was modified from the stock server install. I am
> debating reinstalling the server from scratch.

Don't reinstall. If possible, let's try to debug this. If you're going to give up, just

 1 - Backup /var/lib/ejabberd -- just tar it up
 2 - Use the 'domain_config' script to change the domain -- this will re-generate the ejabberd mnesia database. What I'd do: change it to 'foo.com' and then back to the right domain.

> I attribute this behavior to the Linksys AP's as they only seem to
> handle about 20 connections per AP reliably.

yeah. we've seen that plenty.

> There is also a good amount of
> wireless interference to contend with; however, the server was working
> well.

I assume you have the different APs in different channels, and generally avoid channel 1 (as that's where XOs engage in 'mesh' by default...)...

>>while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
>>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;
>
> Tried the script at night with the high load, and it cannot complete as the
> ejabberd node has since crashed. ejabberdctl yields the following error:

Can you restart ejabberd and try that script?

> # ps_mem.py | grep ejabberd
>
> No output

Did you download ps_mem.py, and make it executable? (google the name if needed) If so, you might want to grep for erl instead.

> I've included a screenshot of htop for your viewing pleasure.
> http://omploader.org/vMzBvZQ/htop_screen.jpg

ejabberd sure looks busy there...

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
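The backup step in this message ("just tar it up") might be done along the lines of the sketch below. The helper name backup_dir, the timestamped file name and the destination directory are assumptions made for the example; only the /var/lib/ejabberd path comes from the thread.

```shell
#!/bin/bash
# Hypothetical helper: archive a directory into a timestamped tarball.
# Usage: backup_dir SOURCE_DIR DEST_DIR   (prints the archive path on success)
backup_dir() {
    local src=$1 dest=$2
    local name stamp archive
    name=$(basename "$src")
    stamp=$(date -u +%Y%m%d-%H%M%S)
    archive="$dest/$name-$stamp.tar.gz"
    # -C keeps the paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    echo "$archive"
}
```

For example, `backup_dir /var/lib/ejabberd /root` would leave a copy of the mnesia files for later forensics before the domain is changed and the database regenerated. Taking the copy while ejabberd is stopped is probably the safer choice.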
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
The server had an uptime of about 50 days before this occurred. There were no problems and nothing has changed in the 2 or so days since this problem began. Like I had said previously, it seems to have occurred since reflashing and re-registering a student's XO, but I believe that to be a coincidence.

> - Are you perhaps using an AP that does its own DHCP? One way to
> check for certain is to connect an XO, and then grep /var/lib/dhcpd/
> (or is it /var/spool/dhcpd/ ?) for the MAC address of the XO

We are using 5 wireless AP's. 4 of which are Linksys WRT54G's running DD-WRT and one is a D-Link modem/AP combo. DHCP is deactivated on all of the above.

> - Did you also leave XOs running connected to it, or were XOs
> completely disconnected?

I believe all XO's were disconnected. It is possible some were left connected while in their charging cabinets, but doubtful.

>Is there anything else that could be odd or non-standard in your
>setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
>short lease? Is there anything in the network between the XOs and the
>XS?

Nothing non-standard really. eth0 is fixed. Although, this server came pre-installed from the folks involved with the Give One Get One program in Rwanda. I'm not sure what was modified from the stock server install. I am debating reinstalling the server from scratch.

I haven't been paying as much attention to the server lately as I should. As it had been running for about 50 days, I only checked in with the school periodically. There were problems, but mainly in relation to the presence service and reliably connecting 30 - 100 laptops to the network at one time. I attribute this behavior to the Linksys AP's as they only seem to handle about 20 connections per AP reliably. There is also a good amount of wireless interference to contend with; however, the server was working well. As it is a bit under-powered, load averages generally stay within the 1.2-1.5 range.
As I write this, the server has an uptime of about 9 hours. Load averages have reached 25 across the board. The dump files have consumed over a gig of space, filling up the root partition.

>while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

Tried the script at night with the high load, and it cannot complete as the ejabberd node has since crashed. ejabberdctl yields the following error:

    RPC failed on the node ejabb...@schoolserver: {'EXIT',
      {badarg,
        [{ets,lookup, [hooks, {ejabberd_ctl_process, global}]},
         {ejabberd_hooks,run_fold,4},
         {ejabberd_ctl,process,1},
         {rpc, '-handle_call/3-fun-0-', 5}]}}

Individually issuing the commands:

    # vmstat
    Thu Dec 17 20:07:19 UTC 2009
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    25  0 705768  63912 123132 239040   53   92   153   711 1089  539 61 38  0  1  0

    # ps_mem.py | grep ejabberd

No output

I've included a screenshot of htop for your viewing pleasure.
http://omploader.org/vMzBvZQ/htop_screen.jpg

I'll give you more relevant info tomorrow.

On Thu, Dec 17, 2009 at 12:16 PM, Martin Langhoff wrote:
> On Thu, Dec 17, 2009 at 1:12 PM, Martin Langhoff wrote
> > On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly wrote:
> >> XS Version: 0.6
> >> 1 GB Physical Ram, 2GB Swap
> >
> > Ok - the RAM is on the low side for an XS but should handle 150 ok.
> >
> >> # ejabberdctl connected-users
> > ...
> > I counted 12 lines in the output of connected-users. That should not
> > cause trouble.
>
> Also - can you get your hands on ps_mem.py, and run it when the
> machine is getting into trouble? I want to correlate the output of
> ps_mem.py for ejabberd vs the number of connected users, run something
> like this on a console
>
> while true; do (echo `date -u `; vmstat; ps_mem.py | grep ejabberd;
> ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;
>
> untested, may need tweaking to work properly. If you run it during the
> day and also during the night, will be most interesting.
>
> cheers,
>
> m
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Thu, Dec 17, 2009 at 1:12 PM, Martin Langhoff wrote:
> On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly wrote:
>> XS Version: 0.6
>> 1 GB Physical Ram, 2GB Swap
>
> Ok - the RAM is on the low side for an XS but should handle 150 ok.
>
>> # ejabberdctl connected-users
> ...
> I counted 12 lines in the output of connected-users. That should not
> cause trouble.

Also - can you get your hands on ps_mem.py, and run it when the machine is getting into trouble? I want to correlate the output of ps_mem.py for ejabberd vs the number of connected users. Run something like this on a console:

    while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd; ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done

Untested, may need tweaking to work properly. If you run it during the day and also during the night, the results will be most interesting.

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
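The one-liner in this message can also be written out as a small script, which is easier to tweak. The sketch below is one possible shape, not a tested replacement: the function names sample and monitor are invented here, and grepping for erl or beam in addition to ejabberd is an assumption based on later messages in the thread. Tools that are not installed are skipped rather than treated as errors.

```shell
#!/bin/bash
# Sketch of the logging loop from this message, split into two functions.
# sample and monitor are hypothetical names.

# Print one sample: UTC timestamp, vmstat, ejabberd memory, user count.
sample() {
    date -u
    if command -v vmstat >/dev/null 2>&1; then
        vmstat
    fi
    if command -v ps_mem.py >/dev/null 2>&1; then
        # Assumption: the process may be listed as ejabberd, erl or beam.
        ps_mem.py | grep -E 'ejabberd|erl|beam'
    fi
    if command -v ejabberdctl >/dev/null 2>&1; then
        ejabberdctl connected-users | wc -l
    fi
    return 0
}

# Usage: monitor LOGFILE [INTERVAL_SECONDS] [COUNT]
# COUNT=0 (the default) keeps sampling until interrupted.
monitor() {
    local log=$1 interval=${2:-60} count=${3:-0} i=0
    while true; do
        sample >> "$log" 2>&1
        i=$((i + 1))
        if [ "$count" -gt 0 ] && [ "$i" -ge "$count" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

Calling `monitor mylog 60` reproduces the original behaviour (one sample a minute, forever). A third argument limits the number of samples, which is handy for a quick trial run.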
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly wrote:
> XS Version: 0.6
> 1 GB Physical Ram, 2GB Swap

Ok - the RAM is on the low side for an XS but should handle 150 ok.

> # ejabberdctl connected-users
...
I counted 12 lines in the output of connected-users. That should not cause trouble.

> After leaving it on all night, load averages hit 30

 - Did you also leave XOs running connected to it, or were XOs completely disconnected?
 - Are you perhaps using an AP that does its own DHCP? One way to check for certain is to connect an XO, and then grep /var/lib/dhcpd/ (or is it /var/spool/dhcpd/ ?) for the MAC address of the XO

> {error_logger,{{2009,12,17},{10,0,25}},"Protocol: ~p: register error:

That crash dump is because it cannot spawn the new thread/process -- there's no hint in it of who/what is hogging them.

Seems that ejabberd is consuming all resources (network handles, RAM) over time, even with no usage or very light usage. This is unexpected. We did a lot of load-testing of ejabberd, with many clients connecting, sending msgs, disconnecting over a period of time and we never saw such resource leaks.

What we saw was memory usage growing a bit with connects/disconnects, and a GC trimming it down periodically. Memory & cpu use was reasonably stable over time, within that see-saw.

Is there anything else that could be odd or non-standard in your setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a short lease? Is there anything in the network between the XOs and the XS?

cheers,

m
--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
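The DHCP check described in this message (grep the lease directory for the XO's MAC address) could be wrapped up as in the sketch below. The helper name find_lease is made up, the MAC in the usage note is a placeholder, and the lease directory is passed in as an argument because the message itself is unsure whether it is /var/lib/dhcpd/ or /var/spool/dhcpd/.

```shell
#!/bin/bash
# Hypothetical helper: list the files under LEASE_DIR that mention MAC.
# Usage: find_lease MAC LEASE_DIR
find_lease() {
    local mac=$1 dir=$2
    # -r: recurse, -i: ignore case, -l: print only file names,
    # -F: treat the MAC as a fixed string rather than a regex.
    grep -rilF -- "$mac" "$dir"
}
```

For example, `find_lease 00:11:22:33:44:55 /var/lib/dhcpd/` prints the lease file if the XS handed that XO its address. No output (and a non-zero exit status) would suggest that something else on the network answered the DHCP request.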
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
XS Version: 0.6 1 GB Physical Ram, 2GB Swap 154 XO's Registered, Any number connected when the problem happens, 0-XX The XS is controlling dhcp but nothing out of the ordinary as far as leases are concerned. No Active Antenna # /home/idmgr/list_registration http://pastebin.com/m762076bb # ejabberdctl stats registeredusers 154 # ejabberdctl connected-users 032a8890f8a9731cfc611580524176a1f8f6c...@schoolserver.notredame.sn/Telepathy 0a0c7fd971cdd25851ba34c9df66ef1845900...@schoolserver.notredame.sn/Telepathy 1c058ff553b654a3d808a3ffe95aadf4de841...@schoolserver.notredame.sn/Telepathy 26b8669a3e9387ac726296de07deced5aaf49...@schoolserver.notredame.sn/Telepathy 2f596cc8d6977519411f5c8fcc65e751e8bd3...@schoolserver.notredame.sn/Telepathy 909785500a4fc5e14fe9f1cd7657e7ac34440...@schoolserver.notredame.sn/Telepathy 9b2102f9af673393c9faa1f3565bd28773f48...@schoolserver.notredame.sn/Telepathy b4e5426593e58970c1b5dafa2adb39e4c3e59...@schoolserver.notredame.sn/Telepathy b7b58f3b01f49c8c652ddaedffd6faeef555b...@schoolserver.notredame.sn/Telepathy efb20aece0870421fc0f3facc58653bdac922...@schoolserver.notredame.sn/Telepathy f9b21026d27589b02b894e221e5531cd1edd1...@schoolserver.notredame.sn/Telepathy # olpc-netstatus //The XO's are using gabble After leaving it on all night, load averages hit 30 It was unresponsive and any calls to ejabberdctl yielded the following error: #ejabberdctl --node ejabb...@schoolserver connected-users __ {error_logger,{{2009,12,17},{10,0,25}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]} 
{error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,<0.20.0>},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,<0.8.0>]},{messages,[]},{links,[#Port<0.84>,<0.17.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,506}],[]]} {error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]} {error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]} {error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,<0.7.0>},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[<0.6.0>]},{messages,[{'EXIT',<0.8.0>,normal}]},{links,[<0.6.0>,<0.5.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]} {error_logger,{{2009,12,17},{10,0,26}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]} {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"} Crash dump was written to: erl_crash.dump Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}) ___ Server-devel mailing list 
Server-devel@lists.laptop.org http://lists.laptop.org/listinfo/server-devel
Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes
Hi Devon,

Sure we can debug this. Lots of questions for you:

 - Version of XS?
 - How much physical RAM?
 - Number of XOs registered, and in use on the network when the problem happens
 - Output of the commands suggested in http://wiki.laptop.org/go/XS_Techniques_and_Configuration#Presence_Service_.28ejabberd.29_Troubleshooting
 - Is there anything in the network that may be forcing lots of dhcpd lease reassigns? Is the XS controlling dhcp for the XOs?
 - Are you by any chance using our old (and now unsupported) 'Active Antenna' on the XS?

cheers,

m

On Wed, Dec 16, 2009 at 8:28 PM, Devon Connolly wrote:
> I'm having some issues with ejabberd after re-flashing and re-registering a
> student's XO. No other changes were made to the server; however, the beam
> process has begun to constantly use 100% cpu while the ram usage swells to
> over 1GB and then proceeds to eat the 2GB swap. This continues until the
> load average of the server reaches ~14,14,14 at which time the server
> becomes unresponsive.
>
> Multiple erl crash logs are being created (about 5-10 per minute) in
> /var/log/ejabberd. A brief excerpt:
>
> erl_crash_20091216-124645.dump
>
> =erl_crash_dump:0.1
> Wed Dec 16 12:46:47 2009
> Slogan: Kernel pid terminated (application_controller)
> ({application_start_failure, kernel, {shutdown, {kernel, start, [normal,
> []]}}})
> System version: Erlang (BEAM) emulator version 5.6.5 [source]
> [async-threads:0] [hipe][kernel-poll:false]
>
> Anyway, each of these crash dump files is thousands of lines long. Any ideas
> for debugging this?
>
> Thanks

--
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
[Server-devel] Ejabberd CPU/RAM Spike -> Crashes
I'm having some issues with ejabberd after re-flashing and re-registering a student's XO. No other changes were made to the server; however, the beam process has begun to constantly use 100% cpu while the RAM usage swells to over 1GB and then proceeds to eat the 2GB swap. This continues until the load average of the server reaches ~14,14,14, at which time the server becomes unresponsive.

Multiple erl crash logs are being created (about 5-10 per minute) in /var/log/ejabberd. A brief excerpt:

erl_crash_20091216-124645.dump

    =erl_crash_dump:0.1
    Wed Dec 16 12:46:47 2009
    Slogan: Kernel pid terminated (application_controller) ({application_start_failure, kernel, {shutdown, {kernel, start, [normal, []]}}})
    System version: Erlang (BEAM) emulator version 5.6.5 [source] [async-threads:0] [hipe][kernel-poll:false]

Anyway, each of these crash dump files is thousands of lines long. Any ideas for debugging this?

Thanks
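With dumps arriving at 5-10 per minute, a quick summary of the directory may be more useful than reading individual files. The sketch below counts the dumps, shows how much disk they use, and groups them by their "Slogan:" line. The helper name summarize_dumps and the file name pattern are assumptions based on the excerpt above.

```shell
#!/bin/bash
# Hypothetical helper: summarize Erlang crash dumps found directly in DIR.
# Usage: summarize_dumps DIR
summarize_dumps() {
    local dir=$1
    local count
    count=$(find "$dir" -maxdepth 1 -name 'erl_crash*.dump' | wc -l)
    # $((...)) strips any padding that wc adds on some systems.
    echo "dumps: $((count))"
    # Total disk space used by the directory.
    du -sh "$dir"
    # First "Slogan:" line of each dump, grouped and counted.
    find "$dir" -maxdepth 1 -name 'erl_crash*.dump' \
        -exec grep -h -m1 '^Slogan:' {} + | sort | uniq -c | sort -rn
}
```

Running `summarize_dumps /var/log/ejabberd` shows at a glance whether every dump carries the same slogan, and how fast the directory is eating into the root partition.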