Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-28 Thread Martin Langhoff

On Mon, Dec 21, 2009 at 3:36 PM, Martin Langhoff
 wrote:
> On Mon, Dec 21, 2009 at 3:32 PM, Martin Langhoff
>  wrote:
>> I've added a big lock around the process, so from now on Moodle
>> processes won't overlap in this sync. This means that your server is
>> now running a lightly patched Moodle -- I will release this as a new
>> rpm soon.
>
> Filed as http://dev.laptop.org/ticket/9922 -

And fixed. There is now a fixed moodle-xs rpm, so

   yum --enablerepo=olpcxs-testing update moodle-xs

will give you the Moodle with the relevant fix. I have also posted to
the ejabberd list (as you've probably seen).

cheers,


m
-- 
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
___
Server-devel mailing list
Server-devel@lists.laptop.org
http://lists.laptop.org/listinfo/server-devel

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread Martin Langhoff

On Mon, Dec 21, 2009 at 4:18 PM, crodas  wrote:
> According to the ticket, the solution is a lock file that prevents
> re-execution. Well, I wrote a dumb script a while ago. It's not
> innovative, but it might help:

Thanks! The lock I coded is using a moodle-specific bit of code, so
it's in PHP, using moodle internal calls, and it uses PostgreSQL's
atomicity to achieve it in a non-racey way.

This has the advantage of locking a smaller bit of code -- not the
whole 'cron' run.

BTW, the lock script you're using is a bit racey: a second instance can
start after the if [ -f $LOCK ] test but before the PID is written to
the lock file. If you want to do this safely in shell scripting, use
dotlockfile, which is widely available, even in oldish Linuxes. Recent
Linuxes all include /usr/bin/flock, which is also sanely atomic, so
that's what we use on the XS for this task...
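
A minimal sketch of the flock approach, assuming bash and the flock
utility from util-linux. The lock path and the run_locked name are made
up for illustration; the actual wrapper used on the XS is not shown in
this thread.

```shell
#!/bin/bash
# Hypothetical lock path; override by setting LOCK in the environment.
LOCK="${LOCK:-/tmp/ejabberd-sync.lock}"

# run_locked CMD [ARGS...]
# Runs CMD only if no other process currently holds the lock.
run_locked() {
    (
        # Take an exclusive lock on fd 9 without blocking. The kernel
        # tests and claims the lock in one step, so there is no window
        # between "check" and "claim" as there is with a PID file.
        flock -n 9 || exit 75   # 75: lock busy (arbitrary code chosen here)
        "$@"
    ) 9>"$LOCK"
}
```

Something like "run_locked some-sync-command" then skips the run when a
previous one is still going. The lock goes away when the subshell exits,
even after a crash, so no stale lock file has to be cleaned up.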

cheers,

m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread crodas

Hello,

According to the ticket, the solution is a lock file that prevents
re-execution. Well, I wrote a dumb script a while ago. It's not
innovative, but it might help:

#!/bin/bash

LOCK=/tmp/erlang.lock
CMD=$1
if [ -f $LOCK ]
then
    PID=`cat $LOCK`
    UP=`ps $PID | wc -l`
    if [ $UP -gt 1 ]
    then
        exit;
    fi
fi
echo $$ > $LOCK
$CMD

If it's worth adding to the repository, please let me know and I will
submit a patch.

Cheers, 


Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread Devon Connolly

Ok then.  Thanks a lot for the assistance.  Things seem to be back to  
normal.  I will look closer tomorrow when the kids are here.



Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread Martin Langhoff

On Mon, Dec 21, 2009 at 3:32 PM, Martin Langhoff
 wrote:
> I've added a big lock around the process, so from now on Moodle
> processes won't overlap in this sync. This means that your server is
> now running a lightly patched Moodle -- I will release this as a new
> rpm soon.

Filed as http://dev.laptop.org/ticket/9922 -

cheers,



m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread Martin Langhoff

On Mon, Dec 21, 2009 at 3:14 PM, Martin Langhoff
 wrote:
> Now it's up in a pristine state, and I am monitoring it...

Ok - the problem seems related to Moodle's control of ejabberd
presence service. The sync between Moodle and ejabberd data (in
mnesia) was taking too long, and a second Moodle sync process would
start... and then a 3rd... and then...

This led to errors that should be benign (an error reported in the
logs, but not leading to a functional problem) -- because ejabberd's
internals are all about supporting things that happen concurrently.
But! something inside ejabberd isn't liking the concurrency.

I've added a big lock around the process, so from now on Moodle
processes won't overlap in this sync. This means that your server is
now running a lightly patched Moodle -- I will release this as a new
rpm soon.

According to ps_mem.py, beam started at 14MB and has now grown to 16MB;
this is with no users connected. In normal operation (once users
connect), I would expect it to grow to ~40MB.
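
For a rough cross-check of figures like these without ps_mem.py, one can
sum the resident set size that ps reports for the beam processes. This
is only a sketch: RSS counts shared pages once per process, so it will
not match ps_mem.py exactly, and sum_rss_mb is a name invented here.

```shell
#!/bin/bash
# sum_rss_mb: read one RSS value in KiB per line on stdin and print the
# total in MiB with one decimal place.
sum_rss_mb() {
    awk '{ kb += $1 } END { printf "%.1f\n", kb / 1024 }'
}

# Example (assumes the Erlang VM process is named "beam", as in this thread):
#   ps -C beam -o rss= | sum_rss_mb
```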

cheers,


m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-21 Thread Martin Langhoff

On Sun, Dec 20, 2009 at 12:57 PM, Martin Langhoff
 wrote:
> Yep, I am interested in getting to the bottom of this.

I think I have an initial assessment of the situation.

Clearly, the mnesia DB got corrupted somehow. Because of that...

 - the init script cannot stop ejabberd normally...

 - killall -9 beam kills the beam processes, which get restarted right
away (such is the magic of erlang's engine "failsafe design") by
epmd...

 - Moodle's cronjob talks to ejabberd every 5 minutes. When ejabberd
is broken, you get a pileup of php scripts trying to run ejabberdctl
again and again.

so your attempts to follow my instructions (stop ejabberd, remove
corrupt DB, start it again) didn't succeed.
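
One way to keep such a pileup from forming, sketched here under
assumptions, is to bound each call with timeout(1) from GNU coreutils so
that a hung ejabberdctl is killed before the next cron run. Neither the
wrapper name nor the 60-second figure comes from this thread, and this
is not what the Moodle fix itself does.

```shell
#!/bin/bash
# bounded SECONDS CMD [ARGS...]
# Runs CMD under a deadline: TERM is sent when SECONDS have passed, and
# KILL 5 seconds later if the command still has not exited.
# timeout exits with 124 when the command timed out and stopped on TERM,
# and with 137 if KILL was needed.
bounded() {
    local secs=$1
    shift
    timeout -k 5 "$secs" "$@"
}

# Example: give ejabberdctl at most 60 seconds (figure chosen arbitrarily).
#   bounded 60 ejabberdctl connected-users
```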

Now it's up in a pristine state, and I am monitoring it...



m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-20 Thread Martin Langhoff

On Sat, Dec 19, 2009 at 7:32 PM, Devon Connolly  wrote:
>
>>  - Is there any disk anomaly? (Reboot forcing a fsck?)
>
> Not that I've noticed.

Ok, but can you try doing a reboot that forces fsck? As follows:

 touch /forcefsck
 reboot

or

  shutdown -Fr now

> Verify checked out on the ejabberd-xs package.

There might be something with the erlang binaries?

> There isn't much sense in reposting the results of the script, as the
> results are essentially the same.  As ejabberd is crashing, I cannot kill
> it to reapply the domain change.  I can set you up an ssh account so you
> can get a look at what is going on.  Perhaps you will see something I am
> overlooking.  Let me know and I will send you the info.

Yep, I am interested in getting to the bottom of this. You'll see a
private email from me soon.

cheers,



m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-19 Thread Devon Connolly


>  - Is there any disk anomaly? (Reboot forcing a fsck?)

Not that I've noticed.

>
>  - Is there any problem in the binaries? If you run rpm with the
> 'verify' options, it'll check that no binaries have been corrupted
> on-disk... It's normal to see some config files changed, but no
> binaries should be different from the rpms.

Verify checked out on the ejabberd-xs package.

There isn't much sense in reposting the results of the script, as the  
results are essentially the same.  As ejabberd is crashing, I cannot kill  
it to reapply the domain change.  I can set you up an ssh account so you  
can get a look at what is going on.  Perhaps you will see something I am  
overlooking.  Let me know and I will send you the info.


Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-19 Thread Martin Langhoff

On Sat, Dec 19, 2009 at 1:31 PM, Devon Connolly  wrote:
> Changing the domain, I still get the following error when it tries (and
> fails) to shut down ejabberd.

As it doesn't stop cleanly, shut down ejabberd by hand, kill -9 it if
needed, and then change the domain twice to clear the DB. Then start
it up by hand.
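
The "stop by hand, kill -9 if needed" step could look roughly like this.
It is a sketch only: stop_then_kill is a hypothetical helper, the
10-second grace period is arbitrary, and the domain change itself is
left to the domain_config script, whose arguments are not given here.

```shell
#!/bin/bash
# stop_then_kill PID [GRACE_SECONDS]
# Ask the process to exit with TERM; escalate to KILL only if it is
# still alive after the grace period.
stop_then_kill() {
    local pid=$1 grace=${2:-10} waited=0
    # If TERM cannot be delivered, the process is not running
    # (or is not ours to signal), so there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited on TERM
        sleep 1
        waited=$((waited + 1))
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example (assumes the VM process is named "beam", as in this thread):
#   for pid in $(pgrep -x beam); do stop_then_kill "$pid"; done
```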

cheers,


m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-19 Thread Martin Langhoff

On Sat, Dec 19, 2009 at 1:31 PM, Devon Connolly  wrote:
> Beam is still consuming 100% of the cpu after a few minutes.  I'm going to
> leave that script running to see what it does over the next few hours.

That's really abnormal.

 - Is there any disk anomaly? (Reboot forcing a fsck?)

 - Is there any problem in the binaries? If you run rpm with the
'verify' options, it'll check that no binaries have been corrupted
on-disk... It's normal to see some config files changed, but no
binaries should be different from the rpms.
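
As a sketch of that check: rpm -V prints one line per file that differs
from the package, and marks config files with a "c" in the second
column. The filter below keeps only the non-config lines, which are the
ones that would point to damaged binaries. The helper name is invented
here; ejabberd-xs is the package name used elsewhere in this thread.

```shell
#!/bin/bash
# non_config_changes: read `rpm -V` output on stdin and print only the
# lines for files that are not marked as config files ("c").
non_config_changes() {
    awk 'NF && $2 != "c"'
}

# Example:
#   rpm -V ejabberd-xs | non_config_changes
```

No output from the filter would mean that only config files differ from
what the rpm shipped.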

> I imagine I now have to re-register all XO's?

Nope. The DB gets rebuilt automagically for you, 100%, on XS-0.6.

cheers,



m



Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-19 Thread Devon Connolly

Changing the domain, I still get the following error when it tries (and
fails) to shut down ejabberd.

___
Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
{error_logger,{{2009,12,19},{12,19,16}},"Protocol: ~p: register error:
~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,<0.20.0>},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,<0.8.0>]},{messages,[]},{links,[#Port<0.84>,<0.17.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,505}],[]]}
{error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,<0.7.0>},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[<0.6.0>]},{messages,[{'EXIT',<0.8.0>,normal}]},{links,[<0.6.0>,<0.5.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]}
{error_logger,{{2009,12,19},{12,19,16}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid
terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
__

Beam is still consuming 100% of the cpu after a few minutes.  I'm going to
leave that script running to see what it does over the next few hours.

I imagine I now have to re-register all XO's?




[Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-19 Thread Devon Connolly

Here is another example after it has been running all night.

http://pastebin.com/m11537281

As you can see, these runaway beam processes vary greatly in their RAM
usage.  Also, they are always using 100% of the cpu.

I will try to clear the DB now and see what happens.




Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-18 Thread Martin Langhoff

On Fri, Dec 18, 2009 at 1:37 PM, Devon Connolly  wrote:
> Anyway, back on topic...  Here is that script slightly modified running on
> a fresh boot.  I'm going to leave this looping and post the file to
> pastebin.  Here is an initial output after only like 10 minutes.  It will
> get more interesting over time.  I'll paste another later this afternoon.

outrageous. beam should have only ~40MB in use, total.

if you 'clear' the mnesia db as i suggested (keep a copy for
forensics!), does it get better?



m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-18 Thread Devon Connolly


> Don't reinstall. If possible, let's try to debug this. If you're going
> to give up, just
>
> 1 - Backup /var/lib/ejabberd -- just tar it up
> 2 - Use the 'domain_config' script to change the domain -- this will
> re-generate the ejabberd mnesia database. What I'd do: change it to
> 'foo.com' and then back to the right domain.
>
I'd like to debug, but I only have about a week left here, so I need the
server to be stable before I leave.  I can debug for a while, but as we
approach the holidays, I may need to throw in the towel.

> I assume you have the different APs in different channels, and
> generally avoid channel 1 (as that's where XOs engage in 'mesh' by
> default...)...
>

What we really need is an RF site survey.  Unfortunately, there is nobody
around who can do one.  The APs are on different channels, but I am forced
to use all 3 channels in such a small space.  We also have some rude
neighbors who decided to amplify their WiFi on channel 6, essentially
blanketing the school with interference on that channel.  So I have 1 AP
on channel 6, 2 on channel 1, and 2 on channel 11.

Anyway, back on topic...  Here is that script slightly modified running on  
a fresh boot.  I'm going to leave this looping and post the file to  
pastebin.  Here is an initial output after only like 10 minutes.  It will  
get more interesting over time.  I'll paste another later this afternoon.

http://pastebin.com/m3426a094

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-17 Thread Martin Langhoff

On Thu, Dec 17, 2009 at 9:32 PM, Devon Connolly  wrote:
> The server had an uptime of about 50 days before this occurred.  There were
> no problems and nothing has changed in the 2 or so days since this problem
> began.  Like I had said previously, it seems to have occurred since reflashing
> and re-registering a student's XO, but I believe that to be a coincidence.

Hmmm, maybe something's gone wonky on the mnesia DB.

> We are using 5 wireless AP's.  4 of which are Linksys WRT54G's running
> DD-WRT and one is a D-Link modem/AP combo.  DHCP is deactivated on all of
> the above.

Good.

>> - Did you also leave XOs running connected to it, or were XOs
>> completely disconnected?
>
> I believe all XO's were disconnected.  It is possible some were left
> connected while in their charging cabinets, but doubtful.

Ok. Then ejabberd is getting messed up all on its own...

> Nothing non-standard really.  eth0 is fixed.

good

> Although, this server came
> pre-installed from the folks involved with the Give One Get One program in
> Rwanda.  I'm not sure what was modified from the stock server install.  I am
> debating reinstalling the server from scratch.

Don't reinstall. If possible, let's try to debug this. If you're going
to give up, just

1 - Backup /var/lib/ejabberd -- just tar it up
2 - Use the 'domain_config' script to change the domain -- this will
re-generate the ejabberd mnesia database. What I'd do: change it to
'foo.com' and then back to the right domain.

> I attribute this behavior to the Linksys AP's as they only seem to
> handle about 20 connections per AP reliably.

yeah. we've seen that plenty.

>  There is also a good amount of
> wireless interference to contend with; however, the server was working
> well.

I assume you have the different APs in different channels, and
generally avoid channel 1 (as that's where XOs engage in 'mesh' by
default...)...


>>while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
>>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;
>
> Tried the script at night with the high load, and it cannot complete as the
> ejabberd node has since crashed.  ejabberdctl yields the following error:

Can you restart ejabberd and try that script?


> # ps_mem.py | grep ejabberd
>
> No output

Did you download ps_mem.py, and make it executable? (google the name
if needed) If so, you might want to grep for erl instead.

> I've included a screenshot of htop for your viewing pleasure.
> http://omploader.org/vMzBvZQ/htop_screen.jpg

ejabberd sure looks busy there...



m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-17 Thread Devon Connolly

The server had an uptime of about 50 days before this occurred.  There were
no problems and nothing has changed in the 2 or so days since this problem
began.  Like I had said previously, it seems to have occurred since reflashing
and re-registering a student's XO, but I believe that to be a coincidence.

> - Are you perhaps using an AP that does its own DHCP? One way to
> check for certain is to connect an XO, and then grep /var/lib/dhcpd/
> (or is it /var/spool/dhcpd/ ?) for the MAC address of the XO

We are using 5 wireless AP's.  4 of which are Linksys WRT54G's running
DD-WRT and one is a D-Link modem/AP combo.  DHCP is deactivated on all of
the above.

> - Did you also leave XOs running connected to it, or were XOs
> completely disconnected?

I believe all XO's were disconnected.  It is possible some were left
connected while in their charging cabinets, but doubtful.

>Is there anything else that could be odd or non-standard in your
>setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
>short lease? Is there anything in the network between the XOs and the
>XS?

Nothing non-standard really.  eth0 is fixed.  Although, this server came
pre-installed from the folks involved with the Give One Get One program in
Rwanda.  I'm not sure what was modified from the stock server install.  I am
debating reinstalling the server from scratch.

I haven't been paying as much attention to the server lately as I should.
As it had been running for about 50 days, I only checked in with the school
periodically.  There were problems but mainly in relation to the presence
service and reliably connecting 30 - 100 laptops to the network at one
time.  I attribute this behavior to the Linksys AP's as they only seem to
handle about 20 connections per AP reliably.  There is also a good amount of
wireless interference to contend with; however, the server was working
well.  As it is a bit under-powered, load averages generally stay within the
1.2-1.5 range.

As I write this, the server has an uptime of about 9 hours.  Load averages
have reached 25 across the board.  The dump files have consumed over a gig
of space filling up the root partition.

>while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
>ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

Tried the script at night with the high load, and it cannot complete as the
ejabberd node has since crashed.  ejabberdctl yields the following error:

_
RPC failed on the node ejabb...@schoolserver: {'EXIT',
   {badarg,
[{ets,lookup,
  [hooks,
   {ejabberd_ctl_process,
global}]},

{ejabberd_hooks,run_fold,4},
 {ejabberd_ctl,process,1},
 {rpc,
  '-handle_call/3-fun-0-',
  5}]}}
__

Individually issuing the commands:
# vmstat
Thu Dec 17 20:07:19 UTC 2009
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
25  0 705768  63912 123132 239040   53   92   153   711 1089  539 61 38  0  1  0

# ps_mem.py | grep ejabberd

No output

I've included a screenshot of htop for your viewing pleasure.

http://omploader.org/vMzBvZQ/htop_screen.jpg

I'll give you more relevant info tomorrow.


Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-17 Thread Martin Langhoff

On Thu, Dec 17, 2009 at 1:12 PM, Martin Langhoff
 wrote
> On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly  wrote:
>> XS Version: 0.6
>> 1 GB Physical Ram, 2GB Swap
>
> Ok - the RAM is on the low side for an XS but should handle 150 ok.
>
>> # ejabberdctl connected-users
> ...
> I counted 12 lines in the output of connected-users. That should not
> cause trouble.

Also - can you get your hands on ps_mem.py, and run it when the
machine is getting into trouble? I want to correlate the output of
ps_mem.py for ejabberd with the number of connected users. Run something
like this on a console:

while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

Untested, may need tweaking to work properly. It will be most
interesting if you run it both during the day and during the night.

cheers,


m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-17 Thread Martin Langhoff

On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly  wrote:
> XS Version: 0.6
> 1 GB Physical Ram, 2GB Swap

Ok - the RAM is on the low side for an XS but should handle 150 ok.

> # ejabberdctl connected-users
...
I counted 12 lines in the output of connected-users. That should not
cause trouble.

> After leaving it on all night, load averages hit 30

 - Did you also leave XOs running connected to it, or were XOs
completely disconnected?

 - Are you perhaps using an AP that does its own DHCP? One way to
check for certain is to connect an XO, and then grep /var/lib/dhcpd/
(or is it /var/spool/dhcpd/ ?) for the MAC address of the XO
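
A sketch of that lease check, assuming the lease files are plain text
and live in one of the two directories guessed at above. The helper name
and the example MAC address are placeholders.

```shell
#!/bin/bash
# lease_files_for MAC DIR...
# Print the names of files under the given directories that mention MAC
# (case-insensitive, fixed-string match). If the XS handed out the
# lease, its lease file should be listed; no output may mean that some
# other DHCP server answered the XO.
lease_files_for() {
    local mac=$1
    shift
    grep -rilF -- "$mac" "$@" 2>/dev/null
}

# Example (placeholder MAC address):
#   lease_files_for 00:17:c4:00:00:01 /var/lib/dhcpd /var/spool/dhcpd
```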

> {error_logger,{{2009,12,17},{10,0,25}},"Protocol: ~p: register error:

That crash dump is because it cannot spawn the new thread/process --
there's no hint in it of who/what is hogging them.

Seems that ejabberd is consuming all resources (network handles, RAM)
over time, even with no usage or very light usage. This is unexpected.
We did a lot of load-testing of ejabberd, with many clients
connecting, sending msgs, disconnecting over a period of time and we
never saw such resource leaks.

What we saw was memory usage growing a bit with connects/disconnects,
and a GC trimming it down periodically. Memory & cpu use was
reasonably stable over time, within that see-saw.

Is there anything else that could be odd or non-standard in your
setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
short lease? Is there anything in the network between the XOs and the
XS?

cheers,

m

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-17 Thread Devon Connolly

XS Version: 0.6
1 GB Physical Ram, 2GB Swap
154 XO's Registered, Any number connected when the problem happens, 0-XX
The XS is controlling dhcp but nothing out of the ordinary as far as  
leases are concerned.
No Active Antenna

# /home/idmgr/list_registration
http://pastebin.com/m762076bb

# ejabberdctl stats registeredusers
154

# ejabberdctl connected-users

032a8890f8a9731cfc611580524176a1f8f6c...@schoolserver.notredame.sn/Telepathy
0a0c7fd971cdd25851ba34c9df66ef1845900...@schoolserver.notredame.sn/Telepathy
1c058ff553b654a3d808a3ffe95aadf4de841...@schoolserver.notredame.sn/Telepathy
26b8669a3e9387ac726296de07deced5aaf49...@schoolserver.notredame.sn/Telepathy
2f596cc8d6977519411f5c8fcc65e751e8bd3...@schoolserver.notredame.sn/Telepathy
909785500a4fc5e14fe9f1cd7657e7ac34440...@schoolserver.notredame.sn/Telepathy
9b2102f9af673393c9faa1f3565bd28773f48...@schoolserver.notredame.sn/Telepathy
b4e5426593e58970c1b5dafa2adb39e4c3e59...@schoolserver.notredame.sn/Telepathy
b7b58f3b01f49c8c652ddaedffd6faeef555b...@schoolserver.notredame.sn/Telepathy
efb20aece0870421fc0f3facc58653bdac922...@schoolserver.notredame.sn/Telepathy
f9b21026d27589b02b894e221e5531cd1edd1...@schoolserver.notredame.sn/Telepathy

# olpc-netstatus
//The XO's are using gabble

After leaving it on all night, load averages hit 30.  It was
unresponsive and any calls to ejabberdctl yielded the following error:

#ejabberdctl --node ejabb...@schoolserver connected-users
__
{error_logger,{{2009,12,17},{10,0,25}},"Protocol: ~p: register error:  
~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,<0.20.0>},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,<0.8.0>]},{messages,[]},{links,[#Port<0.84>,<0.17.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,506}],[]]}
{error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,<0.7.0>},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[<0.6.0>]},{messages,[{'EXIT',<0.8.0>,normal}]},{links,[<0.6.0>,<0.5.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]}
{error_logger,{{2009,12,17},{10,0,26}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
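
One possible reading of the error above, offered as a sketch and not a confirmed diagnosis: `{error,duplicate_name}` raised from `inet_tcp_dist:listen` is what an Erlang node reports when the short name it asks for is already registered on the host, and the `net_kernel,start_link` entry shows that the name requested here was `ejabberdctl`. That would fit an earlier ejabberdctl call still running while a new one starts. The Python helper below pulls both facts out of pasted error output. The names `NODE_NAME` and `diagnose` are hypothetical and are not part of ejabberd.

```python
import re

# Matches the node name in a term such as
# {net_kernel,start_link,[[ejabberdctl,shortnames]]}
NODE_NAME = re.compile(
    r"\{net_kernel,start_link,\[\[([^,\]]+),(?:shortnames|longnames)\]\]\}"
)


def diagnose(output):
    """Return a one-line diagnosis for ejabberdctl error output, or None.

    Only the duplicate_name case is recognised; anything else yields None.
    """
    if "duplicate_name" not in output:
        return None
    match = NODE_NAME.search(output)
    node = match.group(1) if match else "unknown"
    return "node name '%s' is already registered" % node
```

Feeding it the output above should report the `ejabberdctl` node name. If that reading is right, checking for leftover ejabberdctl processes before retrying would be a reasonable next step.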

Re: [Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-16 Thread Martin Langhoff

Hi Devon,

Sure, we can debug this. Lots of questions for you:

 - Version of XS?

 - How much physical RAM?

 - Number of XOs registered, and in use on the network when the problem happens

 - Output of the commands suggested in
http://wiki.laptop.org/go/XS_Techniques_and_Configuration#Presence_Service_.28ejabberd.29_Troubleshooting

 - Is there anything in the network that may be forcing lots of dhcpd
lease reassigns? Is the XS controlling dhcp for the XOs?

 - Are you by any chance using our old (and now unsupported) 'Active
Antenna' on the XS?
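
Several of the answers above can be collected in one go. The following is a minimal Python sketch, assuming a Linux host with `/proc/meminfo` and an `ejabberdctl` on the PATH that accepts the `stats registeredusers` and `connected-users` subcommands used elsewhere in this thread. The function names are hypothetical.

```python
import subprocess


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None


def collect(runner=run):
    """Gather registered and connected user counts via ejabberdctl.

    `runner` is injectable so the function can be exercised without a
    live ejabberd; by default it shells out with `run`.
    """
    registered = runner(["ejabberdctl", "stats", "registeredusers"]).strip()
    connected = [
        line
        for line in runner(["ejabberdctl", "connected-users"]).splitlines()
        if line.strip()
    ]
    return {"registered": registered, "connected": len(connected)}
```

On the server, `collect()` plus `mem_total_kb(open("/proc/meminfo").read())` would cover the RAM and XO-count questions. The DHCP and Active Antenna questions still need a human answer.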

cheers,


m

On Wed, Dec 16, 2009 at 8:28 PM, Devon Connolly  wrote:
> I'm having some issues with ejabberd after re-flashing and re-registering a
> student's XO. No other changes were made to the server; however, the beam
> process has begun to constantly use 100% CPU while the RAM usage swells to
> over 1GB and then proceeds to eat the 2GB swap.  This continues until the
> load average of the server reaches ~14,14,14, at which time the server
> becomes unresponsive.
>
> Multiple erl crash logs are being created (about 5-10 per minute) in
> /var/log/ejabberd.  A brief excerpt:
>
> erl_crash_20091216-124645.dump
> _
> =erl_crash_dump:0.1
> Wed Dec 16 12:46:47 2009
> Slogan: Kernel pid terminated (application_controller)
> ({application_start_failure, kernel, {shutdown, {kernel, start, [normal,
> []]}}})
> System version: Erlang (BEAM) emulator version 5.6.5 [source]
> [async-threads:0] [hipe][kernel-poll:false]
>
> --
> Anyway, each of these crash dump files is thousands of lines long.  Any ideas
> for debugging this?
>
> Thanks



-- 
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

[Server-devel] Ejabberd CPU/RAM Spike -> Crashes

2009-12-16 Thread Devon Connolly

I'm having some issues with ejabberd after re-flashing and re-registering a
student's XO. No other changes were made to the server; however, the beam
process has begun to constantly use 100% CPU while the RAM usage swells to
over 1GB and then proceeds to eat the 2GB swap.  This continues until the
load average of the server reaches ~14,14,14, at which time the server
becomes unresponsive.

Multiple erl crash logs are being created (about 5-10 per minute) in
/var/log/ejabberd.  A brief excerpt:

erl_crash_20091216-124645.dump
_
=erl_crash_dump:0.1
Wed Dec 16 12:46:47 2009
Slogan: Kernel pid terminated (application_controller)
({application_start_failure, kernel, {shutdown, {kernel, start, [normal,
[]]}}})
System version: Erlang (BEAM) emulator version 5.6.5 [source]
[async-threads:0] [hipe][kernel-poll:false]

--
Anyway, each of these crash dump files is thousands of lines long.  Any ideas
for debugging this?

Thanks
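
Since the dumps are too long to read one by one, a first pass could measure how fast they appear and read only the Slogan line of each. The following is a minimal Python sketch, assuming the dump files follow the `erl_crash_YYYYMMDD-HHMMSS.dump` naming shown in the excerpt. The function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Filename pattern taken from the excerpt: erl_crash_20091216-124645.dump
DUMP_NAME = re.compile(r"^erl_crash_(\d{8}-\d{6})\.dump$")


def dumps_per_minute(filenames):
    """Count crash dumps per minute, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            continue  # skip files that are not timestamped crash dumps
        stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


def slogan(dump_text):
    """Return the rest of the first 'Slogan:' line in a dump, or None."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan:"):
            return line[len("Slogan:"):].strip()
    return None
```

Calling `dumps_per_minute(os.listdir("/var/log/ejabberd"))` would show whether the 5-10 per minute rate is steady or bursty, and `slogan()` on each file would show whether every dump reports the same failure.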