Re: Respawning server died, can't figure out why

Nathan Vander Wilt Tue, 20 Aug 2013 07:50:14 -0700

On Aug 13, 2013, at 6:15 PM, Joan Touzet wrote:

> On Tue, Aug 13, 2013 at 02:49:28PM -0500, Nathan Vander Wilt wrote:
>> I've got 1.7GB disk free and 2GB of memory available at the moment, so it 
>> doesn't seem to be either of those. (I could not find any out-of-memory 
>> process kill logs in /var/log/syslog.) The only clue I can find is in 
>> couchdb.stderr:
>>    heart_beat_kill_pid = 1390
>>    heart_beat_timeout = 11
>>    heart: Tue Aug 13 18:34:21 2013: heart-beat time-out, no activity for 15 seconds
>>    Killed
> 
> So 15s of system clock time passed without erlang's heart receiving a
> ping back. There's a number of possibilities; for instance, if this is a
> VM and the clock was advanced/changed by 15s to synchronize with the
> main system, heart might see that and issue a kill command. Another
> could be extremely heavy load on the system forcing the second couch
> process to get swapped out.
> 
> Three suggestions:
> 
>  1. set RESPAWN_TIMEOUT to a non-zero value to force couch to restart
>     after a kill. Because of its crash-only design this is safe, and
>     since restarts are rare you're liable to not really be running
>     into serious issues.
>  2. Crank up logging to debug level to see what might be going on
>     when the heartbeat fails to respond.
>  3. Add some additional system monitoring to ensure that you're not
>     overloading your system on CPU, RAM, I/O or network traffic.
>     Do you have a lot of views building / heavy system load due to
>     couchjs processes?



Thanks for these suggestions, Joan. Unfortunately the server seems to be in 
quite the opposite situation: this is an m1.medium instance used as a dev 
server that spends most of its time neglected ("System load:  0.0", "Memory 
usage: 18%"). It mostly just runs the two CouchDB daemons, each getting a half 
dozen already-caught-up replications triggered every 10 minutes, and one 
node.js server sitting waiting for someone to log in.

It looks like I am already using RESPAWN_TIMEOUT via the -r command line 
option. That's why I was surprised the server stayed down for the hour or so it 
took for us to notice.

I'm guessing that to crank up logging I need to set _config/log/level to the 
string "debug", and I will try that if this problem keeps recurring. I'm 
hesitant to simply set it now, as I already know that running out of disk space 
has also caused CouchDB to fail to respawn ;-)
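
For reference, a rough sketch of how that could be scripted: check free disk 
space first, then PUT the JSON string "debug" to _config/log/level. The base 
URL, the 500 MB floor, and the helper names are my own assumptions, not 
anything from the CouchDB docs, and a server with admins configured would also 
need credentials on the request.

```python
import json
import shutil
import urllib.request

# Assumptions (not from the thread): CouchDB 1.x on the default local port,
# and a hypothetical 500 MB free-space floor before enabling debug logging.
COUCH_URL = "http://127.0.0.1:5984"
MIN_FREE_BYTES = 500 * 1024 * 1024


def has_free_space(path, min_free_bytes=MIN_FREE_BYTES):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= min_free_bytes


def build_log_level_request(level, base_url=COUCH_URL):
    """Build (but do not send) a PUT to _config/log/level.

    The config API takes the new value as a JSON string, e.g. "debug".
    """
    return urllib.request.Request(
        url=base_url + "/_config/log/level",
        data=json.dumps(level).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def set_log_level_if_safe(level, log_dir, base_url=COUCH_URL):
    """Send the request only if the log volume has room; return True if sent."""
    if not has_free_space(log_dir):
        return False
    with urllib.request.urlopen(build_log_level_request(level, base_url)) as resp:
        resp.read()  # drain the response body
    return True
```

Calling set_log_level_if_safe("debug", "/var/log/couchdb") would then only 
raise the level when the disk has room (that log path is also an assumption 
about the install).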

thx,
-nvw
