On Aug 13, 2013, at 6:15 PM, Joan Touzet wrote:
> On Tue, Aug 13, 2013 at 02:49:28PM -0500, Nathan Vander Wilt wrote:
>> I've got 1.7GB disk free and 2GB of memory available at the moment, so it
>> doesn't seem to be either of those. (I could not find any out-of-memory
>> process kill logs in /var/log/syslog.) The only clue I can find is in
>> couchdb.stderr:
>> heart_beat_kill_pid = 1390
>> heart_beat_timeout = 11
>> heart: Tue Aug 13 18:34:21 2013: heart-beat time-out, no activity for 15
>> seconds
>> Killed
>
> So 15s of system clock time passed without Erlang's heart receiving a
> ping back. There are a number of possibilities; for instance, if this is a
> VM and the clock was advanced/changed by 15s to synchronize with the
> main system, heart might see that and issue a kill command. Another
> could be extremely heavy load on the system forcing the second couch
> process to get swapped out.
>
> Three suggestions:
>
> 1. set RESPAWN_TIMEOUT to a non-zero value to force couch to restart
> after a kill. Because of its crash-only design this is safe, and
> since restarts are rare you're liable to not really be running
> into serious issues.
> 2. Crank up logging to debug level to see what might be going on
> when the heartbeat fails to respond.
> 3. Add some additional system monitoring to ensure that you're not
> overloading your system on CPU, RAM, I/O or network traffic.
> Do you have a lot of views building / heavy system load due to
> couchjs processes?

Thanks for these suggestions, Joan. Unfortunately, the server seems to be in
quite the opposite situation: this is an m1.medium instance used as a
dev server that spends most of its time neglected ("System load: 0.0", "Memory
usage: 18%"). It mostly just runs the two CouchDB daemons, each getting a half
dozen already-caught-up replications triggered every 10 minutes, and one node.js
server sitting waiting for someone to log in.

It looks like I am already using RESPAWN_TIMEOUT via the -r command line
option. That's why I was surprised the server stayed down for the hour or so it
took for us to notice.

I'm guessing that to crank up logging I need to set _config/log/level to the
string "debug", and I will try that if this problem keeps recurring. I'm
hesitant to simply set it now, since debug logs will eat disk and I already
know that running out of disk space has also caused CouchDB to fail to
respawn ;-)

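For reference, a rough sketch of what I have in mind for the logging change, in Python. It checks free disk space first, then PUTs the JSON string "debug" to _config/log/level. The function names, the http://localhost:5984 address, and the 1 GiB threshold are my own placeholders, and it assumes no admin credentials are needed; treat it as an illustration, not a tested procedure.

```python
import json
import shutil
import urllib.request


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` free (threshold is an arbitrary, assumed value)."""
    return shutil.disk_usage(path).free >= min_free_bytes


def build_log_level_request(base_url, level):
    """Build (but do not send) a PUT to _config/log/level.

    The body is the level encoded as a JSON string, e.g. "debug"
    including the double quotes.
    """
    url = base_url.rstrip("/") + "/_config/log/level"
    body = json.dumps(level).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_log_level(base_url, level, log_path="/", min_free_bytes=1024 ** 3):
    """Send the request only if there is enough free disk space.

    Returns the response body as text, or None if the check failed.
    `log_path` and `min_free_bytes` are hypothetical defaults.
    """
    if not has_free_space(log_path, min_free_bytes):
        return None
    req = build_log_level_request(base_url, level)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling set_log_level("http://localhost:5984", "debug") would be the actual switch; calling it again with "info" should put things back once enough has been captured.
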
thx,
-nvw