On Tue, Aug 13, 2013 at 02:49:28PM -0500, Nathan Vander Wilt wrote:
> I've got 1.7GB disk free and 2GB of memory available at the moment, so it
> doesn't seem to be either of those. (I could not find any out-of-memory
> process kill logs in /var/log/syslog.) The only clue I can find is in
> couchdb.stderr:
> heart_beat_kill_pid = 1390
> heart_beat_timeout = 11
> heart: Tue Aug 13 18:34:21 2013: heart-beat time-out, no activity for 15
> seconds
> Killed
So 15s of system clock time passed without Erlang's heart receiving a
ping back. There are a number of possibilities; for instance, if this is
a VM and the clock was advanced/changed by 15s to synchronize with the
main system, heart might see that and issue a kill command. Another
could be extremely heavy load on the system forcing the second couch
process to get swapped out.
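
If you want to test the clock-jump theory, here is a rough sketch in
Python (not part of CouchDB; the function names, the 5s threshold and
the 1s interval are all my own assumptions). It compares the wall clock
against a monotonic clock and reports when the two disagree, which is
what a clock step from the host would look like. Depending on the
hypervisor the monotonic clock may also be affected, so treat a quiet
log as weak evidence only.

```python
import time


def clock_drift(wall_start, mono_start, wall_now, mono_now):
    """Seconds the wall clock moved beyond what the monotonic clock saw."""
    return (wall_now - wall_start) - (mono_now - mono_start)


def watch(threshold=5.0, interval=1.0, iterations=None):
    """Print a line whenever the wall clock steps by more than `threshold`.

    `iterations=None` means run until interrupted.
    """
    wall_prev, mono_prev = time.time(), time.monotonic()
    done = 0
    while iterations is None or done < iterations:
        time.sleep(interval)
        wall_now, mono_now = time.time(), time.monotonic()
        drift = clock_drift(wall_prev, mono_prev, wall_now, mono_now)
        if abs(drift) > threshold:
            # Timestamp the report so it can be matched against couchdb.stderr.
            print(f"{time.ctime(wall_now)}: wall clock stepped by {drift:+.1f}s",
                  flush=True)
        wall_prev, mono_prev = wall_now, mono_now
        done += 1
```

Calling watch() in a background session and comparing its output with
the timestamps of the heart time-outs should show whether the two line
up.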
Three suggestions:

1. Set RESPAWN_TIMEOUT to a non-zero value to force couch to restart
   after a kill. Because of its crash-only design this is safe, and
   since restarts are rare you're unlikely to run into serious
   issues.
2. Crank up logging to debug level to see what might be going on
   when the heartbeat fails to respond.
3. Add some additional system monitoring to ensure that you're not
   overloading your system on CPU, RAM, I/O or network traffic.

Do you have a lot of views building / heavy system load due to
couchjs processes?
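
For the monitoring suggestion, a minimal sketch in Python, assuming a
Unix-like host (os.getloadavg is not available on Windows). The helper
names and the "load above CPU count" rule of thumb are my own
assumptions, not anything CouchDB ships. It only covers CPU load; RAM,
I/O and network would need a proper monitoring tool.

```python
import os
import time


def is_overloaded(load1, cpu_count, factor=1.0):
    """True when the 1-minute load average exceeds cpu_count * factor."""
    return load1 > cpu_count * factor


def sample():
    """Return one timestamped line describing the current load averages."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    flag = "OVERLOADED" if is_overloaded(load1, cpus) else "ok"
    return (f"{time.ctime()} load={load1:.2f}/{load5:.2f}/{load15:.2f} "
            f"cpus={cpus} {flag}")
```

Printing sample() every few seconds to a file gives you a timeline to
compare against the heart time-out entries in couchdb.stderr.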
--
Joan Touzet | [email protected] | wohali everywhere else