Hello all,
we encountered the following situation during an overnight load test.
The following message appears repeatedly in the CouchDB logs:
** Reason for termination ==
** {compaction_loop_died,
{timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
At one point we got it three times within an interval of five seconds, and I am
guessing this is what caused the supervisor to shut down temporarily:
[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
{<0.93.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,
[{pid,<0.10114.14>},
{name,compaction_daemon},
{mfargs,{couch_compaction_daemon,start_link,[]}},
{restart_type,permanent},
{shutdown,brutal_kill},
{child_type,worker}]}]}}
In this particular component load test the CouchDB peer is shut down, so
replication is broken. This means that a lot of background processes try to
replicate and die, and a thread of ours removes the failed replications and
re-creates them (probably no longer a good idea, since CouchDB now detects on
its own when the peer comes back online). I suspect this might be related.
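For context, the cleanup thread does roughly the following (a minimal Python sketch, assuming replications are managed as documents in the _replicator database; the function name, document fields shown, and the sample data are illustrative, not our actual code):

```python
def rebuild_replication_doc(failed_doc):
    """Given a _replicator document whose replication ended in error,
    build a fresh document that would restart the same replication.
    (Illustrative sketch; only the standard _replicator fields are copied,
    the server-managed _rev and _replication_state fields are dropped.)"""
    return {
        "source": failed_doc["source"],
        "target": failed_doc["target"],
        "continuous": failed_doc.get("continuous", False),
    }

# The thread would then DELETE the failed doc (by _id/_rev) and PUT the
# fresh one back into _replicator, which triggers a new replication.
failed = {
    "_id": "rep-1",
    "_rev": "3-abc",
    "source": "http://peer:5984/db",
    "target": "http://localhost:5984/db",
    "continuous": True,
    "_replication_state": "error",
}
print(rebuild_replication_doc(failed))
```

With the peer down, every re-created replication fails again almost immediately, so this loop keeps churning documents and replication processes for the whole outage.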
In the Zenoss graphs we see a very significant spike in I/O reads/writes at
that moment.
Thank you very much for your time; any hints would be appreciated.