Hello all,
we have encountered the following situation during an overnight load test.

We get the following message repeatedly in the CouchDB logs:

** Reason for termination ==

** {compaction_loop_died,

       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
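If I read the termination reason correctly, this is the generic OTP `gen_server:call` timeout: the caller asked the compaction daemon (`<0.117.0>`) to `start_compact` and got no reply within the default 5000 ms, most likely because that process was blocked on disk IO. A sketch of the generic mechanism (this is not the actual CouchDB source, just the OTP behaviour that produces this reason):

```erlang
%% gen_server:call/2 waits at most 5000 ms for a reply; if the server
%% is blocked (e.g. on disk IO), the *caller* exits with
%%   {timeout, {gen_server, call, [Pid, start_compact]}}
gen_server:call(CompactionPid, start_compact),        %% implicit 5 s timeout
gen_server:call(CompactionPid, start_compact, 60000). %% explicit 60 s timeout
```

So the repeated message would be consistent with the compaction daemon being starved of IO rather than actually dead.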



At one point we got it three times within an interval of 5 seconds, and I am 
guessing this is what caused the supervisor to shut down temporarily:


[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
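The `reached_max_restart_intensity` reason is the standard OTP supervisor behaviour: if a child is restarted more than MaxR times within MaxT seconds, the supervisor gives up and terminates itself (and all its children). A sketch of the generic mechanism, with illustrative MaxR/MaxT values (I don't know CouchDB's actual settings; the child spec fields below are taken from the offender report above):

```erlang
%% Illustrative supervisor init/1: if more than MaxR restarts occur
%% within MaxT seconds, the supervisor shuts down with
%% reached_max_restart_intensity instead of restarting again.
init([]) ->
    MaxR = 3, MaxT = 10,  %% assumed values, not CouchDB's real ones
    {ok, {{one_for_one, MaxR, MaxT},
          [{compaction_daemon,
            {couch_compaction_daemon, start_link, []},
            permanent, brutal_kill, worker,
            [couch_compaction_daemon]}]}}.
```

That would explain why three timeouts in 5 seconds are enough to take the whole `couch_secondary_services` supervisor down rather than just restarting the daemon again.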



In this particular component load test the CouchDB peer is shut down, so 
replication is broken. This means there are a lot of background processes 
that try to replicate and die, plus a thread of ours that removes the failed 
replications and re-enables them (probably no longer a good idea, since 
CouchDB now detects on its own when the peer comes back online). I suspect 
this might be related.



In the Zenoss graphs we see a very significant spike in IO reads/writes at 
that moment.



Thank you very much for your time, and any hint will be appreciated.
