Hello all,
we encountered the following situation during an overnight load test.
The following message appears repeatedly in the CouchDB logs:
** Reason for termination ==
** {compaction_loop_died,
{timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
At one point we got it three times within an interval of five seconds, and I am
guessing this is what caused the supervisor to shut down temporarily:
[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
{<0.93.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,
[{pid,<0.10114.14>},
{name,compaction_daemon},
{mfargs,{couch_compaction_daemon,start_link,[]}},
{restart_type,permanent},
{shutdown,brutal_kill},
{child_type,worker}]}]}}
In this particular component load test the CouchDB peer is shut down, so
replication is broken. This means that a lot of background processes try to
replicate and die, and a thread of ours removes the failed replications and
re-creates them (probably no longer a good idea, since CouchDB now detects on
its own when the peer comes back online). I suspect this might be related.
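For context, the cleanup thread does roughly the following (a minimal Python sketch, assuming replications are managed as documents in the _replicator database; the function name, document fields shown, and the sample data are illustrative, not our actual code):

```python
def rebuild_replication_doc(failed_doc):
    """Given a _replicator document whose replication ended in error,
    build a fresh document that would restart the same replication.
    (Illustrative sketch; only the standard _replicator fields are copied,
    the server-managed _rev and _replication_state fields are dropped.)"""
    return {
        "source": failed_doc["source"],
        "target": failed_doc["target"],
        "continuous": failed_doc.get("continuous", False),
    }

# The thread would then DELETE the failed doc (by _id/_rev) and PUT the
# fresh one back into _replicator, which triggers a new replication.
failed = {
    "_id": "rep-1",
    "_rev": "3-abc",
    "source": "http://peer:5984/db",
    "target": "http://localhost:5984/db",
    "continuous": True,
    "_replication_state": "error",
}
print(rebuild_replication_doc(failed))
```

With the peer down, every re-created replication fails again almost immediately, so this loop keeps churning documents and replication processes for the whole outage.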
In the Zenoss graphs we see a very significant spike in I/O reads/writes at
that moment.
Thank you very much for your time; any hints would be appreciated.