We have turned on debugging for this test and it looks like the cause of this error is the _replicator database.
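A manual check along these lines should show whether compacting _replicator directly also hangs, or whether only the daemon's gen_server call is timing out. This is only a sketch: the localhost:5984 address and the admin credentials are placeholders for whatever the local setup uses.

    # database info; fragmentation is roughly (disk_size - data_size) / disk_size
    curl -s http://admin:password@localhost:5984/_replicator

    # request the compaction directly instead of going through the daemon
    curl -s -X POST -H "Content-Type: application/json" \
         http://admin:password@localhost:5984/_replicator/_compact

    # a "database_compaction" task should appear here if compaction actually started
    curl -s http://admin:password@localhost:5984/_active_tasks

If the manual POST also hangs or errors out, that would point at the database file itself rather than at the compaction daemon.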
After the list of fragmented databases we see no evidence in the log that compaction for this database is ever started (although its fragmentation is reported and is above the 70% threshold; the relevant local.ini settings are sketched at the end of this message), and then the compaction loop dies after approximately 5 seconds. So my guess is that CouchDB fails to spawn the compaction process.

I forgot to mention in the first post that we are running CouchDB 1.6.1 on a CentOS 6.4 server.

Thanks for your time, any help will be appreciated.

-----Original Message-----
From: Ciprian Trusca [mailto:[email protected]]
Sent: Thursday, November 27, 2014 10:17 AM
To: [email protected]
Subject: compaction repeated timeouts cause the server to shut down temporarily when replication is broken

Hello all,

We have encountered the following situation during an overnight load test. We get the following message repeatedly in the couch logs:

    ** Reason for termination ==
    ** {compaction_loop_died,
        {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}

At one point we get it three times within an interval of 5 seconds, and I am guessing this is what causes the supervisor to shut down temporarily:

    [Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
        {<0.93.0>,supervisor_report,
         [{supervisor,{local,couch_secondary_services}},
          {errorContext,shutdown},
          {reason,reached_max_restart_intensity},
          {offender,
           [{pid,<0.10114.14>},
            {name,compaction_daemon},
            {mfargs,{couch_compaction_daemon,start_link,[]}},
            {restart_type,permanent},
            {shutdown,brutal_kill},
            {child_type,worker}]}]}}

In this particular component load test the CouchDB peer is shut down, so replication is broken. This means there are a lot of background processes that try to replicate and die, and there is a thread that removes the failed replications and re-enables them (probably no longer a good idea, since CouchDB now detects on its own that the peer has come back online). I suspect this might be related. In the Zenoss graphs we see a very significant spike in I/O reads/writes at that moment.

Thank you very much for your time, and any hint will be appreciated.
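For reference, the compaction daemon settings in local.ini have roughly this shape on 1.6; the values below are the stock defaults plus a 70% rule, shown for illustration rather than copied from our configuration:

    [compaction_daemon]
    ; how often, in seconds, the daemon scans databases for fragmentation
    check_interval = 300
    ; databases smaller than this (in bytes) are ignored
    min_file_size = 131072

    [compactions]
    ; compact when database/view fragmentation exceeds these thresholds
    _default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]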
