Heya Ciprian, this sounds like a bug. Could you file an issue at https://issues.apache.org/jira/browse/COUCHDB?
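In the meantime it may be worth double-checking the compaction daemon settings in local.ini to confirm that _replicator is actually covered by a rule. A minimal sketch of the relevant sections with the stock 1.6.x defaults (the values are illustrative, not a recommendation):

    ; local.ini -- example values only
    [compaction_daemon]
    ; how often (in seconds) the daemon checks fragmentation
    check_interval = 300
    ; database/view files smaller than this (bytes) are skipped
    min_file_size = 131072

    [compactions]
    ; the _default rule applies to every database, including _replicator
    _default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]

If the _replicator file is smaller than min_file_size, the daemon skips it regardless of fragmentation, which could explain why no compaction start shows up in your log.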
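Also, the ~5 seconds you describe below lines up with the default gen_server:call timeout of 5000 ms: the compaction loop asks the database process to start compacting and exits when that call doesn't return in time. A rough illustration of that pattern (hypothetical module, not the actual CouchDB source):

    %% compact_timeout_sketch.erl -- illustration only, not CouchDB code
    -module(compact_timeout_sketch).
    -export([request_compaction/1]).

    request_compaction(DbPid) ->
        %% gen_server:call/2 uses a default timeout of 5000 ms; if DbPid is
        %% blocked (e.g. on disk I/O), the caller exits with
        %% {timeout,{gen_server,call,[DbPid,start_compact]}} -- the same
        %% reason that shows up as compaction_loop_died in your log.
        gen_server:call(DbPid, start_compact).

Three such crashes in quick succession appear to exceed the supervisor's restart limit, at which point couch_secondary_services shuts down with reached_max_restart_intensity and is restarted by its parent; that would explain the brief outage you observed.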
Best
Jan
--

> On 05 Dec 2014, at 08:51, Ciprian Trusca <ctru...@totalsoft.ro> wrote:
>
> We have turned on debugging for this test and it looks like the cause of
> this error is the _replicator database.
>
> After the list of fragmented databases, we see no evidence in the log that
> compaction for this database is being started (although its fragmentation is
> above the 70% threshold), and then the compaction loop dies after
> approximately 5 seconds. So I am guessing CouchDB fails to spawn the
> compaction process.
>
> I forgot to mention in the first post that we are running CouchDB 1.6.1 on
> a CentOS 6.4 server.
>
> Thanks for your time; any help will be appreciated.
>
> -----Original Message-----
> From: Ciprian Trusca [mailto:ctru...@totalsoft.ro]
> Sent: Thursday, November 27, 2014 10:17 AM
> To: user@couchdb.apache.org
> Subject: compaction repeated timeouts causes the server to shutdown
> temporary when replication is broken
>
> Hello all,
> we have encountered the following situation during an overnight load test.
>
> We get the following message repeatedly in the couch logs:
>
>     ** Reason for termination ==
>     ** {compaction_loop_died,
>            {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
>
> At one point we get it three times within an interval of 5 seconds, and I
> am guessing this is what causes the supervisor to shut down temporarily:
>
>     [Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
>         {<0.93.0>,supervisor_report,
>             [{supervisor,{local,couch_secondary_services}},
>              {errorContext,shutdown},
>              {reason,reached_max_restart_intensity},
>              {offender,
>                  [{pid,<0.10114.14>},
>                   {name,compaction_daemon},
>                   {mfargs,{couch_compaction_daemon,start_link,[]}},
>                   {restart_type,permanent},
>                   {shutdown,brutal_kill},
>                   {child_type,worker}]}]}}
>
> In this particular component load test the CouchDB peer is shut down, so
> replication is broken. This means there are a lot of background processes
> that try to replicate and die, and there is a thread of ours that removes
> the failed replications and re-enables them (probably no longer a good idea,
> since CouchDB now detects on its own when the peer comes back online). I
> suspect this might be related.
>
> In the Zenoss graphs we see a very significant spike in I/O reads/writes at
> that moment.
>
> Thank you very much for your time, and any hint will be appreciated.