We have turned on debug logging for this test, and it looks like the cause of 
this error is the _replicator database.

After the list of fragmented databases, we see no evidence in the log that 
compaction for this database is ever started (although its fragmentation is 
reported, and it is above the 70% threshold), and then the compaction loop 
dies after approximately 5 seconds. So I am guessing CouchDB fails to spawn 
the compaction process.
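
In case it helps to reproduce, the thresholds the daemon works from live in 
local.ini. A minimal sketch of the relevant sections, based on the CouchDB 1.x 
defaults (the values here are illustrative; ours may differ):

    [compaction_daemon]
    ; how often (in seconds) the daemon scans databases for fragmentation
    check_interval = 300
    ; databases smaller than this (in bytes) are never compacted
    min_file_size = 131072

    [compactions]
    ; compact once fragmentation crosses these thresholds
    _default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]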

I forgot to mention in the first post that we are running CouchDB 1.6.1 on a 
CentOS 6.4 server.

Thanks for your time, any help will be appreciated.

-----Original Message-----
From: Ciprian Trusca [mailto:[email protected]] 
Sent: Thursday, November 27, 2014 10:17 AM
To: [email protected]
Subject: compaction repeated timeouts causes the server to shutdown temporary 
when replication is broken

Hello all,
we have encountered the following situation during an overnight load test.

We get the following message repeatedly in the couch logs:

** Reason for termination ==
** {compaction_loop_died,
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
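
As far as I understand it, the 5-second figure is just the default gen_server 
call timeout. A minimal Erlang sketch (Pid standing in for the database 
server, <0.117.0> above):

    %% gen_server:call/2 waits at most 5000 ms by default; if the db
    %% server is too busy to answer start_compact within that window,
    %% the calling process exits with {timeout,{gen_server,call,...}},
    %% which is exactly the reason logged above.
    gen_server:call(Pid, start_compact),        % implicit 5 s timeout
    gen_server:call(Pid, start_compact, 5000).  % same thing, explicit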



At one point we get it three times within an interval of 5 seconds, and I am 
guessing this is what causes the supervisor to shut down temporarily:


[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
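
If I read the OTP semantics right, reached_max_restart_intensity means the 
compaction_daemon child crashed more often than the supervisor's MaxR/MaxT 
settings allow, at which point the supervisor itself shuts down (and is then 
restarted, hence the temporary outage). An illustrative spec, with made-up 
intensity values, not necessarily what couch_secondary_services actually uses:

    %% one_for_one: restart only the crashed child; if it crashes more
    %% than MaxR times within MaxT seconds, give up and terminate the
    %% whole supervisor with reason reached_max_restart_intensity.
    init([]) ->
        MaxR = 3, MaxT = 10,   % illustrative values only
        {ok, {{one_for_one, MaxR, MaxT},
              [{compaction_daemon,
                {couch_compaction_daemon, start_link, []},
                permanent, brutal_kill, worker, [couch_compaction_daemon]}]}}.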



In this particular component load test the CouchDB peer is shut down, so 
replication is broken; this means there are a lot of background processes that 
try to replicate and die, plus a thread of ours that removes the failed 
replications and re-enables them (probably no longer a good idea, since 
CouchDB now detects on its own when the peer comes back online). I suspect 
this might be related.



In the Zenoss graphs we see a very significant spike in I/O reads/writes at 
that moment.



Thank you very much for your time, and any hint will be appreciated.
