Heya Ciprian,

This sounds like a bug. Could you file an issue at
https://issues.apache.org/jira/browse/COUCHDB ?

Best
Jan
--

> On 05 Dec 2014, at 08:51 , Ciprian Trusca <ctru...@totalsoft.ro> wrote:
> 
> We have turned on debugging for this test, and it looks like the cause of this 
> error is the _replicator database.  
> 
> After the list of fragmented databases is logged, we see no evidence in the log 
> that compaction for this database is being started, even though it is listed as 
> fragmented and its fragmentation is above the 70% threshold. The compaction 
> loop then dies after approximately 5 seconds, so I am guessing CouchDB fails 
> to spawn the compaction process.  
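> 
> As a sanity check, compaction of the _replicator database can also be triggered 
> by hand through the HTTP API; here is a minimal sketch of that, assuming a local 
> server and placeholder admin credentials (adjust for your own setup):
> 
> import requests
> 
> BASE = "http://127.0.0.1:5984"   # placeholder URL for our test server
> AUTH = ("admin", "secret")       # placeholder admin credentials
> 
> # POST /{db}/_compact starts compaction; it needs admin rights and a JSON
> # content type, and answers 202 {"ok": true} if compaction was started.
> resp = requests.post(
>     BASE + "/_replicator/_compact",
>     auth=AUTH,
>     headers={"Content-Type": "application/json"},
> )
> print(resp.status_code, resp.json())
> 
> If this call also hangs or times out, that would point at the _replicator 
> database itself rather than at the compaction daemon.  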
> 
> I forgot to mention in the first post that we are running CouchDB 1.6.1 on a 
> CentOS 6.4 server.
> 
> Thanks for your time; any help will be appreciated.
> 
> -----Original Message-----
> From: Ciprian Trusca [mailto:ctru...@totalsoft.ro] 
> Sent: Thursday, November 27, 2014 10:17 AM
> To: user@couchdb.apache.org
> Subject: repeated compaction timeouts cause the server to shut down temporarily 
> when replication is broken
> 
> Hello all,
> We have encountered the following situation during an overnight load test.
> 
> We get the following message repeatedly in the couch logs:
> 
> ** Reason for termination ==
> 
> ** {compaction_loop_died,
> 
>       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
> 
> 
> 
> At one point we get it three times within an interval of 5 seconds, and I am 
> guessing this is what causes the supervisor to shut down temporarily:
> 
> 
> [Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
>                       {<0.93.0>,supervisor_report,
>                        [{supervisor,{local,couch_secondary_services}},
>                         {errorContext,shutdown},
>                         {reason,reached_max_restart_intensity},
>                         {offender,
>                             [{pid,<0.10114.14>},
>                              {name,compaction_daemon},
>                              {mfargs,{couch_compaction_daemon,start_link,[]}},
>                              {restart_type,permanent},
>                              {shutdown,brutal_kill},
>                              {child_type,worker}]}]}}
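> 
> As far as I understand, reached_max_restart_intensity simply means the 
> compaction_daemon child crashed more often than the supervisor allows within 
> its restart window, so the whole couch_secondary_services supervisor shuts 
> down. To rule out a configuration problem, here is a small sketch (same 
> placeholder URL and credentials as above) that reads the compaction settings 
> back over the _config API:
> 
> import requests
> 
> BASE = "http://127.0.0.1:5984"   # placeholder
> AUTH = ("admin", "secret")       # placeholder admin credentials
> 
> # Print the compaction daemon settings and the compaction rules; the
> # "_default" rule in the "compactions" section is where the 70%
> # fragmentation threshold is configured.
> for section in ("compaction_daemon", "compactions"):
>     resp = requests.get(BASE + "/_config/" + section, auth=AUTH)
>     print(section, "=", resp.json())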
> 
> 
> 
> In this particular component load test the CouchDB peer is shut down, so 
> replication is broken. This means there are a lot of background processes 
> that try to replicate and die, and we have a thread that removes the failed 
> replications and re-enables them (this is probably no longer a good idea, 
> since CouchDB now detects on its own when the peer comes back online). I 
> suspect that this might be related.
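> 
> In case it helps narrow this down, here is a small sketch (placeholders as 
> above) that lists the replication documents together with the 
> _replication_state field CouchDB maintains on them, so we can see how many 
> of them sit in an error state while the peer is down:
> 
> import requests
> 
> BASE = "http://127.0.0.1:5984"   # placeholder
> AUTH = ("admin", "secret")       # placeholder admin credentials
> 
> # List all docs in the _replicator database and print the replication state
> # CouchDB has assigned to each one (e.g. "triggered", "error", "completed").
> resp = requests.get(
>     BASE + "/_replicator/_all_docs",
>     params={"include_docs": "true"},
>     auth=AUTH,
> )
> for row in resp.json()["rows"]:
>     if not row["id"].startswith("_design/"):
>         doc = row.get("doc") or {}
>         print(row["id"], doc.get("_replication_state"))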
> 
> 
> 
> In the Zenoss graphs we see a very significant spike in IO reads/writes 
> at that moment.
> 
> 
> 
> Thank you very much for your time, and any hint will be appreciated.
