CouchDB remains running on node0-couch1 and node0-couch2, but with no replication active. On node1 and node2, all CouchDB instances have crashed completely.
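(To answer the "crashing or just the replication tasks?" question on the surviving nodes, `GET /_active_tasks` shows which continuous replications are still alive. A minimal sketch of filtering that response — the helper name and the sample payload shape are mine, not captured from the cluster:)

```python
import json

# In CouchDB 0.10.x, GET /_active_tasks returns a JSON list of task
# objects; continuous replications appear with "type": "Replication".
# You'd fetch it with e.g.:
#   curl http://couch1.staging.diino.com:5984/_active_tasks
def replication_tasks(active_tasks):
    """Filter an _active_tasks response down to replication entries."""
    return [t for t in active_tasks if t.get("type") == "Replication"]

# Illustrative response (shape only, not real output from these servers):
sample = json.loads("""[
  {"type": "Replication",
   "task": "http://couch2.staging.diino.com:5984/laplace_conf_staging/ -> laplace_conf_staging",
   "status": "W Processed source update #251", "pid": "<0.2092.0>"},
  {"type": "Database Compaction",
   "task": "laplace_log_staging", "status": "...", "pid": "<0.300.0>"}
]""")

print(len(replication_tasks(sample)))  # 1 replication still running
```

An empty list on a node whose CouchDB still answers would confirm it's only the replication processes that died, not the server.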
On 5 Mar 2010, at 14:22, Robert Newson wrote:

> Is couchdb crashing or just the replication tasks?
>
> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]> wrote:
>> The amount of logged data on the six servers is vast, but this is the crash
>> message on node0-couch1. It's perhaps easier if I make the full log files
>> available (give me a shout). Here's the snippet:
>>
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                         {1267,713465,777255},
>>                         {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                          [<0.2077.0>,
>>                           {http_db,
>>                            "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>                            [{"User-Agent","CouchDB/0.10.1"},
>>                             {"Accept","application/json"},
>>                             {"Accept-Encoding","gzip"}],
>>                            [],get,nil,
>>                            [{response_format,binary},
>>                             {inactivity_timeout,30000}],
>>                            10,500,nil},
>>                          251,
>>                          [{<<"continuous">>,true},
>>                           {<<"source">>,
>>                            <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>                           {<<"target">>,
>>                            <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>                          251,<0.2093.0>,
>>                          {1267,713465,777255},
>>                          false,0,<<>>,
>>                          {<0.2095.0>,#Ref<0.0.0.131534>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>>
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                         {1267,713465,843079},
>>                         {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                          [<0.2106.0>,
>>                           {http_db,
>>                            "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>                            [{"User-Agent","CouchDB/0.10.1"},
>>                             {"Accept","application/json"},
>>                             {"Accept-Encoding","gzip"}],
>>                            [],get,nil,
>>                            [{response_format,binary},
>>                             {inactivity_timeout,30000}],
>>                            10,500,nil},
>>                          28136,
>>                          [{<<"continuous">>,true},
>>                           {<<"source">>,
>>                            <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>                           {<<"target">>,
>>                            <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>                          29086,<0.2131.0>,
>>                          {1267,713465,843079},
>>                          false,0,<<>>,
>>                          {<0.2133.0>,#Ref<0.0.5.183681>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>>
>> On 5 Mar 2010, at 13:44, Robert Newson wrote:
>>
>>> Can you include some of the log output?
>>>
>>> A coordinated failure like this points to external factors, but log
>>> output will help in any case.
>>>
>>> B.
>>>
>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]> wrote:
>>>> We have a cluster of servers. At the moment there are three servers, each
>>>> having two separate instances of CouchDB, like this:
>>>>
>>>> node0-couch1
>>>> node0-couch2
>>>>
>>>> node1-couch1
>>>> node1-couch2
>>>>
>>>> node2-couch1
>>>> node2-couch2
>>>>
>>>> All couch1 instances are set up to replicate continuously using
>>>> bidirectional pull replication. That is:
>>>>
>>>> node0-couch1 pulls from node1-couch1 and node2-couch1
>>>> node1-couch1 pulls from node0-couch1 and node2-couch1
>>>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>>>
>>>> On each node, couch1 and couch2 are also set up to replicate each other
>>>> continuously, again using pull replication. Thus, the full replication
>>>> topology is:
>>>>
>>>> node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>>>> node0-couch2 pulls from node0-couch1
>>>>
>>>> node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>>>> node1-couch2 pulls from node1-couch1
>>>>
>>>> node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>>>> node2-couch2 pulls from node2-couch1
>>>>
>>>> No proxies are involved. In our staging system, all servers are on the
>>>> same subnet.
>>>>
>>>> The problem is that every night, the entire cluster dies. All instances of
>>>> CouchDB crash, and moreover they crash exactly simultaneously.
>>>>
>>>> The data being replicated is very minimal at the moment - simple log text
>>>> lines, no attachments. The entire database being replicated is no more
>>>> than a few megabytes in size.
>>>>
>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>>> unless you are an Erlang programmer. If anyone would care to look at them,
>>>> just let me know.
>>>>
>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>
>>>> We are planning to build quite sophisticated transcluster job queue
>>>> functionality on top of CouchDB, but of course a situation like this
>>>> suggests that CouchDB replication currently is too unreliable to be of
>>>> practical use, unless this is a known bug and/or a fixed one.
>>>>
>>>> Any pointers or ideas are most welcome.
>>>>
>>>> / Peter Bengtson
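(For anyone reproducing the topology quoted above: each of those pulls corresponds to one POST to the target's `/_replicate` endpoint with `"continuous": true`. A sketch that generates the request bodies — the `example.com` hostnames and helper names are placeholders, not the real staging URLs, and note that in 0.10.x these continuous replications live only in memory, so they must be re-POSTed after any server restart or crash:)

```python
import json

NODES = ["node0", "node1", "node2"]

def url(node, instance, db):
    # Placeholder hostname scheme; substitute the real staging hostnames.
    return "http://%s-%s.example.com:5984/%s" % (node, instance, db)

def pull_jobs(db):
    """Bodies for POST /_replicate on each instance, per the topology above.

    The target is given as a bare database name, which makes each
    replication a *pull* executed by the instance receiving the POST.
    """
    jobs = {}
    for node in NODES:
        others = [n for n in NODES if n != node]
        # couch1 pulls from the other nodes' couch1 and from its local couch2
        jobs[node + "-couch1"] = [
            {"source": url(o, "couch1", db), "target": db, "continuous": True}
            for o in others
        ] + [{"source": url(node, "couch2", db), "target": db, "continuous": True}]
        # couch2 pulls only from its local couch1
        jobs[node + "-couch2"] = [
            {"source": url(node, "couch1", db), "target": db, "continuous": True}
        ]
    return jobs

jobs = pull_jobs("laplace_log_staging")
print(len(jobs["node0-couch1"]))  # 3 pull replications run on each couch1
print(json.dumps(jobs["node0-couch2"][0], sort_keys=True))
```

Each body would then be sent to its owning instance, e.g. `curl -X POST http://<instance>:5984/_replicate -d '<body>'`. The fact that every instance's replications depend on long-lived HTTP connections to its peers is consistent with the simultaneous `{error,connection_closed}` crashes above pointing at something network-wide rather than at any single node.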
