CouchDB remains running on node0-couch1 and node0-couch2, but with no replication active. On node1 and node2, all CouchDB instances have crashed completely.
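(To answer the "crashing or just the replication tasks?" question on the surviving nodes, `GET /_active_tasks` shows which continuous replications are still alive. A minimal sketch of filtering that response — the helper name and the sample payload shape are mine, not captured from the cluster:)

```python
import json

# In CouchDB 0.10.x, GET /_active_tasks returns a JSON list of task
# objects; continuous replications appear with "type": "Replication".
# You'd fetch it with e.g.:
#   curl http://couch1.staging.diino.com:5984/_active_tasks
def replication_tasks(active_tasks):
    """Filter an _active_tasks response down to replication entries."""
    return [t for t in active_tasks if t.get("type") == "Replication"]

# Illustrative response (shape only, not real output from these servers):
sample = json.loads("""[
  {"type": "Replication",
   "task": "http://couch2.staging.diino.com:5984/laplace_conf_staging/ -> laplace_conf_staging",
   "status": "W Processed source update #251", "pid": "<0.2092.0>"},
  {"type": "Database Compaction",
   "task": "laplace_log_staging", "status": "...", "pid": "<0.300.0>"}
]""")

print(len(replication_tasks(sample)))  # 1 replication still running
```

An empty list on a node whose CouchDB still answers would confirm it's only the replication processes that died, not the server.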
On 5 Mar 2010, at 14:22, Robert Newson wrote:

> Is couchdb crashing or just the replication tasks?
>
> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]> wrote:
>> The amount of logged data on the six servers is vast, but this is the crash
>> message on node0-couch1. It's perhaps easier if I make the full log files
>> available (give me a shout). Here's the snippet:
>>
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                         {1267,713465,777255},
>>                         {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                          [<0.2077.0>,
>>                           {http_db,
>>                            "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>                            [{"User-Agent","CouchDB/0.10.1"},
>>                             {"Accept","application/json"},
>>                             {"Accept-Encoding","gzip"}],
>>                            [],get,nil,
>>                            [{response_format,binary},
>>                             {inactivity_timeout,30000}],
>>                            10,500,nil},
>>                          251,
>>                          [{<<"continuous">>,true},
>>                           {<<"source">>,
>>                            <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>                           {<<"target">>,
>>                            <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>                          251,<0.2093.0>,
>>                          {1267,713465,777255},
>>                          false,0,<<>>,
>>                          {<0.2095.0>,#Ref<0.0.0.131534>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>>
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                         {1267,713465,843079},
>>                         {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                          [<0.2106.0>,
>>                           {http_db,
>>                            "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>                            [{"User-Agent","CouchDB/0.10.1"},
>>                             {"Accept","application/json"},
>>                             {"Accept-Encoding","gzip"}],
>>                            [],get,nil,
>>                            [{response_format,binary},
>>                             {inactivity_timeout,30000}],
>>                            10,500,nil},
>>                          28136,
>>                          [{<<"continuous">>,true},
>>                           {<<"source">>,
>>                            <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>                           {<<"target">>,
>>                            <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>                          29086,<0.2131.0>,
>>                          {1267,713465,843079},
>>                          false,0,<<>>,
>>                          {<0.2133.0>,#Ref<0.0.5.183681>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>>
>> On 5 Mar 2010, at 13:44, Robert Newson wrote:
>>
>>> Can you include some of the log output?
>>>
>>> A coordinated failure like this points to external factors, but log
>>> output will help in any case.
>>>
>>> B.
>>>
>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]> wrote:
>>>> We have a cluster of servers. At the moment there are three servers, each
>>>> having two separate instances of CouchDB, like this:
>>>>
>>>> node0-couch1
>>>> node0-couch2
>>>>
>>>> node1-couch1
>>>> node1-couch2
>>>>
>>>> node2-couch1
>>>> node2-couch2
>>>>
>>>> All couch1 instances are set up to replicate continuously using
>>>> bidirectional pull replication. That is:
>>>>
>>>> node0-couch1 pulls from node1-couch1 and node2-couch1
>>>> node1-couch1 pulls from node0-couch1 and node2-couch1
>>>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>>>
>>>> On each node, couch1 and couch2 are also set up to replicate each other
>>>> continuously, again using pull replication. Thus, the full replication
>>>> topology is:
>>>>
>>>> node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>>>> node0-couch2 pulls from node0-couch1
>>>>
>>>> node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>>>> node1-couch2 pulls from node1-couch1
>>>>
>>>> node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>>>> node2-couch2 pulls from node2-couch1
>>>>
>>>> No proxies are involved. In our staging system, all servers are on the
>>>> same subnet.
>>>>
>>>> The problem is that every night, the entire cluster dies. All instances of
>>>> CouchDB crash, and moreover they crash exactly simultaneously.
>>>>
>>>> The data being replicated is very minimal at the moment - simple log text
>>>> lines, no attachments. The entire database being replicated is no more
>>>> than a few megabytes in size.
>>>>
>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>>> unless you are an Erlang programmer. If anyone would care to look at them,
>>>> just let me know.
>>>>
>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>
>>>> We are planning to build quite sophisticated transcluster job queue
>>>> functionality on top of CouchDB, but of course a situation like this
>>>> suggests that CouchDB replication currently is too unreliable to be of
>>>> practical use, unless this is a known bug and/or a fixed one.
>>>>
>>>> Any pointers or ideas are most welcome.
>>>>
>>>> / Peter Bengtson
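(For anyone reproducing the topology quoted above: each of those pulls corresponds to one POST to the target's `/_replicate` endpoint with `"continuous": true`. A sketch that generates the request bodies — the `example.com` hostnames and helper names are placeholders, not the real staging URLs, and note that in 0.10.x these continuous replications live only in memory, so they must be re-POSTed after any server restart or crash:)

```python
import json

NODES = ["node0", "node1", "node2"]

def url(node, instance, db):
    # Placeholder hostname scheme; substitute the real staging hostnames.
    return "http://%s-%s.example.com:5984/%s" % (node, instance, db)

def pull_jobs(db):
    """Bodies for POST /_replicate on each instance, per the topology above.

    The target is given as a bare database name, which makes each
    replication a *pull* executed by the instance receiving the POST.
    """
    jobs = {}
    for node in NODES:
        others = [n for n in NODES if n != node]
        # couch1 pulls from the other nodes' couch1 and from its local couch2
        jobs[node + "-couch1"] = [
            {"source": url(o, "couch1", db), "target": db, "continuous": True}
            for o in others
        ] + [{"source": url(node, "couch2", db), "target": db, "continuous": True}]
        # couch2 pulls only from its local couch1
        jobs[node + "-couch2"] = [
            {"source": url(node, "couch1", db), "target": db, "continuous": True}
        ]
    return jobs

jobs = pull_jobs("laplace_log_staging")
print(len(jobs["node0-couch1"]))  # 3 pull replications run on each couch1
print(json.dumps(jobs["node0-couch2"][0], sort_keys=True))
```

Each body would then be sent to its owning instance, e.g. `curl -X POST http://<instance>:5984/_replicate -d '<body>'`. The fact that every instance's replications depend on long-lived HTTP connections to its peers is consistent with the simultaneous `{error,connection_closed}` crashes above pointing at something network-wide rather than at any single node.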
