Is couchdb crashing or just the replication tasks?
On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]> wrote:
> The amount of logged data on the six servers is vast, but this is the crash
> message on node0-couch1. It's perhaps easier if I make the full log files
> available (give me a shout). Here's the snippet:
>
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server
> <0.2092.0> terminating
> ** Last message in was {ibrowse_async_response,
>                         {1267,713465,777255},
>                         {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                          [<0.2077.0>,
>                           {http_db,
>                            "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>                            [{"User-Agent","CouchDB/0.10.1"},
>                             {"Accept","application/json"},
>                             {"Accept-Encoding","gzip"}],
>                            [],get,nil,
>                            [{response_format,binary},
>                             {inactivity_timeout,30000}],
>                            10,500,nil},
>                           251,
>                           [{<<"continuous">>,true},
>                            {<<"source">>,
>                             <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>                            {<<"target">>,
>                             <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>                          251,<0.2093.0>,
>                          {1267,713465,777255},
>                          false,0,<<>>,
>                          {<0.2095.0>,#Ref<0.0.0.131534>},
> ** Reason for termination ==
> ** {error,connection_closed}
>
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server
> <0.2130.0> terminating
> ** Last message in was {ibrowse_async_response,
>                         {1267,713465,843079},
>                         {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                          [<0.2106.0>,
>                           {http_db,
>                            "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>                            [{"User-Agent","CouchDB/0.10.1"},
>                             {"Accept","application/json"},
>                             {"Accept-Encoding","gzip"}],
>                            [],get,nil,
>                            [{response_format,binary},
>                             {inactivity_timeout,30000}],
>                            10,500,nil},
>                           28136,
>                           [{<<"continuous">>,true},
>                            {<<"source">>,
>                             <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>                            {<<"target">>,
>                             <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>                          29086,<0.2131.0>,
>                          {1267,713465,843079},
>                          false,0,<<>>,
>                          {<0.2133.0>,#Ref<0.0.5.183681>},
> ** Reason for termination ==
> ** {error,connection_closed}
>
>
> On 5 Mar 2010, at 13.44, Robert Newson wrote:
>
>> Can you include some of the log output?
>>
>> A coordinated failure like this points to external factors, but log
>> output will help in any case.
>>
>> B.
>>
>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]>
>> wrote:
>>> We have a cluster of servers. At the moment there are three servers,
>>> each having two separate instances of CouchDB, like this:
>>>
>>>     node0-couch1
>>>     node0-couch2
>>>
>>>     node1-couch1
>>>     node1-couch2
>>>
>>>     node2-couch1
>>>     node2-couch2
>>>
>>> All couch1 instances are set up to replicate continuously using
>>> bidirectional pull replication. That is:
>>>
>>>     node0-couch1 pulls from node1-couch1 and node2-couch1
>>>     node1-couch1 pulls from node0-couch1 and node2-couch1
>>>     node2-couch1 pulls from node0-couch1 and node1-couch1
>>>
>>> On each node, couch1 and couch2 are also set up to replicate with each
>>> other continuously, again using pull replication. Thus, the full
>>> replication topology is:
>>>
>>>     node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>>>     node0-couch2 pulls from node0-couch1
>>>
>>>     node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>>>     node1-couch2 pulls from node1-couch1
>>>
>>>     node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>>>     node2-couch2 pulls from node2-couch1
>>>
>>> No proxies are involved. In our staging system, all servers are on the
>>> same subnet.
>>>
>>> The problem is that every night the entire cluster dies. All instances
>>> of CouchDB crash, and moreover they crash at exactly the same moment.
>>>
>>> The data being replicated is very minimal at the moment - simple log
>>> text lines, no attachments. The entire database being replicated is no
>>> more than a few megabytes in size.
>>>
>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>> unless you are an Erlang programmer. If anyone would care to look at
>>> them, just let me know.
>>>
>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>
>>> We are planning to build quite sophisticated cross-cluster job queue
>>> functionality on top of CouchDB, but of course a situation like this
>>> suggests that CouchDB replication is currently too unreliable to be of
>>> practical use, unless this is a known and/or already fixed bug.
>>>
>>> Any pointers or ideas are most welcome.
>>>
>>> / Peter Bengtson
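
For anyone trying to reproduce the setup: the topology above implies a set of
continuous pull replications, one per arrow, each started by a POST to the
pulling instance's _replicate endpoint. The thread never shows how they were
started, so here is a minimal sketch of one of them, reusing the source and
target URLs that appear in the crash logs; the request shape matches the
0.10-era _replicate API, and Python is used purely for illustration:

    import json
    import urllib.request

    # Sketch: start node0-couch1's continuous pull of laplace_conf_staging
    # from couch2. URLs are taken from the quoted logs; everything else is
    # an assumption, not a tested recipe.
    body = json.dumps({
        "source": "http://couch2.staging.diino.com:5984/laplace_conf_staging",
        "target": "http://couch1.staging.diino.com:5984/laplace_conf_staging",
        "continuous": True,
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://couch1.staging.diino.com:5984/_replicate",  # the pulling node
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))

The other replications in the topology differ only in their source and
target URLs.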
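
On the reliability concern: in the 0.10 line a continuous replication is not
persisted, so one that dies, as in the logs above, stays down until something
re-submits it, and all of them are lost when a server restarts. Until the
nightly crash itself is explained, a common stopgap is a watchdog run from
cron that re-posts any replication missing from _active_tasks. A rough
sketch, with the caveat that matching the source URL as a substring of the
_active_tasks response is an assumption about its 0.10.1 format:

    import json
    import urllib.request

    NODE = "http://couch1.staging.diino.com:5984"
    REPLICATIONS = [  # (source, target) pairs from the quoted logs
        ("http://couch2.staging.diino.com:5984/laplace_conf_staging",
         "http://couch1.staging.diino.com:5984/laplace_conf_staging"),
        ("http://couch2.staging.diino.com:5984/laplace_log_staging",
         "http://couch1.staging.diino.com:5984/laplace_log_staging"),
    ]

    # Assumption: a running replication mentions its source URL somewhere
    # in the _active_tasks output, so a substring check is good enough.
    with urllib.request.urlopen(NODE + "/_active_tasks") as resp:
        active = resp.read().decode("utf-8")

    for source, target in REPLICATIONS:
        if source not in active:
            req = urllib.request.Request(
                NODE + "/_replicate",
                data=json.dumps({"source": source, "target": target,
                                 "continuous": True}).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req).read()

That said, six instances dying at the same moment every night supports the
external-factors point above: a nightly cron job, log rotation restarting
the daemons, or the kernel OOM killer are the usual suspects.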
