Can you include some of the log output? A coordinated failure like this points to external factors, but log output will help in any case.
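
For reference, here is a minimal Python sketch of the replication topology described in the quoted message below, in case it helps to compare it against what is actually configured on each instance. It only builds the list of pull replications and the JSON bodies one would POST to `/_replicate` with `"continuous": true`; it does not contact any server. The database name `logs`, the port, and the use of the instance names as hostnames are assumptions for illustration, not details from the original setup.

```python
import json

# Node names as given in the quoted message.
NODES = ["node0", "node1", "node2"]


def pull_sources(nodes):
    """Map each target instance to the instances it pulls from."""
    topology = {}
    for node in nodes:
        # couch1 pulls from couch1 on every other node, plus its local couch2.
        peers = [f"{other}-couch1" for other in nodes if other != node]
        topology[f"{node}-couch1"] = peers + [f"{node}-couch2"]
        # couch2 pulls only from the local couch1.
        topology[f"{node}-couch2"] = [f"{node}-couch1"]
    return topology


def replication_requests(topology, db="logs", port=5984):
    """Build (url, json_body) pairs for a continuous pull on each target.

    `db`, `port`, and the hostnames are hypothetical placeholders.
    """
    requests = []
    for target, sources in topology.items():
        for source in sources:
            body = {
                "source": f"http://{source}:{port}/{db}",  # remote side of the pull
                "target": db,                              # local database on the target
                "continuous": True,
            }
            requests.append((f"http://{target}:{port}/_replicate", json.dumps(body)))
    return requests


if __name__ == "__main__":
    for url, body in replication_requests(pull_sources(NODES)):
        print(url, body)
```

With three nodes this yields twelve pull replications in total: three per couch1 instance and one per couch2 instance.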
B.

On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]> wrote:
> We have a cluster of servers. At the moment there are three servers, each
> having two separate instances of CouchDB, like this:
>
>     node0-couch1
>     node0-couch2
>
>     node1-couch1
>     node1-couch2
>
>     node2-couch1
>     node2-couch2
>
> All couch1 instances are set up to replicate continuously using bidirectional
> pull replication. That is:
>
>     node0-couch1 pulls from node1-couch1 and node2-couch1
>     node1-couch1 pulls from node0-couch1 and node2-couch1
>     node2-couch1 pulls from node0-couch1 and node1-couch1
>
> On each node, couch1 and couch2 are set up to replicate each other
> continuously, again using pull replication. Thus, the full replication
> topology is:
>
>     node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>     node0-couch2 pulls from node0-couch1
>
>     node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>     node1-couch2 pulls from node1-couch1
>
>     node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>     node2-couch2 pulls from node2-couch1
>
> No proxies are involved. In our staging system, all servers are on the same
> subnet.
>
> The problem is that every night, the entire cluster dies. All instances of
> CouchDB crash, and moreover they crash exactly simultaneously.
>
> The data being replicated is very minimal at the moment - simple log text
> lines, no attachments. The entire database being replicated is no more than a
> few megabytes in size.
>
> The syslogs give no clue. The CouchDB logs are difficult to interpret unless
> you are an Erlang programmer. If anyone would care to look at them, just let
> me know.
>
> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>
> We are planning to build quite sophisticated transcluster job queue
> functionality on top of CouchDB, but of course a situation like this suggests
> that CouchDB replication currently is too unreliable to be of practical use,
> unless this is a known bug and/or a fixed one.
>
> Any pointers or ideas are most welcome.
>
>     / Peter Bengtson
