Can you include some of the log output?

A coordinated failure like this points to external factors, but log
output will help in any case.

B.

On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]> wrote:
> We have a cluster of servers. At the moment there are three servers, each 
> having two separate instances of CouchDB, like this:
>
>        node0-couch1
>        node0-couch2
>
>        node1-couch1
>        node1-couch2
>
>        node2-couch1
>        node2-couch2
>
> All couch1 instances are set up to replicate continuously using bidirectional 
> pull replication. That is:
>
>        node0-couch1    pulls from node1-couch1 and node2-couch1
>        node1-couch1    pulls from node0-couch1 and node2-couch1
>        node2-couch1    pulls from node0-couch1 and node1-couch1
>
> On each node, couch1 and couch2 are set up to replicate each other 
> continuously, again using pull replication. Thus, the full replication 
> topology is:
>
>        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>        node0-couch2    pulls from node0-couch1
>
>        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>        node1-couch2    pulls from node1-couch1
>
>        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>        node2-couch2    pulls from node2-couch1
>
> No proxies are involved. In our staging system, all servers are on the same 
> subnet.
>
> The problem is that every night, the entire cluster dies. All instances of 
> CouchDB crash, and moreover they crash exactly simultaneously.
>
> The data being replicated is very minimal at the moment - simple log text 
> lines, no attachments. The entire database being replicated is no more than a 
> few megabytes in size.
>
> The syslogs give no clue. The CouchDB logs are difficult to interpret unless 
> you are an Erlang programmer. If anyone would care to look at them, just let 
> me know.
>
> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>
> We are planning to build quite sophisticated transcluster job queue 
> functionality on top of CouchDB, but of course a situation like this suggests 
> that CouchDB replication currently is too unreliable to be of practical use, 
> unless this is a known bug and/or a fixed one.
>
> Any pointers or ideas are most welcome.
>
>        / Peter Bengtson
>
>
>
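
For reference, the pull topology described above can be written down as a
small Python sketch. This is only an illustration under assumptions: the
instance names are used as if they were hostnames, and the database name
"logs" and port 5984 are hypothetical, since the original message gives
neither. The request bodies follow the general shape of a continuous pull
replication request (a remote source, a local target), but nothing here is
taken from the actual cluster configuration.

```python
import json

# Node names as given in the message; three servers at the moment.
NODES = ["node0", "node1", "node2"]


def pull_sources(node, instance, nodes=NODES):
    """Return the instances that `node`-`instance` pulls from."""
    if instance == "couch1":
        # couch1 pulls from couch1 on every other node, plus its local couch2.
        peers = [f"{n}-couch1" for n in nodes if n != node]
        return peers + [f"{node}-couch2"]
    # couch2 only pulls from the couch1 instance on the same node.
    return [f"{node}-couch1"]


def replication_bodies(node, instance, db="logs", port=5984):
    """Build one continuous pull-replication request body per source.

    ASSUMPTION: `db` and `port` are hypothetical, and instance names are
    treated as hostnames purely for illustration.
    """
    return [
        {"source": f"http://{src}:{port}/{db}", "target": db, "continuous": True}
        for src in pull_sources(node, instance)
    ]


if __name__ == "__main__":
    for node in NODES:
        for inst in ("couch1", "couch2"):
            sources = ", ".join(pull_sources(node, inst))
            print(f"{node}-{inst} pulls from {sources}")
    print(json.dumps(replication_bodies("node0", "couch2"), indent=2))
```

Listing the topology this way makes it easy to check that every instance
has the sources it is supposed to have before looking at the logs.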
