The amount of logged data on the six CouchDB instances is vast, but here
is the crash message from node0-couch1. Note that both replicator
processes terminate at the same instant, each with
{error,connection_closed} from ibrowse. It may be easier if I make the
full log files available (give me a shout). Here's the snippet:
[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server
<0.2092.0> terminating
** Last message in was {ibrowse_async_response,
{1267,713465,777255},
{error,connection_closed}}
** When Server state == {state,nil,nil,
[<0.2077.0>,
{http_db,
"http://couch2.staging.diino.com:5984/laplace_conf_staging/",
[{"User-Agent","CouchDB/0.10.1"},
{"Accept","application/json"},
{"Accept-Encoding","gzip"}],
[],get,nil,
[{response_format,binary},
{inactivity_timeout,30000}],
10,500,nil},
251,
[{<<"continuous">>,true},
{<<"source">>,
<<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
{<<"target">>,
<<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
251,<0.2093.0>,
{1267,713465,777255},
false,0,<<>>,
{<0.2095.0>,#Ref<0.0.0.131534>},
** Reason for termination ==
** {error,connection_closed}
[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server
<0.2130.0> terminating
** Last message in was {ibrowse_async_response,
{1267,713465,843079},
{error,connection_closed}}
** When Server state == {state,nil,nil,
[<0.2106.0>,
{http_db,
"http://couch2.staging.diino.com:5984/laplace_log_staging/",
[{"User-Agent","CouchDB/0.10.1"},
{"Accept","application/json"},
{"Accept-Encoding","gzip"}],
[],get,nil,
[{response_format,binary},
{inactivity_timeout,30000}],
10,500,nil},
28136,
[{<<"continuous">>,true},
{<<"source">>,
<<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
{<<"target">>,
<<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
29086,<0.2131.0>,
{1267,713465,843079},
false,0,<<>>,
{<0.2133.0>,#Ref<0.0.5.183681>},
** Reason for termination ==
** {error,connection_closed}
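
For reference, the two processes above are the continuous pull
replications that couch1 runs against couch2 for laplace_conf_staging
and laplace_log_staging (the source, target and continuous fields are
visible in the state dumps). Each one is started with a plain POST to
the pulling instance's _replicate endpoint, roughly like this (a
minimal sketch, not our actual startup script):

    import json
    import urllib.request

    # Databases pulled continuously from couch2 into the local couch1
    # instance; names taken from the crash log above.
    DATABASES = ["laplace_conf_staging", "laplace_log_staging"]

    SOURCE = "http://couch2.staging.diino.com:5984"
    TARGET = "http://couch1.staging.diino.com:5984"

    def start_pull(db):
        # POST to the pulling (target) instance's _replicate endpoint;
        # with "continuous": true the replication keeps running instead
        # of stopping once the databases have caught up.
        body = json.dumps({
            "source": "%s/%s" % (SOURCE, db),
            "target": "%s/%s" % (TARGET, db),
            "continuous": True,
        }).encode("utf-8")
        req = urllib.request.Request(
            "%s/_replicate" % TARGET,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        return urllib.request.urlopen(req).read()

    for db in DATABASES:
        print(start_pull(db))
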
On 5 Mar 2010, at 13:44, Robert Newson wrote:
> Can you include some of the log output?
>
> A coordinated failure like this points to external factors, but log
> output will help in any case.
>
> B.
>
> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]>
> wrote:
>> We have a cluster of servers. At the moment there are three, each
>> running two separate instances of CouchDB, like this:
>>
>> node0-couch1
>> node0-couch2
>>
>> node1-couch1
>> node1-couch2
>>
>> node2-couch1
>> node2-couch2
>>
>> All couch1 instances are set up to replicate continuously using
>> bidirectional pull replication. That is:
>>
>> node0-couch1 pulls from node1-couch1 and node2-couch1
>> node1-couch1 pulls from node0-couch1 and node2-couch1
>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>
>> On each node, couch1 and couch2 are also set up to replicate from each
>> other continuously, again using pull replication. Thus, the full
>> replication topology is:
>>
>> node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>> node0-couch2 pulls from node0-couch1
>>
>> node1-couch1 pulls from node0-couch1, node2-couch1, and
>> node1-couch2
>> node1-couch2 pulls from node1-couch1
>>
>> node2-couch1 pulls from node0-couch1, node1-couch1, and
>> node2-couch2
>> node2-couch2 pulls from node2-couch1
>>
>> No proxies are involved. In our staging system, all servers are on the same
>> subnet.
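
To make the topology concrete: each pull listed above is one continuous
replication, started with the same _replicate POST as in the sketch
near the top of this mail. With placeholder host names (the real
staging machines differ) and one example database, the full wiring is:

    import json
    import urllib.request

    DB = "laplace_conf_staging"  # repeat for every replicated database

    # Pulling instance -> the sources it pulls from. Host names are
    # placeholders; in practice each node issues only its own calls
    # at startup.
    PULLS = {
        "http://node0-couch1:5984": ["http://node1-couch1:5984",
                                     "http://node2-couch1:5984",
                                     "http://node0-couch2:5984"],
        "http://node0-couch2:5984": ["http://node0-couch1:5984"],
        "http://node1-couch1:5984": ["http://node0-couch1:5984",
                                     "http://node2-couch1:5984",
                                     "http://node1-couch2:5984"],
        "http://node1-couch2:5984": ["http://node1-couch1:5984"],
        "http://node2-couch1:5984": ["http://node0-couch1:5984",
                                     "http://node1-couch1:5984",
                                     "http://node2-couch2:5984"],
        "http://node2-couch2:5984": ["http://node2-couch1:5984"],
    }

    for puller, sources in PULLS.items():
        for source in sources:
            body = json.dumps({"source": "%s/%s" % (source, DB),
                               "target": "%s/%s" % (puller, DB),
                               "continuous": True}).encode("utf-8")
            req = urllib.request.Request(
                "%s/_replicate" % puller, data=body,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
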
>>
>> The problem is that every night the entire cluster dies. All six
>> instances of CouchDB crash, and they do so at exactly the same moment.
>>
>> The data being replicated is minimal at the moment: simple log text
>> lines, no attachments. The entire database being replicated is no more
>> than a few megabytes.
>>
>> The syslogs give no clue. The CouchDB logs are difficult to interpret unless
>> you are an Erlang programmer. If anyone would care to look at them, just let
>> me know.
>>
>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>
>> We are planning to build quite sophisticated cross-cluster job queue
>> functionality on top of CouchDB, but a situation like this suggests
>> that CouchDB replication is currently too unreliable for practical
>> use, unless this is a known and/or already fixed bug.
>>
>> Any pointers or ideas are most welcome.
>>
>> / Peter Bengtson