From that log we can tell that CouchDB crashed completely on node0-couch2
(the "Apache CouchDB has started ..." messages mean the whole server was
restarted). The crashes indicating a timeout on couch_server:open are
troubling. I've usually only seen that when a system is heavily overloaded,
although it can also happen if you try to open a large number of
previously-unopened DBs simultaneously.
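
If it's the simultaneous-open case, pre-warming the databases one at a time
before the replications start should sidestep it. A rough, untested sketch
(the host and database names below are just the ones visible in your log):

    import urllib.request

    COUCH_URL = "http://couch2.staging.diino.com:5984"
    DB_NAMES = ["laplace_conf_staging", "laplace_log_staging"]

    for db in DB_NAMES:
        # GET /dbname returns the DB info document and forces couch_server
        # to open the database file if it is not already open.
        with urllib.request.urlopen("%s/%s" % (COUCH_URL, db)) as resp:
            print(db, resp.status)
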
Adam
On Mar 5, 2010, at 8:29 AM, Peter Bengtson wrote:
> It seems as if only the replication tasks crash; the rest of the CouchDB
> functionality still appears to be online, or alternatively is restarted
> quickly enough that it merely appears that way.
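>
> One way to tell the two cases apart (a quick, untested sketch; I'm
> assuming _active_tasks in 0.10.1 lists replications the way the docs
> describe) is to poll each instance: if the HTTP layer answers but the
> task list is empty while continuous replication is configured, only the
> replication tasks have died:
>
>     import json, urllib.request
>
>     # Instance URLs are placeholders for our six servers.
>     for host in ["http://couch1.staging.diino.com:5984",
>                  "http://couch2.staging.diino.com:5984"]:
>         try:
>             with urllib.request.urlopen(host + "/_active_tasks") as resp:
>                 tasks = json.load(resp)
>             # The node answered, so CouchDB itself is up.
>             print(host, "is up with", len(tasks), "active tasks")
>         except OSError as err:
>             print(host, "is unreachable:", err)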
>
> This is what happens on node0-couch2 at the time of the error. There
> seem to be a lot of disconnected sockets:
>
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
> {<0.63.0>,std_error,
> {mochiweb_socket_server,235,
> {child_error,{case_clause,{error,enotconn}}}}}}
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
> {<0.22982.2>,crash_report,
> [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
> {pid,<0.22982.2>},
> {registered_name,[]},
> {error_info,
> {error,
> {case_clause,{error,enotconn}},
> [{mochiweb_request,get,2},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]}},
> {ancestors,
> [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
> {messages,[]},
> {links,[<0.63.0>,#Port<0.34758>]},
> {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
> {trap_exit,false},
> {status,running},
> {heap_size,2584},
> {stack_size,24},
> {reductions,2164}],
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
> {<0.63.0>,std_error,
> {mochiweb_socket_server,235,
> {child_error,{case_clause,{error,enotconn}}}}}}
> [Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP
> request: {exit,
> {timeout,
> {gen_server,call,
> [couch_server,
> {open,<<"laplace_log_staging">>,
> [{user_ctx,
> {user_ctx,null,[<<"_admin">>]}}]}]}}}
> [Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace:
> [{gen_server,call,2},
> {couch_server,open,2},
> {couch_httpd_db,do_db_req,2},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]
> [Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP
> request: {exit,normal}
> [Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace:
> [{mochiweb_request,send,2},
> {mochiweb_request,respond,2},
> {couch_httpd,send_response,4},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]
> [Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP
> request: {exit,
> {timeout,
> {gen_server,call,
> [couch_server,
> {open,<<"laplace_log_staging">>,
> [{user_ctx,
> {user_ctx,null,[<<"_admin">>]}}]}]}}}
> [Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
>
>
>
> On 5 Mar 2010, at 14:22, Robert Newson wrote:
>
>> Is CouchDB crashing or just the replication tasks?
>>
>> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]>
>> wrote:
>>> The amount of logged data on the six servers is vast, but this is the crash
>>> message on node0-couch1. It's perhaps easier if I make the full log files
>>> available (give me a shout). Here's the snippet:
>>>
>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server
>>> <0.2092.0> terminating
>>> ** Last message in was {ibrowse_async_response,
>>> {1267,713465,777255},
>>> {error,connection_closed}}
>>> ** When Server state == {state,nil,nil,
>>> [<0.2077.0>,
>>> {http_db,
>>> "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>> [{"User-Agent","CouchDB/0.10.1"},
>>> {"Accept","application/json"},
>>> {"Accept-Encoding","gzip"}],
>>> [],get,nil,
>>> [{response_format,binary},
>>> {inactivity_timeout,30000}],
>>> 10,500,nil},
>>> 251,
>>> [{<<"continuous">>,true},
>>> {<<"source">>,
>>> <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>> {<<"target">>,
>>> <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>> 251,<0.2093.0>,
>>> {1267,713465,777255},
>>> false,0,<<>>,
>>> {<0.2095.0>,#Ref<0.0.0.131534>},
>>> ** Reason for termination ==
>>> ** {error,connection_closed}
>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server
>>> <0.2130.0> terminating
>>> ** Last message in was {ibrowse_async_response,
>>> {1267,713465,843079},
>>> {error,connection_closed}}
>>> ** When Server state == {state,nil,nil,
>>> [<0.2106.0>,
>>> {http_db,
>>> "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>> [{"User-Agent","CouchDB/0.10.1"},
>>> {"Accept","application/json"},
>>> {"Accept-Encoding","gzip"}],
>>> [],get,nil,
>>> [{response_format,binary},
>>> {inactivity_timeout,30000}],
>>> 10,500,nil},
>>> 28136,
>>> [{<<"continuous">>,true},
>>> {<<"source">>,
>>> <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>> {<<"target">>,
>>> <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>> 29086,<0.2131.0>,
>>> {1267,713465,843079},
>>> false,0,<<>>,
>>> {<0.2133.0>,#Ref<0.0.5.183681>},
>>> ** Reason for termination ==
>>> ** {error,connection_closed}
>>>
>>>
>>>
>>> On 5 Mar 2010, at 13:44, Robert Newson wrote:
>>>
>>>> Can you include some of the log output?
>>>>
>>>> A coordinated failure like this points to external factors, but log
>>>> output will help in any case.
>>>>
>>>> B.
>>>>
>>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]>
>>>> wrote:
>>>>> We have a cluster of servers. At the moment there are three servers, each
>>>>> having two separate instances of CouchDB, like this:
>>>>>
>>>>> node0-couch1
>>>>> node0-couch2
>>>>>
>>>>> node1-couch1
>>>>> node1-couch2
>>>>>
>>>>> node2-couch1
>>>>> node2-couch2
>>>>>
>>>>> All couch1 instances are set up to replicate continuously using
>>>>> bidirectional pull replication. That is:
>>>>>
>>>>> node0-couch1 pulls from node1-couch1 and node2-couch1
>>>>> node1-couch1 pulls from node0-couch1 and node2-couch1
>>>>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>>>>
>>>>> On each node, couch1 and couch2 are set up to replicate with each other
>>>>> continuously, again using pull replication. Thus, the full replication
>>>>> topology is as follows (a sketch of how each link is started follows the
>>>>> list):
>>>>>
>>>>> node0-couch1 pulls from node1-couch1, node2-couch1, and
>>>>> node0-couch2
>>>>> node0-couch2 pulls from node0-couch1
>>>>>
>>>>> node1-couch1 pulls from node0-couch1, node2-couch1, and
>>>>> node1-couch2
>>>>> node1-couch2 pulls from node1-couch1
>>>>>
>>>>> node2-couch1 pulls from node0-couch1, node1-couch1, and
>>>>> node2-couch2
>>>>> node2-couch2 pulls from node2-couch1
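>>>>>
>>>>> Each pull link is started with a POST to _replicate on the pulling
>>>>> instance, roughly like this (an untested sketch with placeholder
>>>>> URLs, not our actual deployment script):
>>>>>
>>>>>     import json, urllib.request
>>>>>
>>>>>     def start_pull(puller, source_base, db):
>>>>>         # Continuous pull replication: POST to the pulling instance,
>>>>>         # naming the remote source by URL and the local target by
>>>>>         # bare database name.
>>>>>         body = json.dumps({"source": "%s/%s" % (source_base, db),
>>>>>                            "target": db,
>>>>>                            "continuous": True}).encode("utf-8")
>>>>>         req = urllib.request.Request(
>>>>>             puller + "/_replicate", data=body,
>>>>>             headers={"Content-Type": "application/json"})
>>>>>         return urllib.request.urlopen(req)
>>>>>
>>>>>     # e.g. node0-couch2 pulling from node0-couch1:
>>>>>     start_pull("http://node0-couch2:5984",
>>>>>                "http://node0-couch1:5984", "laplace_log_staging")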
>>>>>
>>>>> No proxies are involved. In our staging system, all servers are on the
>>>>> same subnet.
>>>>>
>>>>> The problem is that every night, the entire cluster dies. All instances
>>>>> of CouchDB crash, and moreover they crash exactly simultaneously.
>>>>>
>>>>> The data being replicated is minimal at the moment - simple log text
>>>>> lines, no attachments. The entire database is no more than a few
>>>>> megabytes in size.
>>>>>
>>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>>>> unless you are an Erlang programmer. If anyone would care to look at
>>>>> them, just let me know.
>>>>>
>>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>>
>>>>> We are planning to build quite sophisticated cross-cluster job queue
>>>>> functionality on top of CouchDB, but a situation like this suggests that
>>>>> CouchDB replication is currently too unreliable for practical use, unless
>>>>> this is a known bug, or one that has already been fixed.
>>>>>
>>>>> Any pointers or ideas are most welcome.
>>>>>
>>>>> / Peter Bengtson
>>>>>
>>>>>
>>>>>
>>>
>>>
>