Adam, that's interesting. These crashes occur every night with alarming
regularity, but the staging system on which this runs is under no load to speak
of. And there are only two DBs in the system at this point, both of which
were opened at least 12 hours earlier. I'll ask our sysadmins to double-check
the load, but I'd like to know one thing:
Why do these crashes occur system-wide? On three nodes and six servers? And at
the same time? Somehow, we didn't expect CouchDB to go quite so far as to
replicate the crashes... ;-)
/ Peter
On 5 Mar 2010, at 15.57, Adam Kocoloski wrote:
> From that log we can tell that CouchDB crashed completely on node0-couch2
> (because of the "Apache CouchDB has started .." message). The crashes
> indicating a timeout on couch_server:open are troubling. I've usually only
> seen that when a system is way overloaded, although it could also happen if
> you try to open a large number of previously-unopened DBs simultaneously.
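> For illustration, a burst like the following (the database names and the
> count are hypothetical, not taken from your cluster) could trigger that
> timeout: each first request forces couch_server to open the file via a
> gen_server:call, which times out after 5 seconds by default:

```shell
# Hypothetical sketch: hit many databases this CouchDB instance has not
# opened since it booted. Each first request goes through couch_server's
# gen_server:call, whose default timeout is 5 seconds, so a large enough
# simultaneous burst can produce the timeouts seen in the log.
for i in $(seq 1 50); do
  curl -s "http://localhost:5984/db_$i" &   # db_$i is a made-up name
done
wait   # let all background requests finish
```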
>
> Adam
>
> On Mar 5, 2010, at 8:29 AM, Peter Bengtson wrote:
>
>> It seems as if only the replication tasks crash: the rest of the CouchDB
>> functionality still appears to be online, or alternatively is restarted
>> quickly enough that it appears that way.
>>
>> This is what happens on node0-couch2 at the time of the error. There
>> seem to be a lot of disconnected sockets:
>>
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
>> {<0.63.0>,std_error,
>> {mochiweb_socket_server,235,
>> {child_error,{case_clause,{error,enotconn}}}}}}
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
>> {<0.22982.2>,crash_report,
>> [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>> {pid,<0.22982.2>},
>> {registered_name,[]},
>> {error_info,
>> {error,
>> {case_clause,{error,enotconn}},
>> [{mochiweb_request,get,2},
>> {couch_httpd,handle_request,5},
>> {mochiweb_http,headers,5},
>> {proc_lib,init_p_do_apply,3}]}},
>> {ancestors,
>> [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
>> {messages,[]},
>> {links,[<0.63.0>,#Port<0.34758>]},
>> {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
>> {trap_exit,false},
>> {status,running},
>> {heap_size,2584},
>> {stack_size,24},
>> {reductions,2164}],
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
>> {<0.63.0>,std_error,
>> {mochiweb_socket_server,235,
>> {child_error,{case_clause,{error,enotconn}}}}}}
>> [Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started
>> on http://0.0.0.0:5984/
>> [Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP
>> request: {exit,
>> {timeout,
>> {gen_server,call,
>> [couch_server,
>> {open,<<"laplace_log_staging">>,
>> [{user_ctx,
>> {user_ctx,null,[<<"_admin">>]}}]}]}}}
>> [Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace:
>> [{gen_server,call,2},
>> {couch_server,open,2},
>> {couch_httpd_db,do_db_req,2},
>> {couch_httpd,handle_request,5},
>> {mochiweb_http,headers,5},
>> {proc_lib,init_p_do_apply,3}]
>> [Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started
>> on http://0.0.0.0:5984/
>> [Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP
>> request: {exit,normal}
>> [Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace:
>> [{mochiweb_request,send,2},
>> {mochiweb_request,respond,2},
>> {couch_httpd,send_response,4},
>> {couch_httpd,handle_request,5},
>> {mochiweb_http,headers,5},
>> {proc_lib,init_p_do_apply,3}]
>> [Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP
>> request: {exit,
>> {timeout,
>> {gen_server,call,
>> [couch_server,
>> {open,<<"laplace_log_staging">>,
>> [{user_ctx,
>> {user_ctx,null,[<<"_admin">>]}}]}]}}}
>> [Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started
>> on http://0.0.0.0:5984/
>>
>>
>>
>> On 5 Mar 2010, at 14.22, Robert Newson wrote:
>>
>>> Is couchdb crashing or just the replication tasks?
>>>
>>> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]>
>>> wrote:
>>>> The amount of logged data on the six servers is vast, but this is the
>>>> crash message on node0-couch1. It's perhaps easier if I make the full log
>>>> files available (give me a shout). Here's the snippet:
>>>>
>>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server
>>>> <0.2092.0> terminating
>>>> ** Last message in was {ibrowse_async_response,
>>>> {1267,713465,777255},
>>>> {error,connection_closed}}
>>>> ** When Server state == {state,nil,nil,
>>>> [<0.2077.0>,
>>>> {http_db,
>>>>
>>>> "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>>> [{"User-Agent","CouchDB/0.10.1"},
>>>> {"Accept","application/json"},
>>>> {"Accept-Encoding","gzip"}],
>>>> [],get,nil,
>>>> [{response_format,binary},
>>>> {inactivity_timeout,30000}],
>>>> 10,500,nil},
>>>> 251,
>>>> [{<<"continuous">>,true},
>>>> {<<"source">>,
>>>>
>>>> <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>>> {<<"target">>,
>>>>
>>>> <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>>> 251,<0.2093.0>,
>>>> {1267,713465,777255},
>>>> false,0,<<>>,
>>>> {<0.2095.0>,#Ref<0.0.0.131534>},
>>>> ** Reason for termination ==
>>>> ** {error,connection_closed}
>>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server
>>>> <0.2130.0> terminating
>>>> ** Last message in was {ibrowse_async_response,
>>>> {1267,713465,843079},
>>>> {error,connection_closed}}
>>>> ** When Server state == {state,nil,nil,
>>>> [<0.2106.0>,
>>>> {http_db,
>>>>
>>>> "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>>> [{"User-Agent","CouchDB/0.10.1"},
>>>> {"Accept","application/json"},
>>>> {"Accept-Encoding","gzip"}],
>>>> [],get,nil,
>>>> [{response_format,binary},
>>>> {inactivity_timeout,30000}],
>>>> 10,500,nil},
>>>> 28136,
>>>> [{<<"continuous">>,true},
>>>> {<<"source">>,
>>>>
>>>> <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>>> {<<"target">>,
>>>>
>>>> <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>>> 29086,<0.2131.0>,
>>>> {1267,713465,843079},
>>>> false,0,<<>>,
>>>> {<0.2133.0>,#Ref<0.0.5.183681>},
>>>> ** Reason for termination ==
>>>> ** {error,connection_closed}
>>>>
>>>>
>>>>
>>>> On 5 Mar 2010, at 13.44, Robert Newson wrote:
>>>>
>>>>> Can you include some of the log output?
>>>>>
>>>>> A coordinated failure like this points to external factors but log
>>>>> output will help in any case.
>>>>>
>>>>> B.
>>>>>
>>>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]>
>>>>> wrote:
>>>>>> We have a cluster of servers. At the moment there are three servers,
>>>>>> each having two separate instances of CouchDB, like this:
>>>>>>
>>>>>> node0-couch1
>>>>>> node0-couch2
>>>>>>
>>>>>> node1-couch1
>>>>>> node1-couch2
>>>>>>
>>>>>> node2-couch1
>>>>>> node2-couch2
>>>>>>
>>>>>> All couch1 instances are set up to replicate continuously using
>>>>>> bidirectional pull replication. That is:
>>>>>>
>>>>>> node0-couch1 pulls from node1-couch1 and node2-couch1
>>>>>> node1-couch1 pulls from node0-couch1 and node2-couch1
>>>>>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>>>>>
>>>>>> On each node, couch1 and couch2 are set up to replicate each other
>>>>>> continuously, again using pull replication. Thus, the full replication
>>>>>> topology is:
>>>>>>
>>>>>> node0-couch1 pulls from node1-couch1, node2-couch1, and
>>>>>> node0-couch2
>>>>>> node0-couch2 pulls from node0-couch1
>>>>>>
>>>>>> node1-couch1 pulls from node0-couch1, node2-couch1, and
>>>>>> node1-couch2
>>>>>> node1-couch2 pulls from node1-couch1
>>>>>>
>>>>>> node2-couch1 pulls from node0-couch1, node1-couch1, and
>>>>>> node2-couch2
>>>>>> node2-couch2 pulls from node2-couch1
>>>>>>
>>>>>> No proxies are involved. In our staging system, all servers are on the
>>>>>> same subnet.
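>>>>>> Each of these pull replications is started in the usual way, by POSTing
>>>>>> a replication document to the pulling instance's _replicate endpoint. A
>>>>>> sketch (the host name is illustrative, not our exact script) for
>>>>>> node0-couch2 pulling from node0-couch1:

```shell
# Illustrative sketch of starting one continuous pull replication
# (node0-couch2 pulling laplace_log_staging from node0-couch1).
# The source host name here is hypothetical.
BODY='{"source": "http://node0-couch1.example:5984/laplace_log_staging",
       "target": "laplace_log_staging",
       "continuous": true}'
curl -s -X POST http://localhost:5984/_replicate \
     -H 'Content-Type: application/json' \
     -d "$BODY"
```

>>>>>> Note that in 0.10.x a continuous replication started this way is not
>>>>>> persistent: after a server restart it has to be re-POSTed.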
>>>>>>
>>>>>> The problem is that every night, the entire cluster dies. All instances
>>>>>> of CouchDB crash, and moreover they crash exactly simultaneously.
>>>>>>
>>>>>> The data being replicated is very minimal at the moment - simple log
>>>>>> text lines, no attachments. The entire database being replicated is no
>>>>>> more than a few megabytes in size.
>>>>>>
>>>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>>>>> unless you are an Erlang programmer. If anyone would care to look at
>>>>>> them, just let me know.
>>>>>>
>>>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>>>
>>>>>> We are planning to build quite sophisticated cross-cluster job queue
>>>>>> functionality on top of CouchDB, but of course a situation like this
>>>>>> suggests that CouchDB replication is currently too unreliable to be of
>>>>>> practical use, unless this is a known and/or already-fixed bug.
>>>>>>
>>>>>> Any pointers or ideas are most welcome.
>>>>>>
>>>>>> / Peter Bengtson