From that log we can tell that CouchDB crashed completely on node0-couch2
(the "Apache CouchDB has started ..." messages mean the whole server was
restarted). The crashes indicating a timeout on couch_server:open are
troubling. I've usually only seen that when a system is heavily overloaded,
although it can also happen if you try to open a large number of
previously-unopened DBs simultaneously.
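
If it's the simultaneous-open case, pre-warming the databases one at a time
before the replications start should sidestep it. A rough, untested sketch
(the host and database names below are just the ones visible in your log):

    import urllib.request

    COUCH_URL = "http://couch2.staging.diino.com:5984"
    DB_NAMES = ["laplace_conf_staging", "laplace_log_staging"]

    for db in DB_NAMES:
        # GET /dbname returns the DB info document and forces couch_server
        # to open the database file if it is not already open.
        with urllib.request.urlopen("%s/%s" % (COUCH_URL, db)) as resp:
            print(db, resp.status)
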
Adam
On Mar 5, 2010, at 8:29 AM, Peter Bengtson wrote:
> It seems as if only the replication tasks crash; the rest of the CouchDB
> functionality still appears to be online, or alternatively is restarted
> quickly enough that it merely appears that way.
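>
> One way to tell the two cases apart (a quick, untested sketch; I'm
> assuming _active_tasks in 0.10.1 lists replications the way the docs
> describe) is to poll each instance: if the HTTP layer answers but the
> task list is empty while continuous replication is configured, only the
> replication tasks have died:
>
>     import json, urllib.request
>
>     # Instance URLs are placeholders for our six servers.
>     for host in ["http://couch1.staging.diino.com:5984",
>                  "http://couch2.staging.diino.com:5984"]:
>         try:
>             with urllib.request.urlopen(host + "/_active_tasks") as resp:
>                 tasks = json.load(resp)
>             # The node answered, so CouchDB itself is up.
>             print(host, "is up with", len(tasks), "active tasks")
>         except OSError as err:
>             print(host, "is unreachable:", err)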
>
> This is what happens on node0-couch2 at the time of the error. There
> seem to be a lot of disconnected sockets:
>
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
> {<0.63.0>,std_error,
> {mochiweb_socket_server,235,
> {child_error,{case_clause,{error,enotconn}}}}}}
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
> {<0.22982.2>,crash_report,
> [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
> {pid,<0.22982.2>},
> {registered_name,[]},
> {error_info,
> {error,
> {case_clause,{error,enotconn}},
> [{mochiweb_request,get,2},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]}},
> {ancestors,
> [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
> {messages,[]},
> {links,[<0.63.0>,#Port<0.34758>]},
> {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
> {trap_exit,false},
> {status,running},
> {heap_size,2584},
> {stack_size,24},
> {reductions,2164}],
> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
> {<0.63.0>,std_error,
> {mochiweb_socket_server,235,
> {child_error,{case_clause,{error,enotconn}}}}}}
> [Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP
> request: {exit,
> {timeout,
> {gen_server,call,
> [couch_server,
> {open,<<"laplace_log_staging">>,
> [{user_ctx,
> {user_ctx,null,[<<"_admin">>]}}]}]}}}
> [Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace:
> [{gen_server,call,2},
> {couch_server,open,2},
> {couch_httpd_db,do_db_req,2},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]
> [Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP
> request: {exit,normal}
> [Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace:
> [{mochiweb_request,send,2},
> {mochiweb_request,respond,2},
> {couch_httpd,send_response,4},
> {couch_httpd,handle_request,5},
> {mochiweb_http,headers,5},
> {proc_lib,init_p_do_apply,3}]
> [Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP
> request: {exit,
> {timeout,
> {gen_server,call,
> [couch_server,
> {open,<<"laplace_log_staging">>,
> [{user_ctx,
> {user_ctx,null,[<<"_admin">>]}}]}]}}}
> [Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
>
>
>
> On 5 Mar 2010, at 14:22, Robert Newson wrote:
>
>> Is CouchDB crashing or just the replication tasks?
>>
>> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]>
>> wrote:
>>> The amount of logged data on the six servers is vast, but this is the crash
>>> message on node0-couch1. It's perhaps easier if I make the full log files
>>> available (give me a shout). Here's the snippet:
>>>
>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server
>>> <0.2092.0> terminating
>>> ** Last message in was {ibrowse_async_response,
>>> {1267,713465,777255},
>>> {error,connection_closed}}
>>> ** When Server state == {state,nil,nil,
>>> [<0.2077.0>,
>>> {http_db,
>>> "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>> [{"User-Agent","CouchDB/0.10.1"},
>>> {"Accept","application/json"},
>>> {"Accept-Encoding","gzip"}],
>>> [],get,nil,
>>> [{response_format,binary},
>>> {inactivity_timeout,30000}],
>>> 10,500,nil},
>>> 251,
>>> [{<<"continuous">>,true},
>>> {<<"source">>,
>>> <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>> {<<"target">>,
>>> <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>> 251,<0.2093.0>,
>>> {1267,713465,777255},
>>> false,0,<<>>,
>>> {<0.2095.0>,#Ref<0.0.0.131534>},
>>> ** Reason for termination ==
>>> ** {error,connection_closed}
>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server
>>> <0.2130.0> terminating
>>> ** Last message in was {ibrowse_async_response,
>>> {1267,713465,843079},
>>> {error,connection_closed}}
>>> ** When Server state == {state,nil,nil,
>>> [<0.2106.0>,
>>> {http_db,
>>> "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>> [{"User-Agent","CouchDB/0.10.1"},
>>> {"Accept","application/json"},
>>> {"Accept-Encoding","gzip"}],
>>> [],get,nil,
>>> [{response_format,binary},
>>> {inactivity_timeout,30000}],
>>> 10,500,nil},
>>> 28136,
>>> [{<<"continuous">>,true},
>>> {<<"source">>,
>>> <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>> {<<"target">>,
>>> <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>> 29086,<0.2131.0>,
>>> {1267,713465,843079},
>>> false,0,<<>>,
>>> {<0.2133.0>,#Ref<0.0.5.183681>},
>>> ** Reason for termination ==
>>> ** {error,connection_closed}
>>>
>>>
>>>
>>> On 5 Mar 2010, at 13:44, Robert Newson wrote:
>>>
>>>> Can you include some of the log output?
>>>>
>>>> A coordinated failure like this points to external factors, but log
>>>> output will help in any case.
>>>>
>>>> B.
>>>>
>>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]>
>>>> wrote:
>>>>> We have a cluster of servers. At the moment there are three servers, each
>>>>> having two separate instances of CouchDB, like this:
>>>>>
>>>>> node0-couch1
>>>>> node0-couch2
>>>>>
>>>>> node1-couch1
>>>>> node1-couch2
>>>>>
>>>>> node2-couch1
>>>>> node2-couch2
>>>>>
>>>>> All couch1 instances are set up to replicate continuously using
>>>>> bidirectional pull replication. That is:
>>>>>
>>>>> node0-couch1 pulls from node1-couch1 and node2-couch1
>>>>> node1-couch1 pulls from node0-couch1 and node2-couch1
>>>>> node2-couch1 pulls from node0-couch1 and node1-couch1
>>>>>
>>>>> On each node, couch1 and couch2 are set up to replicate with each other
>>>>> continuously, again using pull replication. Thus, the full replication
>>>>> topology is as follows (a sketch of how each link is started follows the
>>>>> list):
>>>>>
>>>>> node0-couch1 pulls from node1-couch1, node2-couch1, and
>>>>> node0-couch2
>>>>> node0-couch2 pulls from node0-couch1
>>>>>
>>>>> node1-couch1 pulls from node0-couch1, node2-couch1, and
>>>>> node1-couch2
>>>>> node1-couch2 pulls from node1-couch1
>>>>>
>>>>> node2-couch1 pulls from node0-couch1, node1-couch1, and
>>>>> node2-couch2
>>>>> node2-couch2 pulls from node2-couch1
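>>>>>
>>>>> Each pull link is started with a POST to _replicate on the pulling
>>>>> instance, roughly like this (an untested sketch with placeholder
>>>>> URLs, not our actual deployment script):
>>>>>
>>>>>     import json, urllib.request
>>>>>
>>>>>     def start_pull(puller, source_base, db):
>>>>>         # Continuous pull replication: POST to the pulling instance,
>>>>>         # naming the remote source by URL and the local target by
>>>>>         # bare database name.
>>>>>         body = json.dumps({"source": "%s/%s" % (source_base, db),
>>>>>                            "target": db,
>>>>>                            "continuous": True}).encode("utf-8")
>>>>>         req = urllib.request.Request(
>>>>>             puller + "/_replicate", data=body,
>>>>>             headers={"Content-Type": "application/json"})
>>>>>         return urllib.request.urlopen(req)
>>>>>
>>>>>     # e.g. node0-couch2 pulling from node0-couch1:
>>>>>     start_pull("http://node0-couch2:5984",
>>>>>                "http://node0-couch1:5984", "laplace_log_staging")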
>>>>>
>>>>> No proxies are involved. In our staging system, all servers are on the
>>>>> same subnet.
>>>>>
>>>>> The problem is that every night, the entire cluster dies. All instances
>>>>> of CouchDB crash, and moreover they crash exactly simultaneously.
>>>>>
>>>>> The data being replicated is minimal at the moment - simple log text
>>>>> lines, no attachments. The entire database is no more than a few
>>>>> megabytes in size.
>>>>>
>>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret
>>>>> unless you are an Erlang programmer. If anyone would care to look at
>>>>> them, just let me know.
>>>>>
>>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>>
>>>>> We are planning to build quite sophisticated cross-cluster job queue
>>>>> functionality on top of CouchDB, but a situation like this suggests that
>>>>> CouchDB replication is currently too unreliable for practical use, unless
>>>>> this is a known bug, or one that has already been fixed.
>>>>>
>>>>> Any pointers or ideas are most welcome.
>>>>>
>>>>> / Peter Bengtson
>>>>>
>>>>>
>>>>>
>>>
>>>
>