It seems that only the replication tasks crash: the rest of CouchDB still 
appears to be online, or alternatively is restarted quickly enough to look 
that way.

This is what happens on node0-couch2 at the time of the error. There seem 
to be a lot of disconnected sockets:

[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
    {<0.63.0>,std_error,
     {mochiweb_socket_server,235,
         {child_error,{case_clause,{error,enotconn}}}}}}
[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
    {<0.22982.2>,crash_report,
     [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
       {pid,<0.22982.2>},
       {registered_name,[]},
       {error_info,
           {error,
               {case_clause,{error,enotconn}},
               [{mochiweb_request,get,2},
                {couch_httpd,handle_request,5},
                {mochiweb_http,headers,5},
                {proc_lib,init_p_do_apply,3}]}},
       {ancestors,
           [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
       {messages,[]},
       {links,[<0.63.0>,#Port<0.34758>]},
       {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
       {trap_exit,false},
       {status,running},
       {heap_size,2584},
       {stack_size,24},
       {reductions,2164}],
[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
    {<0.63.0>,std_error,
     {mochiweb_socket_server,235,
         {child_error,{case_clause,{error,enotconn}}}}}}
[Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started on 
http://0.0.0.0:5984/
[Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP 
request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   [couch_server,
                                    {open,<<"laplace_log_staging">>,
                                     [{user_ctx,
                                       {user_ctx,null,[<<"_admin">>]}}]}]}}}
[Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace: 
[{gen_server,call,2},
             {couch_server,open,2},
             {couch_httpd_db,do_db_req,2},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]
[Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started on 
http://0.0.0.0:5984/
[Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP 
request: {exit,normal}
[Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace: 
[{mochiweb_request,send,2},
             {mochiweb_request,respond,2},
             {couch_httpd,send_response,4},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]
[Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP 
request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   [couch_server,
                                    {open,<<"laplace_log_staging">>,
                                     [{user_ctx,
                                       {user_ctx,null,[<<"_admin">>]}}]}]}}}
[Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started on 
http://0.0.0.0:5984/
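Since continuous replications started via POST /_replicate in 0.10.x are not persistent, any replication task that crashes (or a node that restarts) silently loses its replications. One possible stopgap until the crashes themselves are understood is an external watchdog that re-posts any wanted replication that has disappeared from _active_tasks. A rough Python sketch only, not something from our logs: the substring matching against the "task" field is a guess at the 0.10.x task-description format, and the URLs are illustrative:

```python
import json
import urllib.request


def replication_doc(source, target):
    """Body for POST /_replicate to (re)start a continuous pull replication."""
    return {"source": source, "target": target, "continuous": True}


def missing_replications(active_tasks, wanted):
    """Return the (source, target) pairs in `wanted` that have no matching
    entry in the JSON from GET /_active_tasks.

    0.10.x reports replications as free-form text in the "task" field, so
    matching on substrings is an assumption, not a documented contract.
    """
    descs = [t.get("task", "") for t in active_tasks
             if t.get("type") == "Replication"]
    return [(src, tgt) for src, tgt in wanted
            if not any(src in d and tgt in d for d in descs)]


def restart_replication(couch_url, source, target):
    """Re-post one continuous pull replication to the local node."""
    body = json.dumps(replication_doc(source, target)).encode()
    req = urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate", data=body,
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```

A cron job per node would fetch /_active_tasks, run missing_replications against that node's wanted list (e.g. node0-couch1 pulling from node1-couch1, node2-couch1, and node0-couch2), and call restart_replication for each pair that has gone missing.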



On 5 mar 2010, at 14.22, Robert Newson wrote:

> Is couchdb crashing or just the replication tasks?
> 
> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <[email protected]> 
> wrote:
>> The amount of logged data on the six servers is vast, but this is the crash 
>> message on node0-couch1. It's perhaps easier if I make the full log files 
>> available (give me a shout). Here's the snippet:
>> 
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server 
>> <0.2092.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                           {1267,713465,777255},
>>                           {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                            [<0.2077.0>,
>>                             {http_db,
>>                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>                                 [{"User-Agent","CouchDB/0.10.1"},
>>                                  {"Accept","application/json"},
>>                                  {"Accept-Encoding","gzip"}],
>>                                 [],get,nil,
>>                                 [{response_format,binary},
>>                                  {inactivity_timeout,30000}],
>>                                 10,500,nil},
>>                             251,
>>                             [{<<"continuous">>,true},
>>                              {<<"source">>,
>>                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>                              {<<"target">>,
>>                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>                            251,<0.2093.0>,
>>                            {1267,713465,777255},
>>                            false,0,<<>>,
>>                            {<0.2095.0>,#Ref<0.0.0.131534>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server 
>> <0.2130.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                           {1267,713465,843079},
>>                           {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                            [<0.2106.0>,
>>                             {http_db,
>>                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>                                 [{"User-Agent","CouchDB/0.10.1"},
>>                                  {"Accept","application/json"},
>>                                  {"Accept-Encoding","gzip"}],
>>                                 [],get,nil,
>>                                 [{response_format,binary},
>>                                  {inactivity_timeout,30000}],
>>                                 10,500,nil},
>>                             28136,
>>                             [{<<"continuous">>,true},
>>                              {<<"source">>,
>>                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>                              {<<"target">>,
>>                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>                            29086,<0.2131.0>,
>>                            {1267,713465,843079},
>>                            false,0,<<>>,
>>                            {<0.2133.0>,#Ref<0.0.5.183681>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>> 
>> 
>> 
>> On 5 mar 2010, at 13.44, Robert Newson wrote:
>> 
>>> Can you include some of the log output?
>>> 
>>> A coordinated failure like this points to external factors but log
>>> output will help in any case.
>>> 
>>> B.
>>> 
>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <[email protected]> 
>>> wrote:
>>>> We have a cluster of servers. At the moment there are three servers, each 
>>>> having two separate instances of CouchDB, like this:
>>>> 
>>>>        node0-couch1
>>>>        node0-couch2
>>>> 
>>>>        node1-couch1
>>>>        node1-couch2
>>>> 
>>>>        node2-couch1
>>>>        node2-couch2
>>>> 
>>>> All couch1 instances are set up to replicate continuously using 
>>>> bidirectional pull replication. That is:
>>>> 
>>>>        node0-couch1    pulls from node1-couch1 and node2-couch1
>>>>        node1-couch1    pulls from node0-couch1 and node2-couch1
>>>>        node2-couch1    pulls from node0-couch1 and node1-couch1
>>>> 
>>>> On each node, couch1 and couch2 are set up to replicate each other 
>>>> continuously, again using pull replication. Thus, the full replication 
>>>> topology is:
>>>> 
>>>>        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>>>>        node0-couch2    pulls from node0-couch1
>>>> 
>>>>        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>>>>        node1-couch2    pulls from node1-couch1
>>>> 
>>>>        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>>>>        node2-couch2    pulls from node2-couch1
>>>> 
>>>> No proxies are involved. In our staging system, all servers are on the 
>>>> same subnet.
>>>> 
>>>> The problem is that every night, the entire cluster dies. All instances of 
>>>> CouchDB crash, and moreover they crash exactly simultaneously.
>>>> 
>>>> The data being replicated is very minimal at the moment - simple log text 
>>>> lines, no attachments. The entire database being replicated is no more 
>>>> than a few megabytes in size.
>>>> 
>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret 
>>>> unless you are an Erlang programmer. If anyone would care to look at them, 
>>>> just let me know.
>>>> 
>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>> 
>>>> We are planning to build quite sophisticated trans-cluster job queue 
>>>> functionality on top of CouchDB, but a situation like this suggests that 
>>>> CouchDB replication is currently too unreliable for practical use, unless 
>>>> this is a known and/or already fixed bug.
>>>> 
>>>> Any pointers or ideas are most welcome.
>>>> 
>>>>        / Peter Bengtson
>>>> 
>>>> 
>>>> 
>> 
>> 
