I've finally managed to reproduce this problem in a test environment, and filed a bug report:
https://issues.apache.org/jira/browse/COUCHDB-1100

Best regards,
Pasi

On Sat, Feb 26, 2011 at 11:19, Pasi Eronen <[email protected]> wrote:
> BTW, it seems just changing "ulimit -n" did not work. At some point,
> the server just stopped responding completely (even the root URL just
> times out, and there's nothing in the error log).
>
> A quick googling found references to enabling kernel poll (at build time
> and on the erl command line), and increasing the ERL_MAX_PORTS
> environment variable. Based on a quick look it seems kernel poll
> was already enabled, but I'll experiment with ERL_MAX_PORTS...
>
> Best regards,
> Pasi
>
> On Sat, Feb 26, 2011 at 03:40, Paul J. Davis <[email protected]> wrote:
>
>> On Feb 25, 2011, at 7:31 PM, Pasi Eronen <[email protected]> wrote:
>>
>> Yes, the client is doing something funky :-) Many of the requests
>> it sends are POSTs with chunked transfer encoding, so it's being
>> hit hard by issue COUCHDB-682. In other words, there's a lot of closing
>> and opening of TCP connections going on, and retrying of requests that
>> failed due to COUCHDB-682.
>>
>> (The server behavior in COUCHDB-682 is probably, strictly speaking, not
>> violating the HTTP spec -- the server is allowed to close an idle
>> persistent connection at any time -- but it's certainly causing an
>> unusual amount of opening and closing of TCP connections.)
>>
>> Output from netstat -anp looks like this:
>>
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:51681  CLOSE_WAIT  19124/beam.smp
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:36576  CLOSE_WAIT  19124/beam.smp
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:35311  CLOSE_WAIT  19124/beam.smp
>> (...)
>>
>> The other directions (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984) are no
>> longer around (if they were, I guess the client would also run out of
>> fds -- and the number of fds used by the client is not growing).
>>
>> Does this offer any clues for diagnosing the bug?
>>
>> It's definitely pointing the finger at CouchDB/Erlang.
>>
>> I can also try to see if I can reproduce the fd leak with some simple
>> test program...
>>
>> Please do. I'd start by trying to mimic your script's main request
>> pattern to see if you can trigger individual sockets to end up in
>> CLOSE_WAIT.
>>
>> Best regards,
>> Pasi
>>
>> On Sat, Feb 26, 2011 at 01:11, Paul Davis <[email protected]> wrote:
>> > It sounds like your client is doing something funky with shutting down
>> > connections to the server. If I were to guess, I would say that it
>> > looks like you're leaking sockets between a call to shutdown and a
>> > call to close. Though to say for certain, I'd need to see what state
>> > both ends of the socket are in.
>> >
>> > The errors about pread_iolist are hard to diagnose without the
>> > surrounding error messages, but my guess is that you're getting an
>> > emfile that cascades to the file gen_server, which then notifies the
>> > other callers trying to issue read calls.
>> >
>> > On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <[email protected]> wrote:
>> >> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
>> >> is indeed using a lot of file descriptors, and the number is growing.
>> >> I restarted CouchDB (and my batch job), and after 40 minutes it was
>> >> already using 300 fds; an hour later, the figure was 600; half
>> >> an hour later, almost 700.
>> >>
>> >> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
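
A minimal sketch of this kind of monitoring, in Python, assuming Linux /proc,
CouchDB's default port 5984, and that the beam.smp pid is given on the command
line (these specifics are assumptions, not details from the messages above):

#!/usr/bin/env python
# Rough monitoring sketch: count open fds held by the beam.smp process
# and CLOSE_WAIT sockets whose local port is CouchDB's (assumed default 5984).
# Needs permission to read /proc/<pid>/fd (same user as CouchDB, or root).
import os
import sys
import time

COUCH_PORT = 5984  # adjust if CouchDB listens elsewhere

def open_fds(pid):
    """Number of file descriptors the process currently holds."""
    return len(os.listdir("/proc/%d/fd" % pid))

def close_wait_count(port):
    """CLOSE_WAIT entries in /proc/net/tcp with the given local port."""
    count = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            state = fields[3]  # "08" means CLOSE_WAIT
            if local_port == port and state == "08":
                count += 1
    return count

if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of beam.smp
    while True:
        print("fds=%d close_wait=%d" % (open_fds(pid), close_wait_count(COUCH_PORT)))
        time.sleep(60)
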
>> >> As a temporary fix I changed the limit to 64K, so it should work fine
>> >> for the weekend :-) (but not much longer).
>> >>
>> >> Based on "netstat -an", the fds are TCP connections stuck in the
>> >> CLOSE_WAIT state. Hmm. It seems CLOSE_WAIT means the client closed the
>> >> connection, but CouchDB didn't call close() to release the fd? And
>> >> that's presumably a bug somewhere in CouchDB, right?
>> >>
>> >> After the initial error message, the log had ~300 almost identical
>> >> error messages of the form (with a different number after pread_iolist):
>> >>
>> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
>> >> Uncaught error in HTTP request: {exit,
>> >>     {normal,
>> >>      {gen_server,call,
>> >>       [<0.288.0>,
>> >>        {pread_iolist,812881959},
>> >>        infinity]}}}
>> >>
>> >> Does this give any hints about where the bug might be?
>> >>
>> >> If not, I'll see if I can reproduce the bug somehow in a more
>> >> controlled setting. (I have a hunch that this might involve the use
>> >> of HTTP expect/continue or POST requests with chunked transfer
>> >> encoding, but let's see....)
>> >>
>> >> Best regards,
>> >> Pasi
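
A rough sketch of the kind of "simple test program" being discussed here: a
Python client that sends a chunked POST over a fresh connection and then
drops the connection, in a loop. The database name, endpoint, and document
body are placeholders; only the host and port (127.0.0.1:5984) match the
thread, and whether this pattern actually triggers the leak is exactly what
would need testing:

#!/usr/bin/env python
# Placeholder reproduction attempt: send POSTs with chunked transfer
# encoding over fresh TCP connections that the client closes right after
# reading (part of) the response, then check the server for CLOSE_WAIT
# sockets with netstat or the monitoring script above.
import socket

HOST, PORT = "127.0.0.1", 5984
BODY = b'{"docs": [{"value": 1}]}'  # made-up payload for _bulk_docs

def chunked_post(path, body):
    s = socket.create_connection((HOST, PORT))
    try:
        headers = (
            "POST %s HTTP/1.1\r\n"
            "Host: %s:%d\r\n"
            "Content-Type: application/json\r\n"
            "Transfer-Encoding: chunked\r\n"
            "\r\n" % (path, HOST, PORT)
        ).encode("ascii")
        # One data chunk followed by the zero-length terminating chunk.
        chunked = ("%x\r\n" % len(body)).encode("ascii") + body + b"\r\n0\r\n\r\n"
        s.sendall(headers + chunked)
        s.settimeout(5)
        try:
            s.recv(4096)  # read some of the response, then just drop the connection
        except socket.timeout:
            pass
    finally:
        s.close()

if __name__ == "__main__":
    # "testdb" is a placeholder database; an "Expect: 100-continue" header
    # could be added above to exercise that code path as well.
    for _ in range(1000):
        chunked_post("/testdb/_bulk_docs", BODY)
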
>> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis <[email protected]> wrote:
>> >>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <[email protected]> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I had a big batch job (inserting 10M+ documents and generating views
>> >>>> for them) that ran just fine for about 6 hours, but then I got this
>> >>>> error:
>> >>>>
>> >>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>> >>>> <0.276.0> terminating
>> >>>> ** Last message in was delayed_commit
>> >>>> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>> >>>>     <0.273.0>,<0.277.0>,
>> >>>>     {db_header,5,739828,0,
>> >>>>         {4778613011,{663866,0}},
>> >>>>         {4778614954,663866},
>> >>>>         nil,0,nil,nil,1000},
>> >>>>     739828,
>> >>>>     {btree,<0.273.0>,
>> >>>>         {4778772755,{663866,0}},
>> >>>>         #Fun<couch_db_updater.7.10053969>,
>> >>>>         #Fun<couch_db_updater.8.35220795>,
>> >>>>         #Fun<couch_btree.5.124754102>,
>> >>>>         #Fun<couch_db_updater.9.107593676>},
>> >>>>     {btree,<0.273.0>,
>> >>>>         {4778774698,663866},
>> >>>>         #Fun<couch_db_updater.10.30996817>,
>> >>>>         #Fun<couch_db_updater.11.96515267>,
>> >>>>         #Fun<couch_btree.5.124754102>,
>> >>>>         #Fun<couch_db_updater.12.117826253>},
>> >>>>     {btree,<0.273.0>,nil,
>> >>>>         #Fun<couch_btree.0.83553141>,
>> >>>>         #Fun<couch_btree.1.30790806>,
>> >>>>         #Fun<couch_btree.2.124754102>,nil},
>> >>>>     739831,<<"foo_replication_tmp">>,
>> >>>>     "/data/foo/couchdb-data/foo_replication_tmp.couch",
>> >>>>     [],[],nil,
>> >>>>     {user_ctx,null,[],undefined},
>> >>>>     #Ref<0.0.1793.256453>,1000,
>> >>>>     [before_header,after_header,on_file_open],
>> >>>>     false}
>> >>>> ** Reason for termination ==
>> >>>> ** {{badmatch,{error,emfile}},
>> >>>>     [{couch_file,sync,1},
>> >>>>      {couch_db_updater,commit_data,2},
>> >>>>      {couch_db_updater,handle_info,2},
>> >>>>      {gen_server,handle_msg,5},
>> >>>>      {proc_lib,init_p_do_apply,3}]}
>> >>>>
>> >>>> (+ lots of other messages with the same timestamp -- I can send them
>> >>>> if they're useful)
>> >>>>
>> >>>> Exactly at this time, the client got an HTTP 500 status code; the
>> >>>> request was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
>> >>>>
>> >>>> Just before this request, the client had made a PUT (updating an
>> >>>> existing document) that got a 200 status code, but apparently it was
>> >>>> not successfully committed to disk (I'm using "delayed_commits=true";
>> >>>> for my app, this is just fine). The client had received the new _rev
>> >>>> value, but when it tried updating the same document a minute later,
>> >>>> there was a conflict (and it's not possible that somebody else
>> >>>> updated this same document).
>> >>>>
>> >>>> About four hours later, there was a different error ("accept_failed"
>> >>>> sounds like some temporary problem with sockets?):
>> >>>>
>> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>> >>>>     {<0.20693.4>,std_error,
>> >>>>      [{application,mochiweb},
>> >>>>       "Accept failed error","{error,emfile}"]}}
>> >>>>
>> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>> >>>>     {<0.20693.4>,crash_report,
>> >>>>      [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>> >>>>        {pid,<0.20693.4>},
>> >>>>        {registered_name,[]},
>> >>>>        {error_info,
>> >>>>         {exit,
>> >>>>          {error,accept_failed},
>> >>>>          [{mochiweb_socket_server,acceptor_loop,1},
>> >>>>           {proc_lib,init_p_do_apply,3}]}},
>> >>>>        {ancestors,
>> >>>>         [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>> >>>>        {messages,[]},
>> >>>>        {links,[<0.106.0>]},
>> >>>>        {dictionary,[]},
>> >>>>        {trap_exit,false},
>> >>>>        {status,running},
>> >>>>        {heap_size,233},
>> >>>>        {stack_size,24},
>> >>>>        {reductions,200}],
>> >>>>       []]}}
>> >>>>
>> >>>> (+ lots of other messages within the next couple of minutes)
>> >>>>
>> >>>> The same error occurred once more, about four hours later.
>> >>>>
>> >>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>> >>>> what these error messages mean. (BTW, are these something I should
>> >>>> report as bugs in JIRA? I can do that, but I'd like to at least
>> >>>> understand which parts of the error messages are actually relevant
>> >>>> here :-)
>> >>>>
>> >>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>> >>>>
>> >>>> Best regards,
>> >>>> Pasi
>> >>>>
>> >>> The error you're getting is because CouchDB is running out of
>> >>> available file descriptors to use. Try increasing the limit for the
>> >>> user running CouchDB.
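
One follow-up on that last suggestion: raising "ulimit -n" in a login shell
does not necessarily reach a daemon started from an init script, so it is
worth checking what limit the running beam.smp process actually got. A small
sketch along those lines, assuming a kernel that exposes /proc/<pid>/limits
(2.6.24 and later; the stock RHEL 5 kernel may not have it):

#!/usr/bin/env python
# Sketch for checking the fd limit the running beam.smp process actually
# got, plus how many fds it currently holds. Relies on /proc/<pid>/limits,
# which exists on kernels 2.6.24+ (the stock RHEL 5 kernel may lack it).
import os
import sys

def fd_limit(pid):
    """Return (soft, hard) 'Max open files' limits for the given pid."""
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                soft, hard = line.split()[3:5]
                return soft, hard
    return None, None

if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of beam.smp
    soft, hard = fd_limit(pid)
    in_use = len(os.listdir("/proc/%d/fd" % pid))
    print("open fds: %d, soft limit: %s, hard limit: %s" % (in_use, soft, hard))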
