On Feb 25, 2011, at 7:31 PM, Pasi Eronen <[email protected]> wrote:
> Yes, the client is doing something funky :-) Many of the requests
> it sends are POSTs with chunked transfer encoding, so it's being
> hit hard by issue COUCHDB-682. In other words, there's a lot of closing
> and opening of TCP connections going on, and retrying of requests that
> failed due to COUCHDB-682.
>
> (The server behavior in COUCHDB-682 is probably, strictly speaking, not
> violating the HTTP spec -- the server is allowed to close an idle persistent
> connection at any time -- but it's certainly causing unusual behavior with
> respect to the opening and closing of TCP connections.)
>
> Output from netstat -anp looks like this:
>
> tcp    1    0 127.0.0.1:5984    127.0.0.1:51681    CLOSE_WAIT  19124/beam.smp
> tcp    1    0 127.0.0.1:5984    127.0.0.1:36576    CLOSE_WAIT  19124/beam.smp
> tcp    1    0 127.0.0.1:5984    127.0.0.1:35311    CLOSE_WAIT  19124/beam.smp
> (...)
>
> The sockets in the other direction (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984)
> are no longer around (if they were, I guess the client would also run out
> of fds -- and the number of fds used by the client is not growing).
>
> Does this offer any clues for diagnosing the bug?

It's definitely pointing the finger at CouchDB/Erlang.

> I can also try if I can reproduce the fd leak with some simple test program...

Please do. I'd start by trying to mimic your script's main request pattern to
see if you can trigger individual sockets ending up in CLOSE_WAIT.
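For what it's worth, a minimal sketch of such a test program (in Python)
might look like the following. It speaks HTTP over a raw socket so the
client controls both the chunked framing and the moment close() is called;
the host/port and the database name "testdb" are placeholders for whatever
the real setup uses, and posting a small JSON document stands in for the
script's actual requests:

# Reproduction sketch: send chunked-encoded POSTs to CouchDB, closing the
# client side of each connection, then count server-side sockets left in
# CLOSE_WAIT.  Host, port and the "testdb" database name are placeholders.
import os
import socket

HOST, PORT, DB = "127.0.0.1", 5984, "testdb"

def chunked_post(path, body):
    # Speak HTTP by hand so we control both the chunked framing and the
    # timing of the client-side close().
    sock = socket.create_connection((HOST, PORT))
    sock.settimeout(10)
    try:
        head = ("POST %s HTTP/1.1\r\n"
                "Host: %s:%d\r\n"
                "Content-Type: application/json\r\n"
                "Transfer-Encoding: chunked\r\n"
                "\r\n" % (path, HOST, PORT))
        chunk = "%x\r\n%s\r\n0\r\n\r\n" % (len(body), body)
        sock.sendall((head + chunk).encode("ascii"))
        sock.recv(65536)       # read (at least the start of) the response
    finally:
        sock.close()           # client closes first; the server should follow

for i in range(200):
    chunked_post("/" + DB, '{"value": %d}' % i)

# Any connection still in CLOSE_WAIT on port 5984 is a descriptor the
# server side has not released yet.
lines = os.popen("netstat -an").read().splitlines()
stuck = [l for l in lines if "CLOSE_WAIT" in l and ":5984" in l]
print("sockets in CLOSE_WAIT on :5984: %d" % len(stuck))

If that count keeps climbing across runs, it would suggest the chunked POST
plus client-side close pattern alone is enough to trigger the leak.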
> Best regards,
> Pasi
>
> On Sat, Feb 26, 2011 at 01:11, Paul Davis <[email protected]> wrote:
> > It sounds like your client is doing something funky with shutting down
> > connections to the server. If I were to guess, I would say that it
> > looks like you're leaking sockets between a call to shutdown and a
> > call to close. Though to say for certain I'd need to see what state
> > both ends of the socket are in.
> >
> > The errors about pread_iolist are hard to diagnose without the
> > surrounding error messages, but if I were to guess, you're getting an
> > emfile that cascades to the file gen_server, which then notifies the
> > other callers trying to issue read calls.
> >
> > On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <[email protected]> wrote:
> >> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
> >> is indeed using a lot of file descriptors, and the number is growing.
> >> I restarted CouchDB (and my batch job), and after 40 minutes it was
> >> already using 300 fds; an hour later, the figure was 600; half
> >> an hour later, almost 700.
> >>
> >> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
> >> As a temporary fix I changed the limit to 64K, so it should work fine
> >> for the weekend :-) (but not much longer)
> >>
> >> Based on "netstat -an", the fds are TCP connections stuck in the
> >> CLOSE_WAIT state. Hmm. It seems CLOSE_WAIT means the client closed the
> >> connection but CouchDB didn't call close() to release the fd? And that's
> >> presumably a bug somewhere in CouchDB, right?
> >>
> >> After the initial error message, the log had ~300 almost identical
> >> error messages of the form (with a different number after pread_iolist):
> >>
> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
> >> Uncaught error in HTTP request: {exit,
> >>     {normal,
> >>      {gen_server,call,
> >>       [<0.288.0>,
> >>        {pread_iolist,812881959},
> >>        infinity]}}}
> >>
> >> Does this give any hints where the bug might be?
> >>
> >> If not, I'll see if I can reproduce the bug somehow in a more controlled
> >> setting. (I have a hunch that this might involve the use of HTTP
> >> expect/continue or POST requests with chunked transfer encoding,
> >> but let's see....)
> >>
> >> Best regards,
> >> Pasi
> >>
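A side note on the fd numbers above: roughly 300 descriptors after 40
minutes, ~600 an hour later, and ~700 half an hour after that works out to
a leak on the order of a few descriptors per minute, so a default limit of
1024 is indeed exhausted within a few hours of a restart. A small monitor
along the following lines makes the leak rate easy to watch; the
"pgrep -f beam.smp" pid lookup and the limit of 1024 are assumptions about
this particular setup, and it needs to run as the CouchDB user or root so
that /proc/<pid>/fd is readable:

# Rough fd-growth monitor for the CouchDB VM: count the entries under
# /proc/<pid>/fd once a minute and estimate how long until the limit is hit.
# The "pgrep -f beam.smp" lookup and the limit of 1024 are assumptions about
# this setup; run as the CouchDB user or root.
import os
import time

pid = os.popen("pgrep -f beam.smp").read().split()[0]
limit = 1024                     # the default "ulimit -n" mentioned above

def fd_count(p):
    return len(os.listdir("/proc/%s/fd" % p))

prev = fd_count(pid)
while True:
    time.sleep(60)
    cur = fd_count(pid)
    per_min = cur - prev
    prev = cur
    if per_min > 0:
        minutes_left = (limit - cur) / float(per_min)
        print("%d fds open, +%d/min, ~%.0f minutes until emfile"
              % (cur, per_min, minutes_left))
    else:
        print("%d fds open, not growing" % cur)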
> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis <[email protected]>
> >> wrote:
> >>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <[email protected]> wrote:
> >>>> Hi,
> >>>>
> >>>> I had a big batch job (inserting 10M+ documents and generating views
> >>>> for them) that ran just fine for about 6 hours, but then I got this
> >>>> error:
> >>>>
> >>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
> >>>> <0.276.0> terminating
> >>>> ** Last message in was delayed_commit
> >>>> ** When Server state ==
> >>>>     {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
> >>>>         <0.273.0>,<0.277.0>,
> >>>>         {db_header,5,739828,0,
> >>>>             {4778613011,{663866,0}},
> >>>>             {4778614954,663866},
> >>>>             nil,0,nil,nil,1000},
> >>>>         739828,
> >>>>         {btree,<0.273.0>,
> >>>>             {4778772755,{663866,0}},
> >>>>             #Fun<couch_db_updater.7.10053969>,
> >>>>             #Fun<couch_db_updater.8.35220795>,
> >>>>             #Fun<couch_btree.5.124754102>,
> >>>>             #Fun<couch_db_updater.9.107593676>},
> >>>>         {btree,<0.273.0>,
> >>>>             {4778774698,663866},
> >>>>             #Fun<couch_db_updater.10.30996817>,
> >>>>             #Fun<couch_db_updater.11.96515267>,
> >>>>             #Fun<couch_btree.5.124754102>,
> >>>>             #Fun<couch_db_updater.12.117826253>},
> >>>>         {btree,<0.273.0>,nil,
> >>>>             #Fun<couch_btree.0.83553141>,
> >>>>             #Fun<couch_btree.1.30790806>,
> >>>>             #Fun<couch_btree.2.124754102>,nil},
> >>>>         739831,<<"foo_replication_tmp">>,
> >>>>         "/data/foo/couchdb-data/foo_replication_tmp.couch",
> >>>>         [],[],nil,
> >>>>         {user_ctx,null,[],undefined},
> >>>>         #Ref<0.0.1793.256453>,1000,
> >>>>         [before_header,after_header,on_file_open],
> >>>>         false}
> >>>> ** Reason for termination ==
> >>>> ** {{badmatch,{error,emfile}},
> >>>>     [{couch_file,sync,1},
> >>>>      {couch_db_updater,commit_data,2},
> >>>>      {couch_db_updater,handle_info,2},
> >>>>      {gen_server,handle_msg,5},
> >>>>      {proc_lib,init_p_do_apply,3}]}
> >>>>
> >>>> (+ a lot of other messages with the same timestamp -- I can send them
> >>>> if they're useful)
> >>>>
> >>>> Exactly at this time, the client got an HTTP 500 status code; the
> >>>> request was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
> >>>>
> >>>> Just before this request, the client had made a PUT (updating an
> >>>> existing document) that got a 200 status code but apparently was not
> >>>> successfully committed to disk (I'm using "delayed_commits=true" --
> >>>> for my app, this is just fine). The client had received the new _rev
> >>>> value, but when it tried updating the same document a minute later,
> >>>> there was a conflict (and it's not possible that somebody else updated
> >>>> this same document).
> >>>>
> >>>> About four hours later, there was a different error ("accept_failed"
> >>>> sounds like some temporary problem with sockets?):
> >>>>
> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>]
> >>>> {error_report,<0.31.0>,
> >>>>     {<0.20693.4>,std_error,
> >>>>      [{application,mochiweb},
> >>>>       "Accept failed error","{error,emfile}"]}}
> >>>>
> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>]
> >>>> {error_report,<0.31.0>,
> >>>>     {<0.20693.4>,crash_report,
> >>>>      [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
> >>>>        {pid,<0.20693.4>},
> >>>>        {registered_name,[]},
> >>>>        {error_info,
> >>>>            {exit,
> >>>>                {error,accept_failed},
> >>>>                [{mochiweb_socket_server,acceptor_loop,1},
> >>>>                 {proc_lib,init_p_do_apply,3}]}},
> >>>>        {ancestors,
> >>>>            [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
> >>>>        {messages,[]},
> >>>>        {links,[<0.106.0>]},
> >>>>        {dictionary,[]},
> >>>>        {trap_exit,false},
> >>>>        {status,running},
> >>>>        {heap_size,233},
> >>>>        {stack_size,24},
> >>>>        {reductions,200}],
> >>>>       []]}}
> >>>>
> >>>> (+ lots of other messages within the next couple of minutes)
> >>>>
> >>>> The same error occurred once more, about four hours later.
> >>>>
> >>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
> >>>> what these error messages mean. (BTW, are these something I should
> >>>> report as bugs in JIRA? I can do that, but I'd like to at least
> >>>> understand which parts of the error messages are actually relevant
> >>>> here :-)
> >>>>
> >>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
> >>>>
> >>>> Best regards,
> >>>> Pasi
> >>>>
> >>>
> >>> The error you're getting is because CouchDB is running out of
> >>> available file descriptors to use. Try increasing the limit for the
> >>> user running CouchDB.
> >>>
> >>
> >
>
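One caveat on raising the limit: "ulimit -n" in an interactive shell only
affects that shell and its children, so it does not change an already
running beam.smp; the higher value has to be in effect in whatever shell or
init script starts CouchDB, and CouchDB then restarted. On kernels that
expose /proc/<pid>/limits, something along these lines confirms which limit
the running process actually got (the "pgrep -f beam.smp" lookup is again
an assumption about the setup):

# Verify the fd limit of the running CouchDB VM, rather than the limit of
# the current shell.  Requires a kernel that exposes /proc/<pid>/limits;
# the "pgrep -f beam.smp" lookup is an assumption about this setup.
import os

pid = os.popen("pgrep -f beam.smp").read().split()[0]

for line in open("/proc/%s/limits" % pid):
    if line.startswith("Max open files"):
        print(line.rstrip())     # soft and hard limits in effect
print("fds currently open: %d" % len(os.listdir("/proc/%s/fd" % pid)))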
