Yes, the client is doing something funky :-) Many of the requests it
sends are POSTs with chunked transfer encoding, so it's being hit hard
by issue COUCHDB-682. In other words, there's a lot of closing and
opening of TCP connections going on, and retrying of requests that
failed due to COUCHDB-682.
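For concreteness, the request pattern looks roughly like this (a
simplified Python sketch, not the real client -- the database name and
request body here are made up):

    import socket

    HOST, PORT = "127.0.0.1", 5984
    BODY = b'{"keys": ["doc-1", "doc-2"]}'

    def chunked_post(sock):
        # One request with Transfer-Encoding: chunked instead of
        # Content-Length: a single data chunk (size in hex, then the
        # data), followed by the terminating zero-length chunk.
        sock.sendall(
            b"POST /foo_replication_tmp/_all_docs?include_docs=true HTTP/1.1\r\n"
            b"Host: 127.0.0.1:5984\r\n"
            b"Content-Type: application/json\r\n"
            b"Transfer-Encoding: chunked\r\n"
            b"\r\n")
        sock.sendall(("%x\r\n" % len(BODY)).encode() + BODY + b"\r\n0\r\n\r\n")
        data = sock.recv(65536)  # naive read, but enough to drive the server
        if not data:
            raise ConnectionError("server closed the connection")
        return data

    sock = socket.create_connection((HOST, PORT))
    for _ in range(1000):
        try:
            chunked_post(sock)
        except ConnectionError:
            # The COUCHDB-682 case: the server dropped the connection,
            # so retry the failed request on a fresh one, like the real
            # client does.
            sock.close()
            sock = socket.create_connection((HOST, PORT))
            chunked_post(sock)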
(The server behavior in COUCHDB-682 is probably strictly speaking not
violating the HTTP spec -- the server is allowed to close an idle
persistent connection at any time -- but it's certainly causing an
unusual amount of opening and closing of TCP connections.)

Output from netstat -anp looks like this:

tcp        1      0 127.0.0.1:5984        127.0.0.1:51681       CLOSE_WAIT  19124/beam.smp
tcp        1      0 127.0.0.1:5984        127.0.0.1:36576       CLOSE_WAIT  19124/beam.smp
tcp        1      0 127.0.0.1:5984        127.0.0.1:35311       CLOSE_WAIT  19124/beam.smp
(...)

The other directions (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984) are no
longer around (if they were, I guess the client would also run out of
fds -- and the number of fds used by the client is not growing).

Does this offer any clues for diagnosing the bug?
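In case it helps, this is roughly how I'm tracking the fd counts on the
server side while the batch job runs (a quick Linux-specific sketch; it
assumes the Erlang VM shows up as "beam.smp" and that its /proc entries
are readable):

    import os
    import subprocess
    import time

    # Pid of the Erlang VM running CouchDB (first match for "beam.smp").
    pid = subprocess.check_output(["pgrep", "-f", "beam.smp"]).split()[0].decode()

    while True:
        # Number of open fds, straight from /proc/<pid>/fd.
        fds = len(os.listdir("/proc/%s/fd" % pid))
        # CLOSE_WAIT sockets owned by the process, per netstat -anp.
        out = subprocess.check_output(["netstat", "-anp"]).decode(errors="replace")
        close_wait = sum(1 for line in out.splitlines()
                         if "CLOSE_WAIT" in line and ("%s/" % pid) in line)
        print("fds=%d close_wait=%d" % (fds, close_wait))
        time.sleep(60)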
I can also try to see if I can reproduce the fd leak with a simple
test program...

Best regards,
Pasi

On Sat, Feb 26, 2011 at 01:11, Paul Davis <[email protected]> wrote:
> It sounds like your client is doing something funky with shutting down
> connections to the server. If I were to guess, I would say that it
> looks like you're leaking sockets between a call to shutdown and a
> call to close. Though to say for certain I'd need to see what state
> both ends of the socket are in.
>
> The errors about pread_iolist are hard to diagnose without the
> surrounding error messages, but if I were to guess, you're getting an
> emfile error that cascades to the file gen_server, which then notifies
> the other processes trying to issue read calls.
>
> On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <[email protected]> wrote:
>> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
>> is indeed using a lot of file descriptors, and the number is growing.
>> I restarted CouchDB (and my batch job), and after 40 minutes, it was
>> already using 300 fds; an hour later, the figure was 600; half an
>> hour later, almost 700.
>>
>> The default "ulimit -n" was 1024, so that only adds up to a couple
>> of hours. As a temporary fix I changed the limit to 64K, so it should
>> work fine for the weekend :-) (but not much longer)
>>
>> Based on "netstat -an", the fds are TCP connections stuck in the
>> CLOSE_WAIT state. Hmm. It seems CLOSE_WAIT means the client closed
>> the connection but CouchDB didn't call close() to release the fd?
>> And that's presumably a bug somewhere in CouchDB, right?
>>
>> After the initial error message, the log had ~300 almost identical
>> error messages of the form (with a different number after
>> pread_iolist):
>>
>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
>> Uncaught error in HTTP request: {exit,
>>     {normal,
>>      {gen_server,call,
>>       [<0.288.0>,
>>        {pread_iolist,812881959},
>>        infinity]}}}
>>
>> Does this give any hints where the bug might be?
>>
>> If not, I'll see if I can reproduce the bug somehow in a more
>> controlled setting. (I have a hunch that this might involve the use
>> of HTTP expect/continue or POST requests with chunked transfer
>> encoding, but let's see....)
>>
>> Best regards,
>> Pasi
>>
>> On Fri, Feb 25, 2011 at 19:01, Paul Davis <[email protected]> wrote:
>>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I had a big batch job (inserting 10M+ documents and generating
>>>> views for them) that ran just fine for about 6 hours, but then I
>>>> got this error:
>>>>
>>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>>>> <0.276.0> terminating
>>>> ** Last message in was delayed_commit
>>>> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>>>>     <0.273.0>,<0.277.0>,
>>>>     {db_header,5,739828,0,
>>>>         {4778613011,{663866,0}},
>>>>         {4778614954,663866},
>>>>         nil,0,nil,nil,1000},
>>>>     739828,
>>>>     {btree,<0.273.0>,
>>>>         {4778772755,{663866,0}},
>>>>         #Fun<couch_db_updater.7.10053969>,
>>>>         #Fun<couch_db_updater.8.35220795>,
>>>>         #Fun<couch_btree.5.124754102>,
>>>>         #Fun<couch_db_updater.9.107593676>},
>>>>     {btree,<0.273.0>,
>>>>         {4778774698,663866},
>>>>         #Fun<couch_db_updater.10.30996817>,
>>>>         #Fun<couch_db_updater.11.96515267>,
>>>>         #Fun<couch_btree.5.124754102>,
>>>>         #Fun<couch_db_updater.12.117826253>},
>>>>     {btree,<0.273.0>,nil,
>>>>         #Fun<couch_btree.0.83553141>,
>>>>         #Fun<couch_btree.1.30790806>,
>>>>         #Fun<couch_btree.2.124754102>,nil},
>>>>     739831,<<"foo_replication_tmp">>,
>>>>     "/data/foo/couchdb-data/foo_replication_tmp.couch",
>>>>     [],[],nil,
>>>>     {user_ctx,null,[],undefined},
>>>>     #Ref<0.0.1793.256453>,1000,
>>>>     [before_header,after_header,on_file_open],
>>>>     false}
>>>> ** Reason for termination ==
>>>> ** {{badmatch,{error,emfile}},
>>>>     [{couch_file,sync,1},
>>>>      {couch_db_updater,commit_data,2},
>>>>      {couch_db_updater,handle_info,2},
>>>>      {gen_server,handle_msg,5},
>>>>      {proc_lib,init_p_do_apply,3}]}
>>>>
>>>> (+ lots of other messages with the same timestamp -- I can send
>>>> them if they're useful)
>>>>
>>>> Exactly at this time, the client got an HTTP 500 status code; the
>>>> request was a bulk get
>>>> (POST /foo_replication_tmp/_all_docs?include_docs=true).
>>>>
>>>> Just before this request, the client had made a PUT (updating an
>>>> existing document) that got a 200 status code, but apparently it
>>>> was not successfully committed to disk (I'm using
>>>> "delayed_commits=true" -- for my app, this is just fine). The
>>>> client had received the new _rev value, but when it tried updating
>>>> the same document a minute later, there was a conflict (and it's
>>>> not possible that somebody else updated this same document).
>>>>
>>>> About four hours later, there was a different error ("accept_failed"
>>>> sounds like some temporary problem with sockets?):
>>>>
>>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>>>>     {<0.20693.4>,std_error,
>>>>         [{application,mochiweb},
>>>>          "Accept failed error","{error,emfile}"]}}
>>>>
>>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>>>>     {<0.20693.4>,crash_report,
>>>>         [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>>>>           {pid,<0.20693.4>},
>>>>           {registered_name,[]},
>>>>           {error_info,
>>>>               {exit,
>>>>                   {error,accept_failed},
>>>>                   [{mochiweb_socket_server,acceptor_loop,1},
>>>>                    {proc_lib,init_p_do_apply,3}]}},
>>>>           {ancestors,
>>>>               [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>>>>           {messages,[]},
>>>>           {links,[<0.106.0>]},
>>>>           {dictionary,[]},
>>>>           {trap_exit,false},
>>>>           {status,running},
>>>>           {heap_size,233},
>>>>           {stack_size,24},
>>>>           {reductions,200}],
>>>>          []]}}
>>>>
>>>> (+ lots of other messages within the next couple of minutes)
>>>>
>>>> The same error occurred once more, about four hours later.
>>>>
>>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>>>> what these error messages mean. (BTW, are these something I should
>>>> report as bugs in JIRA? I can do that, but I'd like to at least
>>>> understand which parts of the error messages are actually relevant
>>>> here :-)
>>>>
>>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>>>>
>>>> Best regards,
>>>> Pasi
>>>>
>>>
>>> The error you're getting is because CouchDB is running out of
>>> available file descriptors. Try increasing the limit for the user
>>> running CouchDB.
>>>
>>
>
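P.S. To double-check the reading of CLOSE_WAIT above: it's the state a
TCP socket sits in after the peer has closed its end but before the
local process calls close(). A minimal demonstration (the port number
is arbitrary):

    import socket
    import time

    # "Server" that accepts a connection but never closes it.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 9999))
    server.listen(1)

    client = socket.create_connection(("127.0.0.1", 9999))
    conn, _ = server.accept()

    client.close()  # the client's FIN moves conn into CLOSE_WAIT
    # conn.close() is never called, so the fd stays allocated; while
    # this sleeps, "netstat -an | grep 9999" shows the CLOSE_WAIT entry
    # -- the same shape as the leaked beam.smp sockets above.
    time.sleep(60)

So if CouchDB (or mochiweb) forgets to close its end after the client
disconnects, the fds pile up exactly like in the netstat output above.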
