I've finally managed to reproduce this problem in a test environment, and filed a bug report:
https://issues.apache.org/jira/browse/COUCHDB-1100

Best regards,
Pasi

On Sat, Feb 26, 2011 at 11:19, Pasi Eronen <[email protected]> wrote:
> BTW, it seems just changing "ulimit -n" did not work. At some point,
> the server just stopped responding completely (even the root URL just
> times out, and there's nothing in the error log).
>
> A quick googling found references to enabling kernel poll (at build time
> and on the erl command line), and increasing the ERL_MAX_PORTS
> environment variable. Based on a quick look it seems kernel poll
> was already enabled, but I'll experiment with ERL_MAX_PORTS...
>
> Best regards,
> Pasi
>
> On Sat, Feb 26, 2011 at 03:40, Paul J. Davis <[email protected]> wrote:
>
>> On Feb 25, 2011, at 7:31 PM, Pasi Eronen <[email protected]> wrote:
>>
>> Yes, the client is doing something funky :-) Many of the requests
>> it sends are POSTs with chunked transfer encoding, so it's being
>> hit hard by issue COUCHDB-682. In other words, there's a lot of closing
>> and opening of TCP connections going on, and retrying of requests that
>> failed due to COUCHDB-682.
>>
>> (The server behavior in COUCHDB-682 is probably, strictly speaking, not
>> violating the HTTP spec -- the server is allowed to close an idle
>> persistent connection at any time -- but it's certainly causing an
>> unusual amount of opening and closing of TCP connections.)
>>
>> Output from netstat -anp looks like this:
>>
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:51681  CLOSE_WAIT  19124/beam.smp
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:36576  CLOSE_WAIT  19124/beam.smp
>> tcp  1  0  127.0.0.1:5984  127.0.0.1:35311  CLOSE_WAIT  19124/beam.smp
>> (...)
>>
>> The other directions (e.g. 127.0.0.1:51681 -> 127.0.0.1:5984) are no
>> longer around (if they were, I guess the client would also run out of
>> fds -- and the number of fds used by the client is not growing).
>>
>> Does this offer any clues for diagnosing the bug?
>>
>> It's definitely pointing the finger at CouchDB/Erlang.
>>
>> I can also try to see if I can reproduce the fd leak with some simple
>> test program...
>>
>> Please do. I'd start by trying to mimic your script's main request
>> pattern to see if you can trigger individual sockets to end up in
>> CLOSE_WAIT.
>>
>> Best regards,
>> Pasi
>>
>> On Sat, Feb 26, 2011 at 01:11, Paul Davis <[email protected]> wrote:
>> > It sounds like your client is doing something funky with shutting down
>> > connections to the server. If I were to guess, I would say that it
>> > looks like you're leaking sockets between a call to shutdown and a
>> > call to close. Though to say for certain, I'd need to see what state
>> > both ends of the socket are in.
>> >
>> > The errors about pread_iolist are hard to diagnose without the
>> > surrounding error messages, but my guess is that you're getting an
>> > emfile that cascades to the file gen_server, which then notifies the
>> > other callers trying to issue read calls.
>> >
>> > On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <[email protected]> wrote:
>> >> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
>> >> is indeed using a lot of file descriptors, and the number is growing.
>> >> I restarted CouchDB (and my batch job), and after 40 minutes it was
>> >> already using 300 fds; an hour later, the figure was 600; half
>> >> an hour later, almost 700.
>> >>
>> >> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
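
A minimal sketch of this kind of monitoring, in Python, assuming Linux /proc,
CouchDB's default port 5984, and that the beam.smp pid is given on the command
line (these specifics are assumptions, not details from the messages above):

#!/usr/bin/env python
# Rough monitoring sketch: count open fds held by the beam.smp process
# and CLOSE_WAIT sockets whose local port is CouchDB's (assumed default 5984).
# Needs permission to read /proc/<pid>/fd (same user as CouchDB, or root).
import os
import sys
import time

COUCH_PORT = 5984  # adjust if CouchDB listens elsewhere

def open_fds(pid):
    """Number of file descriptors the process currently holds."""
    return len(os.listdir("/proc/%d/fd" % pid))

def close_wait_count(port):
    """CLOSE_WAIT entries in /proc/net/tcp with the given local port."""
    count = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            state = fields[3]  # "08" means CLOSE_WAIT
            if local_port == port and state == "08":
                count += 1
    return count

if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of beam.smp
    while True:
        print("fds=%d close_wait=%d" % (open_fds(pid), close_wait_count(COUCH_PORT)))
        time.sleep(60)
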
>> >> As a temporary fix I changed the limit to 64K, so it should work fine
>> >> for the weekend :-) (but not much longer).
>> >>
>> >> Based on "netstat -an", the fds are TCP connections stuck in the
>> >> CLOSE_WAIT state. Hmm. It seems CLOSE_WAIT means the client closed the
>> >> connection, but CouchDB didn't call close() to release the fd? And
>> >> that's presumably a bug somewhere in CouchDB, right?
>> >>
>> >> After the initial error message, the log had ~300 almost identical
>> >> error messages of the form (with a different number after pread_iolist):
>> >>
>> >> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
>> >> Uncaught error in HTTP request: {exit,
>> >>     {normal,
>> >>      {gen_server,call,
>> >>       [<0.288.0>,
>> >>        {pread_iolist,812881959},
>> >>        infinity]}}}
>> >>
>> >> Does this give any hints about where the bug might be?
>> >>
>> >> If not, I'll see if I can reproduce the bug somehow in a more
>> >> controlled setting. (I have a hunch that this might involve the use
>> >> of HTTP expect/continue or POST requests with chunked transfer
>> >> encoding, but let's see....)
>> >>
>> >> Best regards,
>> >> Pasi
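
A rough sketch of the kind of "simple test program" being discussed here: a
Python client that sends a chunked POST over a fresh connection and then
drops the connection, in a loop. The database name, endpoint, and document
body are placeholders; only the host and port (127.0.0.1:5984) match the
thread, and whether this pattern actually triggers the leak is exactly what
would need testing:

#!/usr/bin/env python
# Placeholder reproduction attempt: send POSTs with chunked transfer
# encoding over fresh TCP connections that the client closes right after
# reading (part of) the response, then check the server for CLOSE_WAIT
# sockets with netstat or the monitoring script above.
import socket

HOST, PORT = "127.0.0.1", 5984
BODY = b'{"docs": [{"value": 1}]}'  # made-up payload for _bulk_docs

def chunked_post(path, body):
    s = socket.create_connection((HOST, PORT))
    try:
        headers = (
            "POST %s HTTP/1.1\r\n"
            "Host: %s:%d\r\n"
            "Content-Type: application/json\r\n"
            "Transfer-Encoding: chunked\r\n"
            "\r\n" % (path, HOST, PORT)
        ).encode("ascii")
        # One data chunk followed by the zero-length terminating chunk.
        chunked = ("%x\r\n" % len(body)).encode("ascii") + body + b"\r\n0\r\n\r\n"
        s.sendall(headers + chunked)
        s.settimeout(5)
        try:
            s.recv(4096)  # read some of the response, then just drop the connection
        except socket.timeout:
            pass
    finally:
        s.close()

if __name__ == "__main__":
    # "testdb" is a placeholder database; an "Expect: 100-continue" header
    # could be added above to exercise that code path as well.
    for _ in range(1000):
        chunked_post("/testdb/_bulk_docs", BODY)
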
>> >> On Fri, Feb 25, 2011 at 19:01, Paul Davis <[email protected]> wrote:
>> >>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <[email protected]> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I had a big batch job (inserting 10M+ documents and generating views
>> >>>> for them) that ran just fine for about 6 hours, but then I got this
>> >>>> error:
>> >>>>
>> >>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>> >>>> <0.276.0> terminating
>> >>>> ** Last message in was delayed_commit
>> >>>> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>> >>>>     <0.273.0>,<0.277.0>,
>> >>>>     {db_header,5,739828,0,
>> >>>>         {4778613011,{663866,0}},
>> >>>>         {4778614954,663866},
>> >>>>         nil,0,nil,nil,1000},
>> >>>>     739828,
>> >>>>     {btree,<0.273.0>,
>> >>>>         {4778772755,{663866,0}},
>> >>>>         #Fun<couch_db_updater.7.10053969>,
>> >>>>         #Fun<couch_db_updater.8.35220795>,
>> >>>>         #Fun<couch_btree.5.124754102>,
>> >>>>         #Fun<couch_db_updater.9.107593676>},
>> >>>>     {btree,<0.273.0>,
>> >>>>         {4778774698,663866},
>> >>>>         #Fun<couch_db_updater.10.30996817>,
>> >>>>         #Fun<couch_db_updater.11.96515267>,
>> >>>>         #Fun<couch_btree.5.124754102>,
>> >>>>         #Fun<couch_db_updater.12.117826253>},
>> >>>>     {btree,<0.273.0>,nil,
>> >>>>         #Fun<couch_btree.0.83553141>,
>> >>>>         #Fun<couch_btree.1.30790806>,
>> >>>>         #Fun<couch_btree.2.124754102>,nil},
>> >>>>     739831,<<"foo_replication_tmp">>,
>> >>>>     "/data/foo/couchdb-data/foo_replication_tmp.couch",
>> >>>>     [],[],nil,
>> >>>>     {user_ctx,null,[],undefined},
>> >>>>     #Ref<0.0.1793.256453>,1000,
>> >>>>     [before_header,after_header,on_file_open],
>> >>>>     false}
>> >>>> ** Reason for termination ==
>> >>>> ** {{badmatch,{error,emfile}},
>> >>>>     [{couch_file,sync,1},
>> >>>>      {couch_db_updater,commit_data,2},
>> >>>>      {couch_db_updater,handle_info,2},
>> >>>>      {gen_server,handle_msg,5},
>> >>>>      {proc_lib,init_p_do_apply,3}]}
>> >>>>
>> >>>> (+ lots of other messages with the same timestamp -- I can send them
>> >>>> if they're useful)
>> >>>>
>> >>>> Exactly at this time, the client got an HTTP 500 status code; the
>> >>>> request was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
>> >>>>
>> >>>> Just before this request, the client had made a PUT (updating an
>> >>>> existing document) that got a 200 status code, but apparently it was
>> >>>> not successfully committed to disk (I'm using "delayed_commits=true";
>> >>>> for my app, this is just fine). The client had received the new _rev
>> >>>> value, but when it tried updating the same document a minute later,
>> >>>> there was a conflict (and it's not possible that somebody else
>> >>>> updated this same document).
>> >>>>
>> >>>> About four hours later, there was a different error ("accept_failed"
>> >>>> sounds like some temporary problem with sockets?):
>> >>>>
>> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>> >>>>     {<0.20693.4>,std_error,
>> >>>>      [{application,mochiweb},
>> >>>>       "Accept failed error","{error,emfile}"]}}
>> >>>>
>> >>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>] {error_report,<0.31.0>,
>> >>>>     {<0.20693.4>,crash_report,
>> >>>>      [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>> >>>>        {pid,<0.20693.4>},
>> >>>>        {registered_name,[]},
>> >>>>        {error_info,
>> >>>>         {exit,
>> >>>>          {error,accept_failed},
>> >>>>          [{mochiweb_socket_server,acceptor_loop,1},
>> >>>>           {proc_lib,init_p_do_apply,3}]}},
>> >>>>        {ancestors,
>> >>>>         [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>> >>>>        {messages,[]},
>> >>>>        {links,[<0.106.0>]},
>> >>>>        {dictionary,[]},
>> >>>>        {trap_exit,false},
>> >>>>        {status,running},
>> >>>>        {heap_size,233},
>> >>>>        {stack_size,24},
>> >>>>        {reductions,200}],
>> >>>>       []]}}
>> >>>>
>> >>>> (+ lots of other messages within the next couple of minutes)
>> >>>>
>> >>>> The same error occurred once more, about four hours later.
>> >>>>
>> >>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>> >>>> what these error messages mean. (BTW, are these something I should
>> >>>> report as bugs in JIRA? I can do that, but I'd like to at least
>> >>>> understand which parts of the error messages are actually relevant
>> >>>> here :-)
>> >>>>
>> >>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>> >>>>
>> >>>> Best regards,
>> >>>> Pasi
>> >>>>
>> >>> The error you're getting is because CouchDB is running out of
>> >>> available file descriptors to use. Try increasing the limit for the
>> >>> user running CouchDB.
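
One follow-up on that last suggestion: raising "ulimit -n" in a login shell
does not necessarily reach a daemon started from an init script, so it is
worth checking what limit the running beam.smp process actually got. A small
sketch along those lines, assuming a kernel that exposes /proc/<pid>/limits
(2.6.24 and later; the stock RHEL 5 kernel may not have it):

#!/usr/bin/env python
# Sketch for checking the fd limit the running beam.smp process actually
# got, plus how many fds it currently holds. Relies on /proc/<pid>/limits,
# which exists on kernels 2.6.24+ (the stock RHEL 5 kernel may lack it).
import os
import sys

def fd_limit(pid):
    """Return (soft, hard) 'Max open files' limits for the given pid."""
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                soft, hard = line.split()[3:5]
                return soft, hard
    return None, None

if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of beam.smp
    soft, hard = fd_limit(pid)
    in_use = len(os.listdir("/proc/%d/fd" % pid))
    print("open fds: %d, soft limit: %s, hard limit: %s" % (in_use, soft, hard))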
