It sounds like your client is doing something funky when it shuts down connections to the server. If I had to guess, you're leaking sockets between the call to shutdown and the call to close. To say for certain, though, I'd need to see what state both ends of each socket are in.
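
To make that concrete, here's roughly what that failure mode looks like with a bare gen_tcp client (just a sketch -- I obviously don't know what HTTP library you're actually using):

    %% Hypothetical client code, only to illustrate the failure mode.
    %% If shutdown/2 and close/1 are separate steps and anything throws
    %% in between, close/1 never runs: this side leaks the fd, and the
    %% peer that received the FIN sits in CLOSE_WAIT until it closes
    %% its own end. Doing the close in an 'after' clause avoids that.
    talk(Host, Port, Bytes) ->
        {ok, Sock} = gen_tcp:connect(Host, Port, [binary, {active, false}]),
        try
            ok = gen_tcp:send(Sock, Bytes),
            {ok, Reply} = gen_tcp:recv(Sock, 0),
            ok = gen_tcp:shutdown(Sock, write),
            Reply
        after
            gen_tcp:close(Sock)  %% runs on every path, not just the happy one
        end.
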
The errors about pread_iolist are hard to diagnose without the surrounding error messages, but if I had to guess, you're hitting an emfile that crashes the file gen_server, which in turn blows up everyone blocked on a read call against it (there's a sketch of that cascade after the quoted mail below).

On Fri, Feb 25, 2011 at 5:43 PM, Pasi Eronen <[email protected]> wrote:
> Thanks! I looked at /proc/(pid-of-beam.smp)/fd, and it seems CouchDB
> is indeed using a lot of file descriptors, and the number is growing.
> I restarted CouchDB (and my batch job), and after 40 minutes, it was
> already using 300 fds; an hour later, the figure was 600; half
> an hour later, almost 700.
>
> The default "ulimit -n" was 1024, so that adds up to a couple of hours.
> As a temporary fix I changed the limit to 64K, so it should work fine
> for the weekend :-) (but not much longer)
>
> Based on "netstat -an", the fds are TCP connections stuck in CLOSE_WAIT
> state. Hmm. It seems CLOSE_WAIT means the client closed the connection
> but CouchDB didn't call close() to release the fd? And that's presumably
> a bug somewhere in CouchDB, right?
>
> After the initial error message, the log had ~300 almost identical
> error messages of the form (with different number after pread_iolist):
>
> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.31510.2>]
> Uncaught error in HTTP request: {exit,
> {normal,
> {gen_server,call,
> [<0.288.0>,
> {pread_iolist,812881959},
> infinity]}}}
>
> Does this give any hints where the bug might be?
>
> If not, I'll see if I can reproduce the bug somehow in a more controlled
> setting. (I have a hunch that this might involve the use of HTTP
> expect/continue or POST requests with chunked transfer encoding,
> but let's see....)
>
> Best regards,
> Pasi
>
> On Fri, Feb 25, 2011 at 19:01, Paul Davis <[email protected]> wrote:
>> On Fri, Feb 25, 2011 at 4:18 AM, Pasi Eronen <[email protected]> wrote:
>>> Hi,
>>>
>>> I had a big batch job (inserting 10M+ documents and generating views for
>>> them)
>>> that ran just fine for about 6 hours, but then I got this error:
>>>
>>> [Thu, 24 Feb 2011 19:42:57 GMT] [error] [<0.276.0>] ** Generic server
>>> <0.276.0> terminating
>>> ** Last message in was delayed_commit
>>> ** When Server state == {db,<0.275.0>,<0.276.0>,nil,<<"1298547642391489">>,
>>> <0.273.0>,<0.277.0>,
>>> {db_header,5,739828,0,
>>> {4778613011,{663866,0}},
>>> {4778614954,663866},
>>> nil,0,nil,nil,1000},
>>> 739828,
>>> {btree,<0.273.0>,
>>> {4778772755,{663866,0}},
>>> #Fun<couch_db_updater.7.10053969>,
>>> #Fun<couch_db_updater.8.35220795>,
>>> #Fun<couch_btree.5.124754102>,
>>> #Fun<couch_db_updater.9.107593676>},
>>> {btree,<0.273.0>,
>>> {4778774698,663866},
>>> #Fun<couch_db_updater.10.30996817>,
>>> #Fun<couch_db_updater.11.96515267>,
>>> #Fun<couch_btree.5.124754102>,
>>> #Fun<couch_db_updater.12.117826253>},
>>> {btree,<0.273.0>,nil,
>>> #Fun<couch_btree.0.83553141>,
>>> #Fun<couch_btree.1.30790806>,
>>> #Fun<couch_btree.2.124754102>,nil},
>>> 739831,<<"foo_replication_tmp">>,
>>>
>>> "/data/foo/couchdb-data/foo_replication_tmp.couch",
>>> [],[],nil,
>>> {user_ctx,null,[],undefined},
>>> #Ref<0.0.1793.256453>,1000,
>>> [before_header,after_header,on_file_open],
>>> false}
>>> ** Reason for termination ==
>>> ** {{badmatch,{error,emfile}},
>>> [{couch_file,sync,1},
>>> {couch_db_updater,commit_data,2},
>>> {couch_db_updater,handle_info,2},
>>> {gen_server,handle_msg,5},
>>> {proc_lib,init_p_do_apply,3}]}
>>>
>>> (+lot of other messages with the same timestamp -- can send if they're
>>> useful)
>>>
>>> Exactly at this time, the client got HTTP 500 status code; the request
>>> was a bulk get (POST /foo_replication_tmp/_all_docs?include_docs=true).
>>>
>>> Just before this request, the client had made a PUT (updating an existing
>>> document) that got 200 status code, but apparently was not successfully
>>> committed to the disk (I'm using "delayed_commits=true" - for my app,
>>> this is just fine). The client had received the new _rev value, but when
>>> it tried updating the same document a minute later, there was a conflict
>>> (and it's not possible that somebody else updated this same document).
>>>
>>> About four hours later, there was a different error ("accept_failed"
>>> sounds like some temporary problem with sockets?):
>>>
>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>]
>>> {error_report,<0.31.0>,
>>> {<0.20693.4>,std_error,
>>> [{application,mochiweb},
>>> "Accept failed error","{error,emfile}"]}}
>>>
>>> [Thu, 24 Feb 2011 23:55:42 GMT] [error] [<0.20693.4>]
>>> {error_report,<0.31.0>,
>>> {<0.20693.4>,crash_report,
>>> [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>>> {pid,<0.20693.4>},
>>> {registered_name,[]},
>>> {error_info,
>>> {exit,
>>> {error,accept_failed},
>>> [{mochiweb_socket_server,acceptor_loop,1},
>>> {proc_lib,init_p_do_apply,3}]}},
>>> {ancestors,
>>> [couch_httpd,couch_secondary_services,couch_server_sup,<0.32.0>]},
>>> {messages,[]},
>>> {links,[<0.106.0>]},
>>> {dictionary,[]},
>>> {trap_exit,false},
>>> {status,running},
>>> {heap_size,233},
>>> {stack_size,24},
>>> {reductions,200}],
>>> []]}}
>>>
>>> (+lots of other messages within the next couple of minutes)
>>>
>>> The same error occurred once more, about four hours later.
>>>
>>> I'm quite new to CouchDB, so I'd appreciate any help in interpreting
>>> what these error messages mean. (BTW, are these something I should
>>> report as bugs in JIRA? I can do that, but I'd like to at least understand
>>> which parts of the error messages are actually relevant here :-)
>>>
>>> I'm running CouchDB 1.0.2 with Erlang R14B on 64-bit RHEL 5.6.
>>>
>>> Best regards,
>>> Pasi
>>>
>>
>> The error you're getting is because CouchDB is running out of
>> available file descriptors to use. Try increasing the limit for the
>> user running CouchDB.
>>
>
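
Here's the sketch of the pread_iolist cascade I promised: when a gen_server dies, every process blocked in gen_server:call/3 on it exits with the server's exit reason wrapped as {Reason,{gen_server,call,...}}, which is exactly the shape in your log. Nothing below is CouchDB-specific -- the module name and the {pread_iolist,...} request are made up for the demo:

    -module(cascade_demo).
    -behaviour(gen_server).
    -export([run/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    %% One process blocks in gen_server:call/3; the server then stops
    %% with reason 'normal' (standing in for couch_file dying on the
    %% {badmatch,{error,emfile}}), and the blocked caller gets an exit.
    run() ->
        {ok, Pid} = gen_server:start(?MODULE, [], []),
        Self = self(),
        spawn(fun() ->
                      R = (catch gen_server:call(Pid, {pread_iolist, 1}, infinity)),
                      Self ! {result, R}
              end),
        timer:sleep(100),            %% let the call get queued first
        gen_server:cast(Pid, stop),  %% stand-in for the crash
        receive {result, Result} -> Result end.

    init([]) -> {ok, nostate}.
    handle_call(_Req, _From, State) -> {noreply, State}.  %% never replies
    handle_cast(stop, State) -> {stop, normal, State}.
    handle_info(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.

Running cascade_demo:run() returns {'EXIT',{normal,{gen_server,call,[<0.X.0>,{pread_iolist,1},infinity]}}} -- one crashed couch_file, and each of your ~300 blocked readers reports one of those.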

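Pasi: since you're already watching /proc/<pid>/fd, you can get the same numbers from inside the VM, which also tells you what kind of descriptors they are. From a shell attached to the CouchDB node (assuming R14 driver naming, where TCP sockets show up as "tcp_inet" and file handles as "efile"):

    %% Everything that counts against the emulator's descriptor budget:
    length(erlang:ports()).

    %% Just the TCP sockets -- if this is the number that grows, it
    %% lines up with the CLOSE_WAIT leak rather than couch_file:
    length([P || P <- erlang:ports(),
                 erlang:port_info(P, name) =:= {name, "tcp_inet"}]).

    %% The fd limit the emulator saw at startup:
    proplists:get_value(max_fds, erlang:system_info(check_io)).

If the tcp_inet count tracks the CLOSE_WAIT count from netstat, that pins the leak to the HTTP layer.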