Oh, well, I haven't seen that crash for a couple of days. For me, the solution to the problem is getting rid of the (crashing) Python process that forwards requests to couchdb-lucene. The latest versions of couchdb-lucene don't need it at all - they can work with CouchDB's proxying feature. That isn't too straightforward since I am using the "ektorplucene" library on the JVM which doesn't support the new URL scheme though.
I'll probably tweak the library a bit, get rid of the Python process, and report back. Best, Hristo On Fri, Dec 9, 2011 at 2:24 PM, Hristo Deshev <[email protected]> wrote: > Hi CGS, > > Thanks for your help! I dug deeper in that log file and found a lot of "OS > process timed out." errors generated both by the <0.86.0> process working > with one database and other processes working with other databases. I think > the process in question is the python script that forwards requests to > couchdb-lucene. Now, right after the last timeout error, I see this: > > [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, > {<0.86.0>,supervisor_report, > [{supervisor,{local,couch_secondary_services}}, > {errorContext,shutdown}, > {reason,reached_max_restart_intensity}, > {offender, > [{pid,<0.8520.1>}, > {name,external_manager}, > > {mfargs,{couch_external_manager,start_link,[]}}, > {restart_type,permanent}, > {shutdown,1000}, > {child_type,worker}]}]}} > > ...with the "reached_max_restart_intensity" part looking fishy to me. I > couldn't find any docs on that one, but I assume that the process dies too > often and CouchDB gives up restarting it. But why would that bring the > entire DB server down instead of "just" breaking couchdb-lucene queries? > > I already did some changes that should make the Lucene timeouts happen a > lot less often. Is there a setting that I can change and up the restart > intensity value? > > I checked that the disk space and max file sizes are okay. I have both > enough free space and I host larger files on that filesystem. I am pasting > the log file snippet below. > > Best, > Hristo > > =========== > > [Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, > {<0.86.0>,supervisor_report, > [{supervisor,{local,couch_secondary_services}}, > {errorContext,child_terminated}, > {reason,normal}, > {offender, > [{pid,<0.132.0>}, > {name,external_manager}, > > {mfargs,{couch_external_manager,start_link,[]}}, > {restart_type,permanent}, > {shutdown,1000}, > {child_type,worker}]}]}} > [Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.486.0>] Uncaught error in HTTP > request: {exit, > {{bad_return_value, > {os_process_error, > "OS process timed > out."}}, > {gen_server,call, > [<0.525.0>, > {execute, > {[{<<"info">>, > {[{db_name, > > <<"database1">>}, > {doc_count, > 13343011}, > > {doc_del_count, > 0}, > {update_seq, > 13343013}, > > {purge_seq,0}, > > {compact_running, > false}, > {disk_size, > 6628601611 > }, > > {instance_start_time, > > <<"1323371411352029">>}, > > {disk_format_version, > 5}, > > {committed_update_seq, > > 13342813}]}}, > {<<"id">>,null}, > {<<"uuid">>, > > <<"0b482d09cd97c1bada0d53cfab9f14b2">>}, > {<<"method">>, > 'GET'}, > > {<<"requested_path">>, > > [<<"database1">>, > <<"_fti">>, > <<"_design">>, > > <<"log-search">>, > > <<"everything?q=text:error&limit=1">>]}, > {<<"path">>, > > [<<"database1">>, > <<"_fti">>, > <<"_design">>, > > <<"log-search">>, > > <<"everything">>]}, > {<<"query">>, > {[{<<"q">>, > > <<"text:error">>}, > {<<"limit">>, > <<"1">>}]}}, > {<<"headers">>, > > {[{<<"Connection">>, > > <<"Keep-Alive">>}, > {<<"Host">>, > > <<"couch2.internal:5984">>}]}}, > {<<"body">>, > undefined}, > {<<"peer">>, > > <<"127.0.0.1">>}, > > {<<"form">>,{[]}}, > > {<<"cookie">>,{[]}}, > {<<"userCtx">>, > {[{<<"db">>, > > <<"database1">>}, > {<<"name">>, > null}, > {<<"roles">>, > > [<<"_admin">>]}]}}, > {<<"secObj">>, > {[]}}]}}, > infinity]}}} > > > ... > [A LOT MORE MESSAGES LIKE THE PREVIOUS ONE] > ... > > [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, > {<0.86.0>,supervisor_report, > [{supervisor,{local,couch_secondary_services}}, > {errorContext,child_terminated}, > {reason,normal}, > {offender, > [{pid,<0.8520.1>}, > {name,external_manager}, > > {mfargs,{couch_external_manager,start_link,[]}}, > {restart_type,permanent}, > {shutdown,1000}, > {child_type,worker}]}]}} > [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>, > {<0.86.0>,supervisor_report, > [{supervisor,{local,couch_secondary_services}}, > {errorContext,shutdown}, > {reason,reached_max_restart_intensity}, > {offender, > [{pid,<0.8520.1>}, > {name,external_manager}, > > {mfargs,{couch_external_manager,start_link,[]}}, > {restart_type,permanent}, > {shutdown,1000}, > {child_type,worker}]}]}} > [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.486.0>] Uncaught error in HTTP > request: {exit, > {{bad_return_value, > {os_process_error, > "OS process timed > out."}}, > {gen_server,call, > [<0.8529.1>, > {execute, > {[{<<"info">>, > {[{db_name, > > <<"database2">>}, > {doc_count, > 7416681}, > > {doc_del_count, > 0}, > {update_seq, > 7416683}, > > {purge_seq,0}, > > {compact_running, > false}, > {disk_size, > 4226396124}, > > {instance_start_time, > > <<"1323371428396966">>}, > > {disk_format_version, > 5}, > > {committed_update_seq, > 7416483}]}}, > {<<"id">>,null}, > {<<"uuid">>, > > <<"87792b1884f91df492a069ec2e3d76a8">>}, > {<<"method">>, > 'GET'}, > > {<<"requested_path">>, > > [<<"database2">>, > <<"_fti">>, > <<"_design">>, > > <<"log-search">>, > > <<"everything?q=text:error&limit=1">>]}, > {<<"path">>, > > [<<"database2">>, > <<"_fti">>, > <<"_design">>, > > <<"log-search">>, > > <<"everything">>]}, > {<<"query">>, > {[{<<"q">>, > > <<"text:error">>}, > {<<"limit">>, > <<"1">>}]}}, > {<<"headers">>, > > {[{<<"Connection">>, > > <<"Keep-Alive">>}, > {<<"Host">>, > > <<"couch2.internal:5984">>}]}}, > {<<"body">>, > undefined}, > {<<"peer">>, > > <<"127.0.0.1">>}, > > {<<"form">>,{[]}}, > > {<<"cookie">>,{[]}}, > {<<"userCtx">>, > {[{<<"db">>, > > <<"database2">>}, > {<<"name">>, > null}, > {<<"roles">>, > > [<<"_admin">>]}]}}, > {<<"secObj">>, > {[]}}]}}, > infinity]}}} > [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>, > {<0.78.0>,supervisor_report, > [{supervisor,{local,couch_server_sup}}, > {errorContext,child_terminated}, > {reason,shutdown}, > {offender, > [{pid,<0.86.0>}, > {name,couch_secondary_services}, > {mfargs, > > {couch_server_sup,start_secondary_services, > []}}, > {restart_type,permanent}, > {shutdown,infinity}, > {child_type,supervisor}]}]}} > =========== > > > On Fri, Dec 9, 2011 at 11:10 AM, CGS <[email protected]> wrote: > >> Hi Hristo, >> >> The problem is before what you presented from your log. For example, >> could you copy-paste information about process <0.86.0>? From the look of >> the log, it seems something (I couldn't find what from the part of the log >> you presented, but it seems that process <0.86.0> could provide some hints >> about that something) is crashing the generic server (without stopping the >> application) repeatedly and that's the reason for which you cannot access >> your documents. That explains why you are able to "fix" the problem by >> restarting CouchDB and also your message "OS process has timed out" (which >> is the effect, and not the cause of that problem). What is crashing your >> server, no idea, but the log should be able to provide the necessary >> information. >> >> It may be also a good idea to check the space on your harddisk or the >> maximum file size allowed by your OS. But firstly, I would check what >> information that process <0.86.0> could provide. >> >> CGS >> >> >> >> >> >> On 12/08/2011 10:41 PM, Hristo Deshev wrote: >> >>> Hi everyone, >>> >>> I moved some data from an Amazon EC2 small instance to a large one and in >>> the process upgraded from CouchDB 1.1.0 to CouchDB 1.1.1. I also went >>> with >>> Erlang R14B04 instead of R14B03 (Hurray for commando updates!) and now my >>> CouchDB instance seems to sometimes die when under heavy load. By >>> "dying" I >>> mean that the beam process seems to stay in memory, but the HTTP server >>> is >>> gone and no requests get served. For now I "fix" this by stopping and >>> restarting the process. >>> >>> Here are some details on my setup. The server is running a 64-bit Ubuntu >>> Server (Oneiric) Amazon EC2 image on a large instance with 2 CPU cores >>> and >>> 7.5 GB RAM. I build both Erlang and CouchDB from source. I collect log >>> entries and bulk insert them in batches of up to 200 documents. I also >>> run >>> couchdb-lucene on the same host and I *think* most of the crashes happen >>> when couchdb-lucene is running a tough query and is hogging the CPU or >>> the >>> HDD. I have some largish db's (~50 million documents, ~25 GB in disk >>> space). I plan on splitting my dbs into smaller ones. I hope that gets me >>> more responsive file access and faster full text index searches. I think >>> my >>> lucene indexes may be getting too large for that machine's memory and it >>> can't serve them too well. I frequently get "OS process has timed out" >>> errors when trying to query those indexes. Anyway, that shouldn't be >>> crashing the core couchdb process, right? >>> >>> I am pasting my idea of what the relevant portion of the couchdb log file >>> is below, hoping somebody could decipher something out of it. Am I >>> correct >>> in thinking that the "** Reason for termination == ** {timeout," part >>> means >>> the process is crashing since writing to or reading from a file timed >>> out? >>> Any help is greatly appreciated. >>> >>> Best, >>> Hristo >>> >>> =============== >>> >>> [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] >>> {error_report,<0.31.0>, >>> {<0.78.0>,supervisor_report, >>> [{supervisor,{local,couch_**server_sup}}, >>> {errorContext,child_**terminated}, >>> {reason,shutdown}, >>> {offender, >>> [{pid,<0.86.0>}, >>> {name,couch_secondary_**services}, >>> {mfargs, >>> >>> {couch_server_sup,start_**secondary_services, >>> []}}, >>> {restart_type,permanent}, >>> {shutdown,infinity}, >>> {child_type,supervisor}]}]}} >>> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] ** Generic server >>> <0.407.0> terminating >>> ** Last message in was delayed_commit >>> ** When Server state == {db,<0.406.0>,<0.407.0>,nil,<<** >>> "1323371423957954">>, >>> <0.404.0>,<0.408.0>, >>> {db_header,5,204982,0, >>> {199491055,{204980,0}}, >>> {199498140,204980}, >>> {111685732,[]}, >>> 0,nil,nil,1000}, >>> 204982, >>> {btree,<0.404.0>, >>> {199513565,{205011,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.404.0>, >>> {199518784,205011}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.404.0>, >>> {111685732,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 205013, >>> <<"database1">>, >>> "/data/couchdb/data/database1.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> #Ref<0.0.30.131014>,1000, >>> [before_header,after_header,**on_file_open], >>> false} >>> ** Reason for termination == >>> ** {timeout, >>> {gen_server,call, >>> [<0.406.0>, >>> {db_updated, >>> >>> {db,<0.406.0>,<0.407.0>,nil,<<**"1323371423957954">>,<0.404.0>**, >>> <0.408.0>, >>> {db_header,5,205013,0, >>> {199513565,{205011,0}}, >>> {199518784,205011}, >>> {111685732,[]}, >>> 0,nil,nil,1000}, >>> 205013, >>> {btree,<0.404.0>, >>> {199513565,{205011,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.404.0>, >>> {199518784,205011}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.404.0>, >>> {111685732,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 205013, >>> <<"database1">>, >>> "/data/couchdb/data/database1.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> nil,1000, >>> [before_header,after_header,**on_file_open], >>> false}}]}} >>> >>> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] >>> {error_report,<0.31.0>, >>> {<0.407.0>,crash_report, >>> >>> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**, >>> {pid,<0.407.0>}, >>> {registered_name,[]}, >>> {error_info, >>> {exit, >>> {timeout, >>> {gen_server,call, >>> [<0.406.0>, >>> {db_updated, >>> {db,<0.406.0>,<0.407.0>,nil, >>> <<"1323371423957954">>,<0.404.** >>> 0>,<0.408.0>, >>> {db_header,5,205013,0, >>> {199513565,{205011,0}}, >>> {199518784,205011}, >>> {111685732,[]}, >>> 0,nil,nil,1000}, >>> 205013, >>> {btree,<0.404.0>, >>> {199513565,{205011,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.404.0>, >>> {199518784,205011}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.404.0>, >>> {111685732,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 205013, >>> <<"database1">>, >>> "/data/couchdb/data/database1.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> nil,1000, >>> [before_header,after_header,** >>> on_file_open], >>> false}}]}}, >>> [{gen_server,terminate,6}, >>> {proc_lib,init_p_do_apply,3}]}**}, >>> {ancestors,[<0.406.0>,<0.403.**0>]}, >>> {messages,[{'EXIT',<0.406.0>,**shutdown}]}, >>> {links,[]}, >>> {dictionary,[]}, >>> {trap_exit,true}, >>> {status,running}, >>> {heap_size,28657}, >>> {stack_size,24}, >>> {reductions,4487709}], >>> []]}} >>> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] ** Generic server >>> <0.178.0> terminating >>> ** Last message in was {update_docs,<0.2027.0>, >>> [[{doc,<<"**55e776b94547442ab17b82bd1a0598** >>> 43">>, >>> {1, >>> [<<102,77,172,235,192,72,84,** >>> 223,58,68,105, >>> 199,153,147,196,81>>]}, >>> {[{<<"host">>,<<"Host1">>}, >>> {<<"time">>,1323375464000}, >>> {<<"text">>, >>> <<"Some text">>}, >>> {<<"level">>,0}, >>> {<<"source">>,<<"source1">>}, >>> {<<"type">>,<<"Entry1">>}]}, >>> [],false,[]}], >>> >>> ... >>> [[A BUNCH OF DOCS HERE]] >>> ... >>> >>> >>> {[{<<"host">>,<<"Host1">>}, >>> {<<"time">>,1323375467000}, >>> {<<"text">>, >>> <<"Some text">>}, >>> {<<"level">>,0}, >>> {<<"source">>,<<"source1">>}, >>> {<<"type">>,<<"Entry1">>}]}, >>> [],false,[]}]], >>> [],false,false} >>> ** When Server state == {db,<0.177.0>,<0.178.0>,nil,<<** >>> "1323371411352029">>, >>> <0.175.0>,<0.179.0>, >>> {db_header,5,13636863,0, >>> {6776455960,{13636861,0}}, >>> {6776479023,13636861}, >>> {1039786,[]}, >>> 0,nil,nil,1000}, >>> 13636863, >>> {btree,<0.175.0>, >>> {6776455960,{13636861,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.175.0>, >>> {6776479023,13636861}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.175.0>, >>> {1039786,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 13636863, >>> <<"database2">>, >>> "/data/couchdb/data/database2.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> nil,1000, >>> [before_header,after_header,**on_file_open], >>> false} >>> ** Reason for termination == >>> ** {timeout, >>> {gen_server,call, >>> [<0.177.0>, >>> {db_updated, >>> >>> {db,<0.177.0>,<0.178.0>,nil,<<**"1323371411352029">>,<0.175.0>**, >>> <0.179.0>, >>> {db_header,5,13636863,0, >>> {6776455960,{13636861,0}}, >>> {6776479023,13636861}, >>> {1039786,[]}, >>> 0,nil,nil,1000}, >>> 13636863, >>> {btree,<0.175.0>, >>> {6776557909,{13637061,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.175.0>, >>> {6776580448,13637061}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.175.0>, >>> {1039786,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 13637063, >>> <<"database2">>, >>> "/data/couchdb/data/database2.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> #Ref<0.0.30.133811>,1000, >>> [before_header,after_header,**on_file_open], >>> false}}]}} >>> >>> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] >>> {error_report,<0.31.0>, >>> {<0.178.0>,crash_report, >>> >>> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**, >>> {pid,<0.178.0>}, >>> {registered_name,[]}, >>> {error_info, >>> {exit, >>> {timeout, >>> {gen_server,call, >>> [<0.177.0>, >>> {db_updated, >>> {db,<0.177.0>,<0.178.0>,nil, >>> <<"1323371411352029">>,<0.175.** >>> 0>,<0.179.0>, >>> {db_header,5,13636863,0, >>> {6776455960,{13636861,0}}, >>> {6776479023,13636861}, >>> {1039786,[]}, >>> 0,nil,nil,1000}, >>> 13636863, >>> {btree,<0.175.0>, >>> {6776557909,{13637061,0}}, >>> #Fun<couch_db_updater.10.**19222179>, >>> #Fun<couch_db_updater.11.**21515767>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.12.**93888648>}, >>> {btree,<0.175.0>, >>> {6776580448,13637061}, >>> #Fun<couch_db_updater.13.**40165027>, >>> #Fun<couch_db_updater.14.**82810239>, >>> #Fun<couch_btree.5.112258129>, >>> #Fun<couch_db_updater.15.104121193>}, >>> {btree,<0.175.0>, >>> {1039786,[]}, >>> #Fun<couch_btree.0.23070627>, >>> #Fun<couch_btree.1.117278773>, >>> #Fun<couch_btree.2.112258129>,**nil}, >>> 13637063, >>> <<"database2">>, >>> "/data/couchdb/data/database2.**couch", >>> [],[],nil, >>> {user_ctx,null,[],undefined}, >>> #Ref<0.0.30.133811>,1000, >>> [before_header,after_header,** >>> on_file_open], >>> false}}]}}, >>> [{gen_server,terminate,6}, >>> {proc_lib,init_p_do_apply,3}]}**}, >>> {ancestors,[<0.177.0>,<0.174.**0>]}, >>> {messages, >>> [{'EXIT',<0.177.0>,shutdown},** >>> delayed_commit]}, >>> {links,[]}, >>> {dictionary,[]}, >>> {trap_exit,true}, >>> {status,running}, >>> {heap_size,121393}, >>> {stack_size,24}, >>> {reductions,83311172}], >>> []]}} >>> >>> >> >
