Hi CGS,
Thanks for your help! I dug deeper in that log file and found a lot of "OS
process timed out." errors generated both by the <0.86.0> process working
with one database and other processes working with other databases. I think
the process in question is the python script that forwards requests to
couchdb-lucene. Now, right after the last timeout error, I see this:
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
{<0.86.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,
[{pid,<0.8520.1>},
{name,external_manager},
{mfargs,{couch_external_manager,start_link,[]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]}]}}
...with the "reached_max_restart_intensity" part looking fishy to me. I
couldn't find any docs on that one, but I assume that the process dies too
often and CouchDB gives up restarting it. But why would that bring the
entire DB server down instead of "just" breaking couchdb-lucene queries?
I already did some changes that should make the Lucene timeouts happen a
lot less often. Is there a setting that I can change and up the restart
intensity value?
I checked that the disk space and max file sizes are okay. I have both
enough free space and I host larger files on that filesystem. I am pasting
the log file snippet below.
Best,
Hristo
===========
[Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
{<0.86.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,child_terminated},
{reason,normal},
{offender,
[{pid,<0.132.0>},
{name,external_manager},
{mfargs,{couch_external_manager,start_link,[]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]}]}}
[Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.486.0>] Uncaught error in HTTP
request: {exit,
{{bad_return_value,
{os_process_error,
"OS process timed
out."}},
{gen_server,call,
[<0.525.0>,
{execute,
{[{<<"info">>,
{[{db_name,
<<"database1">>},
{doc_count,
13343011},
{doc_del_count,
0},
{update_seq,
13343013},
{purge_seq,0},
{compact_running,
false},
{disk_size,
6628601611},
{instance_start_time,
<<"1323371411352029">>},
{disk_format_version,
5},
{committed_update_seq,
13342813}]}},
{<<"id">>,null},
{<<"uuid">>,
<<"0b482d09cd97c1bada0d53cfab9f14b2">>},
{<<"method">>,
'GET'},
{<<"requested_path">>,
[<<"database1">>,
<<"_fti">>,
<<"_design">>,
<<"log-search">>,
<<"everything?q=text:error&limit=1">>]},
{<<"path">>,
[<<"database1">>,
<<"_fti">>,
<<"_design">>,
<<"log-search">>,
<<"everything">>]},
{<<"query">>,
{[{<<"q">>,
<<"text:error">>},
{<<"limit">>,
<<"1">>}]}},
{<<"headers">>,
{[{<<"Connection">>,
<<"Keep-Alive">>},
{<<"Host">>,
<<"couch2.internal:5984">>}]}},
{<<"body">>,
undefined},
{<<"peer">>,
<<"127.0.0.1">>},
{<<"form">>,{[]}},
{<<"cookie">>,{[]}},
{<<"userCtx">>,
{[{<<"db">>,
<<"database1">>},
{<<"name">>,
null},
{<<"roles">>,
[<<"_admin">>]}]}},
{<<"secObj">>,
{[]}}]}},
infinity]}}}
...
[A LOT MORE MESSAGES LIKE THE PREVIOUS ONE]
...
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
{<0.86.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,child_terminated},
{reason,normal},
{offender,
[{pid,<0.8520.1>},
{name,external_manager},
{mfargs,{couch_external_manager,start_link,[]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]}]}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
{<0.86.0>,supervisor_report,
[{supervisor,{local,couch_secondary_services}},
{errorContext,shutdown},
{reason,reached_max_restart_intensity},
{offender,
[{pid,<0.8520.1>},
{name,external_manager},
{mfargs,{couch_external_manager,start_link,[]}},
{restart_type,permanent},
{shutdown,1000},
{child_type,worker}]}]}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.486.0>] Uncaught error in HTTP
request: {exit,
{{bad_return_value,
{os_process_error,
"OS process timed
out."}},
{gen_server,call,
[<0.8529.1>,
{execute,
{[{<<"info">>,
{[{db_name,
<<"database2">>},
{doc_count,
7416681},
{doc_del_count,
0},
{update_seq,
7416683},
{purge_seq,0},
{compact_running,
false},
{disk_size,
4226396124},
{instance_start_time,
<<"1323371428396966">>},
{disk_format_version,
5},
{committed_update_seq,
7416483}]}},
{<<"id">>,null},
{<<"uuid">>,
<<"87792b1884f91df492a069ec2e3d76a8">>},
{<<"method">>,
'GET'},
{<<"requested_path">>,
[<<"database2">>,
<<"_fti">>,
<<"_design">>,
<<"log-search">>,
<<"everything?q=text:error&limit=1">>]},
{<<"path">>,
[<<"database2">>,
<<"_fti">>,
<<"_design">>,
<<"log-search">>,
<<"everything">>]},
{<<"query">>,
{[{<<"q">>,
<<"text:error">>},
{<<"limit">>,
<<"1">>}]}},
{<<"headers">>,
{[{<<"Connection">>,
<<"Keep-Alive">>},
{<<"Host">>,
<<"couch2.internal:5984">>}]}},
{<<"body">>,
undefined},
{<<"peer">>,
<<"127.0.0.1">>},
{<<"form">>,{[]}},
{<<"cookie">>,{[]}},
{<<"userCtx">>,
{[{<<"db">>,
<<"database2">>},
{<<"name">>,
null},
{<<"roles">>,
[<<"_admin">>]}]}},
{<<"secObj">>,
{[]}}]}},
infinity]}}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>,
{<0.78.0>,supervisor_report,
[{supervisor,{local,couch_server_sup}},
{errorContext,child_terminated},
{reason,shutdown},
{offender,
[{pid,<0.86.0>},
{name,couch_secondary_services},
{mfargs,
{couch_server_sup,start_secondary_services,
[]}},
{restart_type,permanent},
{shutdown,infinity},
{child_type,supervisor}]}]}}
===========
On Fri, Dec 9, 2011 at 11:10 AM, CGS <[email protected]> wrote:
> Hi Hristo,
>
> The problem is before what you presented from your log. For example, could
> you copy-paste information about process <0.86.0>? From the look of the
> log, it seems something (I couldn't find what from the part of the log you
> presented, but it seems that process <0.86.0> could provide some hints
> about that something) is crashing the generic server (without stopping the
> application) repeatedly and that's the reason for which you cannot access
> your documents. That explains why you are able to "fix" the problem by
> restarting CouchDB and also your message "OS process has timed out" (which
> is the effect, and not the cause of that problem). What is crashing your
> server, no idea, but the log should be able to provide the necessary
> information.
>
> It may be also a good idea to check the space on your harddisk or the
> maximum file size allowed by your OS. But firstly, I would check what
> information that process <0.86.0> could provide.
>
> CGS
>
>
>
>
>
> On 12/08/2011 10:41 PM, Hristo Deshev wrote:
>
>> Hi everyone,
>>
>> I moved some data from an Amazon EC2 small instance to a large one and in
>> the process upgraded from CouchDB 1.1.0 to CouchDB 1.1.1. I also went with
>> Erlang R14B04 instead of R14B03 (Hurray for commando updates!) and now my
>> CouchDB instance seems to sometimes die when under heavy load. By "dying"
>> I
>> mean that the beam process seems to stay in memory, but the HTTP server is
>> gone and no requests get served. For now I "fix" this by stopping and
>> restarting the process.
>>
>> Here are some details on my setup. The server is running a 64-bit Ubuntu
>> Server (Oneiric) Amazon EC2 image on a large instance with 2 CPU cores and
>> 7.5 GB RAM. I build both Erlang and CouchDB from source. I collect log
>> entries and bulk insert them in batches of up to 200 documents. I also run
>> couchdb-lucene on the same host and I *think* most of the crashes happen
>> when couchdb-lucene is running a tough query and is hogging the CPU or the
>> HDD. I have some largish db's (~50 million documents, ~25 GB in disk
>> space). I plan on splitting my dbs into smaller ones. I hope that gets me
>> more responsive file access and faster full text index searches. I think
>> my
>> lucene indexes may be getting too large for that machine's memory and it
>> can't serve them too well. I frequently get "OS process has timed out"
>> errors when trying to query those indexes. Anyway, that shouldn't be
>> crashing the core couchdb process, right?
>>
>> I am pasting my idea of what the relevant portion of the couchdb log file
>> is below, hoping somebody could decipher something out of it. Am I correct
>> in thinking that the "** Reason for termination == ** {timeout," part
>> means
>> the process is crashing since writing to or reading from a file timed out?
>> Any help is greatly appreciated.
>>
>> Best,
>> Hristo
>>
>> ===============
>>
>> [Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>,
>> {<0.78.0>,supervisor_report,
>> [{supervisor,{local,couch_**server_sup}},
>> {errorContext,child_**terminated},
>> {reason,shutdown},
>> {offender,
>> [{pid,<0.86.0>},
>> {name,couch_secondary_**services},
>> {mfargs,
>>
>> {couch_server_sup,start_**secondary_services,
>> []}},
>> {restart_type,permanent},
>> {shutdown,infinity},
>> {child_type,supervisor}]}]}}
>> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] ** Generic server
>> <0.407.0> terminating
>> ** Last message in was delayed_commit
>> ** When Server state == {db,<0.406.0>,<0.407.0>,nil,<<**
>> "1323371423957954">>,
>> <0.404.0>,<0.408.0>,
>> {db_header,5,204982,0,
>> {199491055,{204980,0}},
>> {199498140,204980},
>> {111685732,[]},
>> 0,nil,nil,1000},
>> 204982,
>> {btree,<0.404.0>,
>> {199513565,{205011,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.404.0>,
>> {199518784,205011},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.404.0>,
>> {111685732,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 205013,
>> <<"database1">>,
>> "/data/couchdb/data/database1.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> #Ref<0.0.30.131014>,1000,
>> [before_header,after_header,**on_file_open],
>> false}
>> ** Reason for termination ==
>> ** {timeout,
>> {gen_server,call,
>> [<0.406.0>,
>> {db_updated,
>>
>> {db,<0.406.0>,<0.407.0>,nil,<<**"1323371423957954">>,<0.404.0>**,
>> <0.408.0>,
>> {db_header,5,205013,0,
>> {199513565,{205011,0}},
>> {199518784,205011},
>> {111685732,[]},
>> 0,nil,nil,1000},
>> 205013,
>> {btree,<0.404.0>,
>> {199513565,{205011,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.404.0>,
>> {199518784,205011},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.404.0>,
>> {111685732,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 205013,
>> <<"database1">>,
>> "/data/couchdb/data/database1.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> nil,1000,
>> [before_header,after_header,**on_file_open],
>> false}}]}}
>>
>> [Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>]
>> {error_report,<0.31.0>,
>> {<0.407.0>,crash_report,
>>
>> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**,
>> {pid,<0.407.0>},
>> {registered_name,[]},
>> {error_info,
>> {exit,
>> {timeout,
>> {gen_server,call,
>> [<0.406.0>,
>> {db_updated,
>> {db,<0.406.0>,<0.407.0>,nil,
>> <<"1323371423957954">>,<0.404.**
>> 0>,<0.408.0>,
>> {db_header,5,205013,0,
>> {199513565,{205011,0}},
>> {199518784,205011},
>> {111685732,[]},
>> 0,nil,nil,1000},
>> 205013,
>> {btree,<0.404.0>,
>> {199513565,{205011,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.404.0>,
>> {199518784,205011},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.404.0>,
>> {111685732,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 205013,
>> <<"database1">>,
>> "/data/couchdb/data/database1.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> nil,1000,
>> [before_header,after_header,**
>> on_file_open],
>> false}}]}},
>> [{gen_server,terminate,6},
>> {proc_lib,init_p_do_apply,3}]}**},
>> {ancestors,[<0.406.0>,<0.403.**0>]},
>> {messages,[{'EXIT',<0.406.0>,**shutdown}]},
>> {links,[]},
>> {dictionary,[]},
>> {trap_exit,true},
>> {status,running},
>> {heap_size,28657},
>> {stack_size,24},
>> {reductions,4487709}],
>> []]}}
>> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] ** Generic server
>> <0.178.0> terminating
>> ** Last message in was {update_docs,<0.2027.0>,
>> [[{doc,<<"**55e776b94547442ab17b82bd1a0598**
>> 43">>,
>> {1,
>> [<<102,77,172,235,192,72,84,**
>> 223,58,68,105,
>> 199,153,147,196,81>>]},
>> {[{<<"host">>,<<"Host1">>},
>> {<<"time">>,1323375464000},
>> {<<"text">>,
>> <<"Some text">>},
>> {<<"level">>,0},
>> {<<"source">>,<<"source1">>},
>> {<<"type">>,<<"Entry1">>}]},
>> [],false,[]}],
>>
>> ...
>> [[A BUNCH OF DOCS HERE]]
>> ...
>>
>>
>> {[{<<"host">>,<<"Host1">>},
>> {<<"time">>,1323375467000},
>> {<<"text">>,
>> <<"Some text">>},
>> {<<"level">>,0},
>> {<<"source">>,<<"source1">>},
>> {<<"type">>,<<"Entry1">>}]},
>> [],false,[]}]],
>> [],false,false}
>> ** When Server state == {db,<0.177.0>,<0.178.0>,nil,<<**
>> "1323371411352029">>,
>> <0.175.0>,<0.179.0>,
>> {db_header,5,13636863,0,
>> {6776455960,{13636861,0}},
>> {6776479023,13636861},
>> {1039786,[]},
>> 0,nil,nil,1000},
>> 13636863,
>> {btree,<0.175.0>,
>> {6776455960,{13636861,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.175.0>,
>> {6776479023,13636861},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.175.0>,
>> {1039786,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 13636863,
>> <<"database2">>,
>> "/data/couchdb/data/database2.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> nil,1000,
>> [before_header,after_header,**on_file_open],
>> false}
>> ** Reason for termination ==
>> ** {timeout,
>> {gen_server,call,
>> [<0.177.0>,
>> {db_updated,
>>
>> {db,<0.177.0>,<0.178.0>,nil,<<**"1323371411352029">>,<0.175.0>**,
>> <0.179.0>,
>> {db_header,5,13636863,0,
>> {6776455960,{13636861,0}},
>> {6776479023,13636861},
>> {1039786,[]},
>> 0,nil,nil,1000},
>> 13636863,
>> {btree,<0.175.0>,
>> {6776557909,{13637061,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.175.0>,
>> {6776580448,13637061},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.175.0>,
>> {1039786,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 13637063,
>> <<"database2">>,
>> "/data/couchdb/data/database2.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> #Ref<0.0.30.133811>,1000,
>> [before_header,after_header,**on_file_open],
>> false}}]}}
>>
>> [Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>]
>> {error_report,<0.31.0>,
>> {<0.178.0>,crash_report,
>>
>> [[{initial_call,{couch_db_**updater,init,['Argument__1']}}**,
>> {pid,<0.178.0>},
>> {registered_name,[]},
>> {error_info,
>> {exit,
>> {timeout,
>> {gen_server,call,
>> [<0.177.0>,
>> {db_updated,
>> {db,<0.177.0>,<0.178.0>,nil,
>> <<"1323371411352029">>,<0.175.**
>> 0>,<0.179.0>,
>> {db_header,5,13636863,0,
>> {6776455960,{13636861,0}},
>> {6776479023,13636861},
>> {1039786,[]},
>> 0,nil,nil,1000},
>> 13636863,
>> {btree,<0.175.0>,
>> {6776557909,{13637061,0}},
>> #Fun<couch_db_updater.10.**19222179>,
>> #Fun<couch_db_updater.11.**21515767>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.12.**93888648>},
>> {btree,<0.175.0>,
>> {6776580448,13637061},
>> #Fun<couch_db_updater.13.**40165027>,
>> #Fun<couch_db_updater.14.**82810239>,
>> #Fun<couch_btree.5.112258129>,
>> #Fun<couch_db_updater.15.104121193>},
>> {btree,<0.175.0>,
>> {1039786,[]},
>> #Fun<couch_btree.0.23070627>,
>> #Fun<couch_btree.1.117278773>,
>> #Fun<couch_btree.2.112258129>,**nil},
>> 13637063,
>> <<"database2">>,
>> "/data/couchdb/data/database2.**couch",
>> [],[],nil,
>> {user_ctx,null,[],undefined},
>> #Ref<0.0.30.133811>,1000,
>> [before_header,after_header,**
>> on_file_open],
>> false}}]}},
>> [{gen_server,terminate,6},
>> {proc_lib,init_p_do_apply,3}]}**},
>> {ancestors,[<0.177.0>,<0.174.**0>]},
>> {messages,
>> [{'EXIT',<0.177.0>,shutdown},**delayed_commit]},
>> {links,[]},
>> {dictionary,[]},
>> {trap_exit,true},
>> {status,running},
>> {heap_size,121393},
>> {stack_size,24},
>> {reductions,83311172}],
>> []]}}
>>
>>
>