Hi Hristo,

First, please forgive my absence; I haven't had time to keep up with my e-mail.

Second, yes, it seems a request is blocking your process and SpiderMonkey emits an error that CouchDB does not catch (possibly an incompatibility between your SpiderMonkey and your version of CouchDB). I suppose CouchDB expects a number describing that kind of error, while your SpiderMonkey reports a human-readable error description. To avoid that, you need to check your filter for flaws. Alternatively, a developer may eventually take care of the compatibility of SpiderMonkey's error descriptions.

Beyond that, all you can do is follow the chain of "offender" entries to see when the first crash was registered, and eliminate the source of the crash if possible. It may be that your process takes too long and SpiderMonkey stops it while reporting an error description that is not handled in CouchDB's Erlang core code; in that case, only splitting the db can help. Otherwise, identifying the document on which your request crashes might help.
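
To follow that chain without reading the whole file by eye, something like the rough Python sketch below could list the supervisor reports in log order. The field patterns are assumptions based only on the log excerpts in this thread, and `supervisor_reports` and `SAMPLE` are names made up for the illustration; adjust them to your actual log.

```python
import re

# Header of a log entry as seen in this thread: "[<date>] [<level>] [<pid>]".
ENTRY_START = re.compile(r"^\[([^\]]+)\] \[(\w+)\] \[(<[\d.]+>)\]", re.MULTILINE)

# Patterns for the supervisor_report fields (assumed from the pasted excerpts).
FIELDS = {
    "supervisor": re.compile(r"\{supervisor,\s*\{local,\s*(\w+)\}\}"),
    "context": re.compile(r"\{errorContext,\s*(\w+)\}"),
    "reason": re.compile(r"\{reason,\s*(\w+)"),
    "offender": re.compile(r"\{name,\s*(\w+)\}"),
}


def supervisor_reports(log_text):
    """Return one dict per supervisor_report entry, in the order logged."""
    starts = list(ENTRY_START.finditer(log_text))
    reports = []
    for i, start in enumerate(starts):
        # An entry runs from its header to the next header (or end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(log_text)
        body = log_text[start.end():end]
        if "supervisor_report" not in body:
            continue
        report = {"time": start.group(1)}
        for key, pattern in FIELDS.items():
            found = pattern.search(body)
            report[key] = found.group(1) if found else None
        reports.append(report)
    return reports


# Shortened copies of two entries from this thread, for demonstration only.
SAMPLE = """\
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
    {<0.86.0>,supervisor_report,
     [{supervisor,{local,couch_secondary_services}},
      {errorContext,child_terminated},
      {reason,normal},
      {offender,[{pid,<0.8520.1>},{name,external_manager}]}]}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
    {<0.86.0>,supervisor_report,
     [{supervisor,{local,couch_secondary_services}},
      {errorContext,shutdown},
      {reason,reached_max_restart_intensity},
      {offender,[{pid,<0.8520.1>},{name,external_manager}]}]}}
"""

for report in supervisor_reports(SAMPLE):
    print(report["time"], report["supervisor"], report["context"],
          report["reason"], report["offender"])
```

The first line printed for a given offender is the place to start looking for the original cause.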

CGS




On 12/12/2011 11:53 AM, Hristo Deshev wrote:
Oh, well, I haven't seen that crash for a couple of days. For me, the
solution to the problem is getting rid of the (crashing) Python process
that forwards requests to couchdb-lucene. The latest versions of
couchdb-lucene don't need it at all - they can work with CouchDB's proxying
feature. That isn't entirely straightforward, though, since I am using the
"ektorplucene" library on the JVM, which doesn't support the new URL scheme.

I'll probably tweak the library a bit, get rid of the Python process, and
report back.
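
For reference, the path change I expect to need looks roughly like the Python sketch below. It assumes the layout I understand from the couchdb-lucene README, where proxied requests go to /_fti/<key>/<db>/... and <key> defaults to "local"; please verify that against your couchdb-lucene version. The function name `to_proxy_path` is made up for the illustration.

```python
def to_proxy_path(old_path, key="local"):
    """Rewrite an old-style "/<db>/_fti/..." path to "/_fti/<key>/<db>/...".

    `key` is assumed to be the name configured on the couchdb-lucene side.
    """
    path, sep, query = old_path.partition("?")
    parts = path.lstrip("/").split("/")
    if len(parts) < 3 or parts[1] != "_fti":
        raise ValueError("not an old-style _fti path: %r" % old_path)
    db, rest = parts[0], parts[2:]
    return "/" + "/".join(["_fti", key, db] + rest) + sep + query


# Path taken from the requests in my log.
print(to_proxy_path("/database1/_fti/_design/log-search/everything?q=text:error&limit=1"))
```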

Best,
Hristo


On Fri, Dec 9, 2011 at 2:24 PM, Hristo Deshev <[email protected]> wrote:

Hi CGS,

Thanks for your help! I dug deeper into that log file and found a lot of "OS
process timed out." errors generated both by the <0.86.0> process working
with one database and by other processes working with other databases. I think
the process in question is the Python script that forwards requests to
couchdb-lucene. Now, right after the last timeout error, I see this:

[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
                        {<0.86.0>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,shutdown},
                          {reason,reached_max_restart_intensity},
                          {offender,
                              [{pid,<0.8520.1>},
                               {name,external_manager},
                               {mfargs,{couch_external_manager,start_link,[]}},
                               {restart_type,permanent},
                               {shutdown,1000},
                               {child_type,worker}]}]}}

...with the "reached_max_restart_intensity" part looking fishy to me. I
couldn't find any docs on that one, but I assume that the process dies too
often and CouchDB gives up restarting it. But why would that bring the
entire DB server down instead of "just" breaking couchdb-lucene queries?

I already made some changes that should make the Lucene timeouts happen a
lot less often. Is there a setting I can change to raise the restart
intensity value?
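
To make my assumption concrete, here is a rough Python sketch of how I understand an OTP supervisor's restart intensity to work: it tolerates at most a fixed number of restarts within a time window and gives up once that is exceeded, at which point the failure is passed up to its own supervisor. The class name and the numbers are made up for illustration; they are not CouchDB's actual values.

```python
from collections import deque


class RestartTracker:
    """Sliding-window restart counter, loosely modelled on OTP's MaxR/MaxT."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, now):
        """Record a restart at time `now` (in seconds).

        Returns True if the supervisor would keep going, False if more than
        `max_restarts` restarts fell inside the window and it would give up.
        """
        self._restarts.append(now)
        # Forget restarts that are older than the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) <= self.max_restarts


# Illustrative values only: at most 3 restarts per 60 seconds.
tracker = RestartTracker(max_restarts=3, window_seconds=60)
for t in (0, 10, 20, 30):
    print(t, tracker.record_restart(t))
```

If that picture is right, fewer timeouts per window would keep the supervisor below its limit even without changing the limit itself.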

I checked that the disk space and max file sizes are okay: I have enough
free space, and I already host larger files on that filesystem. I am pasting
the log file snippet below.

Best,
Hristo

===========

[Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
                        {<0.86.0>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,child_terminated},
                          {reason,normal},
                          {offender,
                              [{pid,<0.132.0>},
                               {name,external_manager},
                               {mfargs,{couch_external_manager,start_link,[]}},
                               {restart_type,permanent},
                               {shutdown,1000},
                               {child_type,worker}]}]}}
[Thu, 08 Dec 2011 19:11:17 GMT] [error] [<0.486.0>] Uncaught error in HTTP request: {exit,
    {{bad_return_value,
      {os_process_error,"OS process timed out."}},
     {gen_server,call,
      [<0.525.0>,
       {execute,
        {[{<<"info">>,
           {[{db_name,<<"database1">>},
             {doc_count,13343011},
             {doc_del_count,0},
             {update_seq,13343013},
             {purge_seq,0},
             {compact_running,false},
             {disk_size,6628601611},
             {instance_start_time,<<"1323371411352029">>},
             {disk_format_version,5},
             {committed_update_seq,13342813}]}},
          {<<"id">>,null},
          {<<"uuid">>,<<"0b482d09cd97c1bada0d53cfab9f14b2">>},
          {<<"method">>,'GET'},
          {<<"requested_path">>,
           [<<"database1">>,<<"_fti">>,<<"_design">>,<<"log-search">>,
            <<"everything?q=text:error&limit=1">>]},
          {<<"path">>,
           [<<"database1">>,<<"_fti">>,<<"_design">>,<<"log-search">>,
            <<"everything">>]},
          {<<"query">>,
           {[{<<"q">>,<<"text:error">>},
             {<<"limit">>,<<"1">>}]}},
          {<<"headers">>,
           {[{<<"Connection">>,<<"Keep-Alive">>},
             {<<"Host">>,<<"couch2.internal:5984">>}]}},
          {<<"body">>,undefined},
          {<<"peer">>,<<"127.0.0.1">>},
          {<<"form">>,{[]}},
          {<<"cookie">>,{[]}},
          {<<"userCtx">>,
           {[{<<"db">>,<<"database1">>},
             {<<"name">>,null},
             {<<"roles">>,[<<"_admin">>]}]}},
          {<<"secObj">>,{[]}}]}},
       infinity]}}}


...
[A LOT MORE MESSAGES LIKE THE PREVIOUS ONE]
...

[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
                        {<0.86.0>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,child_terminated},
                          {reason,normal},
                          {offender,
                              [{pid,<0.8520.1>},
                               {name,external_manager},
                               {mfargs,{couch_external_manager,start_link,[]}},
                               {restart_type,permanent},
                               {shutdown,1000},
                               {child_type,worker}]}]}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.86.0>] {error_report,<0.31.0>,
                        {<0.86.0>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,shutdown},
                          {reason,reached_max_restart_intensity},
                          {offender,
                              [{pid,<0.8520.1>},
                               {name,external_manager},
                               {mfargs,{couch_external_manager,start_link,[]}},
                               {restart_type,permanent},
                               {shutdown,1000},
                               {child_type,worker}]}]}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.486.0>] Uncaught error in HTTP request: {exit,
    {{bad_return_value,
      {os_process_error,"OS process timed out."}},
     {gen_server,call,
      [<0.8529.1>,
       {execute,
        {[{<<"info">>,
           {[{db_name,<<"database2">>},
             {doc_count,7416681},
             {doc_del_count,0},
             {update_seq,7416683},
             {purge_seq,0},
             {compact_running,false},
             {disk_size,4226396124},
             {instance_start_time,<<"1323371428396966">>},
             {disk_format_version,5},
             {committed_update_seq,7416483}]}},
          {<<"id">>,null},
          {<<"uuid">>,<<"87792b1884f91df492a069ec2e3d76a8">>},
          {<<"method">>,'GET'},
          {<<"requested_path">>,
           [<<"database2">>,<<"_fti">>,<<"_design">>,<<"log-search">>,
            <<"everything?q=text:error&limit=1">>]},
          {<<"path">>,
           [<<"database2">>,<<"_fti">>,<<"_design">>,<<"log-search">>,
            <<"everything">>]},
          {<<"query">>,
           {[{<<"q">>,<<"text:error">>},
             {<<"limit">>,<<"1">>}]}},
          {<<"headers">>,
           {[{<<"Connection">>,<<"Keep-Alive">>},
             {<<"Host">>,<<"couch2.internal:5984">>}]}},
          {<<"body">>,undefined},
          {<<"peer">>,<<"127.0.0.1">>},
          {<<"form">>,{[]}},
          {<<"cookie">>,{[]}},
          {<<"userCtx">>,
           {[{<<"db">>,<<"database2">>},
             {<<"name">>,null},
             {<<"roles">>,[<<"_admin">>]}]}},
          {<<"secObj">>,{[]}}]}},
       infinity]}}}
[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>,
                        {<0.78.0>,supervisor_report,
                         [{supervisor,{local,couch_server_sup}},
                          {errorContext,child_terminated},
                          {reason,shutdown},
                          {offender,
                              [{pid,<0.86.0>},
                               {name,couch_secondary_services},
                               {mfargs,
                                   {couch_server_sup,start_secondary_services,
                                       []}},
                               {restart_type,permanent},
                               {shutdown,infinity},
                               {child_type,supervisor}]}]}}
===========


On Fri, Dec 9, 2011 at 11:10 AM, CGS <[email protected]> wrote:

Hi Hristo,

The problem starts before the part of the log you posted. For example,
could you copy-paste the information about process <0.86.0>? From the look
of the log, something is repeatedly crashing the generic server (without
stopping the application), and that is why you cannot access your
documents. I couldn't tell what from the part of the log you posted, but
process <0.86.0> may provide some hints. That also explains why you are
able to "fix" the problem by restarting CouchDB, and why you see the "OS
process has timed out" message (which is an effect of the problem, not its
cause). I have no idea what is crashing your server, but the log should
provide the necessary information.

It may also be a good idea to check the free space on your hard disk and
the maximum file size allowed by your OS. But first, I would check what
information process <0.86.0> can provide.

CGS





On 12/08/2011 10:41 PM, Hristo Deshev wrote:

Hi everyone,

I moved some data from an Amazon EC2 small instance to a large one and in
the process upgraded from CouchDB 1.1.0 to CouchDB 1.1.1. I also went with
Erlang R14B04 instead of R14B03 (Hurray for commando updates!) and now my
CouchDB instance seems to sometimes die when under heavy load. By "dying" I
mean that the beam process seems to stay in memory, but the HTTP server is
gone and no requests get served. For now I "fix" this by stopping and
restarting the process.

Here are some details on my setup. The server is running a 64-bit Ubuntu
Server (Oneiric) Amazon EC2 image on a large instance with 2 CPU cores and
7.5 GB RAM. I build both Erlang and CouchDB from source. I collect log
entries and bulk insert them in batches of up to 200 documents. I also run
couchdb-lucene on the same host and I *think* most of the crashes happen
when couchdb-lucene is running a tough query and is hogging the CPU or the
HDD. I have some largish db's (~50 million documents, ~25 GB in disk
space). I plan on splitting my dbs into smaller ones. I hope that gets me
more responsive file access and faster full text index searches. I think my
lucene indexes may be getting too large for that machine's memory and it
can't serve them too well. I frequently get "OS process has timed out"
errors when trying to query those indexes. Anyway, that shouldn't be
crashing the core couchdb process, right?
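
In case the write pattern matters, the batching is roughly equivalent to the Python sketch below: each batch becomes one request body for CouchDB's _bulk_docs endpoint. The function name and the sample documents are placeholders for illustration, not my real code.

```python
import json


def bulk_payloads(docs, batch_size=200):
    """Yield one JSON body per batch, shaped for POST /<db>/_bulk_docs."""
    for start in range(0, len(docs), batch_size):
        yield json.dumps({"docs": docs[start:start + batch_size]})


# Placeholder log entries with the same fields as the documents in my log.
entries = [
    {"host": "Host1", "time": 1323375464000 + i, "text": "Some text",
     "level": 0, "source": "source1", "type": "Entry1"}
    for i in range(450)
]

for body in bulk_payloads(entries):
    print(len(json.loads(body)["docs"]))
```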

I am pasting my idea of what the relevant portion of the couchdb log file
is below, hoping somebody could decipher something out of it. Am I correct
in thinking that the "** Reason for termination == ** {timeout," part means
the process is crashing since writing to or reading from a file timed out?
Any help is greatly appreciated.

Best,
Hristo

===============

[Thu, 08 Dec 2011 20:17:16 GMT] [error] [<0.78.0>] {error_report,<0.31.0>,
                        {<0.78.0>,supervisor_report,
                         [{supervisor,{local,couch_server_sup}},
                          {errorContext,child_terminated},
                          {reason,shutdown},
                          {offender,
                              [{pid,<0.86.0>},
                               {name,couch_secondary_services},
                               {mfargs,
                                   {couch_server_sup,start_secondary_services,
                                       []}},
                               {restart_type,permanent},
                               {shutdown,infinity},
                               {child_type,supervisor}]}]}}
[Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] ** Generic server <0.407.0> terminating
** Last message in was delayed_commit
** When Server state == {db,<0.406.0>,<0.407.0>,nil,<<"1323371423957954">>,
                             <0.404.0>,<0.408.0>,
                             {db_header,5,204982,0,
                                 {199491055,{204980,0}},
                                 {199498140,204980},
                                 {111685732,[]},
                                 0,nil,nil,1000},
                             204982,
                             {btree,<0.404.0>,
                                 {199513565,{205011,0}},
                                 #Fun<couch_db_updater.10.19222179>,
                                 #Fun<couch_db_updater.11.21515767>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.12.93888648>},
                             {btree,<0.404.0>,
                                 {199518784,205011},
                                 #Fun<couch_db_updater.13.40165027>,
                                 #Fun<couch_db_updater.14.82810239>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.15.104121193>},
                             {btree,<0.404.0>,
                                 {111685732,[]},
                                 #Fun<couch_btree.0.23070627>,
                                 #Fun<couch_btree.1.117278773>,
                                 #Fun<couch_btree.2.112258129>,nil},
                             205013,
                             <<"database1">>,
                             "/data/couchdb/data/database1.couch",
                             [],[],nil,
                             {user_ctx,null,[],undefined},
                             #Ref<0.0.30.131014>,1000,
                             [before_header,after_header,on_file_open],
                             false}
** Reason for termination ==
** {timeout,
        {gen_server,call,
            [<0.406.0>,
             {db_updated,
                 {db,<0.406.0>,<0.407.0>,nil,<<"1323371423957954">>,<0.404.0>,
                     <0.408.0>,
                     {db_header,5,205013,0,
                         {199513565,{205011,0}},
                         {199518784,205011},
                         {111685732,[]},
                         0,nil,nil,1000},
                     205013,
                     {btree,<0.404.0>,
                         {199513565,{205011,0}},
                         #Fun<couch_db_updater.10.19222179>,
                         #Fun<couch_db_updater.11.21515767>,
                         #Fun<couch_btree.5.112258129>,
                         #Fun<couch_db_updater.12.93888648>},
                     {btree,<0.404.0>,
                         {199518784,205011},
                         #Fun<couch_db_updater.13.40165027>,
                         #Fun<couch_db_updater.14.82810239>,
                         #Fun<couch_btree.5.112258129>,
                         #Fun<couch_db_updater.15.104121193>},
                     {btree,<0.404.0>,
                         {111685732,[]},
                         #Fun<couch_btree.0.23070627>,
                         #Fun<couch_btree.1.117278773>,
                         #Fun<couch_btree.2.112258129>,nil},
                     205013,
                     <<"database1">>,
                     "/data/couchdb/data/database1.couch",
                     [],[],nil,
                     {user_ctx,null,[],undefined},
                     nil,1000,
                     [before_header,after_header,on_file_open],
                     false}}]}}

[Thu, 08 Dec 2011 20:17:21 GMT] [error] [<0.407.0>] {error_report,<0.31.0>,
                      {<0.407.0>,crash_report,
                       [[{initial_call,{couch_db_updater,init,['Argument__1']}},
                         {pid,<0.407.0>},
                         {registered_name,[]},
                         {error_info,
                          {exit,
                           {timeout,
                            {gen_server,call,
                             [<0.406.0>,
                              {db_updated,
                               {db,<0.406.0>,<0.407.0>,nil,
                                <<"1323371423957954">>,<0.404.0>,<0.408.0>,
                                {db_header,5,205013,0,
                                 {199513565,{205011,0}},
                                 {199518784,205011},
                                 {111685732,[]},
                                 0,nil,nil,1000},
                                205013,
                                {btree,<0.404.0>,
                                 {199513565,{205011,0}},
                                 #Fun<couch_db_updater.10.19222179>,
                                 #Fun<couch_db_updater.11.21515767>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.12.93888648>},
                                {btree,<0.404.0>,
                                 {199518784,205011},
                                 #Fun<couch_db_updater.13.40165027>,
                                 #Fun<couch_db_updater.14.82810239>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.15.104121193>},
                                {btree,<0.404.0>,
                                 {111685732,[]},
                                 #Fun<couch_btree.0.23070627>,
                                 #Fun<couch_btree.1.117278773>,
                                 #Fun<couch_btree.2.112258129>,nil},
                                205013,
                                <<"database1">>,
                                "/data/couchdb/data/database1.couch",
                                [],[],nil,
                                {user_ctx,null,[],undefined},
                                nil,1000,
                                [before_header,after_header,on_file_open],
                                false}}]}},
                           [{gen_server,terminate,6},
                            {proc_lib,init_p_do_apply,3}]}},
                         {ancestors,[<0.406.0>,<0.403.0>]},
                         {messages,[{'EXIT',<0.406.0>,shutdown}]},
                         {links,[]},
                         {dictionary,[]},
                         {trap_exit,true},
                         {status,running},
                         {heap_size,28657},
                         {stack_size,24},
                         {reductions,4487709}],
                        []]}}
[Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] ** Generic server <0.178.0> terminating
** Last message in was {update_docs,<0.2027.0>,
                            [[{doc,<<"55e776b94547442ab17b82bd1a059843">>,
                                  {1,
                                   [<<102,77,172,235,192,72,84,223,58,68,105,
                                      199,153,147,196,81>>]},
                                  {[{<<"host">>,<<"Host1">>},
                                    {<<"time">>,1323375464000},
                                    {<<"text">>,
                                     <<"Some text">>},
                                    {<<"level">>,0},
                                    {<<"source">>,<<"source1">>},
                                    {<<"type">>,<<"Entry1">>}]},
                                  [],false,[]}],

...
[[A BUNCH OF DOCS HERE]]
...


                                  {[{<<"host">>,<<"Host1">>},
                                    {<<"time">>,1323375467000},
                                    {<<"text">>,
                                     <<"Some text">>},
                                    {<<"level">>,0},
                                    {<<"source">>,<<"source1">>},
                                    {<<"type">>,<<"Entry1">>}]},
                                  [],false,[]}]],
                            [],false,false}
** When Server state == {db,<0.177.0>,<0.178.0>,nil,<<"1323371411352029">>,
                             <0.175.0>,<0.179.0>,
                             {db_header,5,13636863,0,
                                 {6776455960,{13636861,0}},
                                 {6776479023,13636861},
                                 {1039786,[]},
                                 0,nil,nil,1000},
                             13636863,
                             {btree,<0.175.0>,
                                 {6776455960,{13636861,0}},
                                 #Fun<couch_db_updater.10.19222179>,
                                 #Fun<couch_db_updater.11.21515767>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.12.93888648>},
                             {btree,<0.175.0>,
                                 {6776479023,13636861},
                                 #Fun<couch_db_updater.13.40165027>,
                                 #Fun<couch_db_updater.14.82810239>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.15.104121193>},
                             {btree,<0.175.0>,
                                 {1039786,[]},
                                 #Fun<couch_btree.0.23070627>,
                                 #Fun<couch_btree.1.117278773>,
                                 #Fun<couch_btree.2.112258129>,nil},
                             13636863,
                             <<"database2">>,
                             "/data/couchdb/data/database2.couch",
                             [],[],nil,
                             {user_ctx,null,[],undefined},
                             nil,1000,
                             [before_header,after_header,on_file_open],
                             false}
** Reason for termination ==
** {timeout,
        {gen_server,call,
            [<0.177.0>,
             {db_updated,
                 {db,<0.177.0>,<0.178.0>,nil,<<"1323371411352029">>,<0.175.0>,
                     <0.179.0>,
                     {db_header,5,13636863,0,
                         {6776455960,{13636861,0}},
                         {6776479023,13636861},
                         {1039786,[]},
                         0,nil,nil,1000},
                     13636863,
                     {btree,<0.175.0>,
                         {6776557909,{13637061,0}},
                         #Fun<couch_db_updater.10.19222179>,
                         #Fun<couch_db_updater.11.21515767>,
                         #Fun<couch_btree.5.112258129>,
                         #Fun<couch_db_updater.12.93888648>},
                     {btree,<0.175.0>,
                         {6776580448,13637061},
                         #Fun<couch_db_updater.13.40165027>,
                         #Fun<couch_db_updater.14.82810239>,
                         #Fun<couch_btree.5.112258129>,
                         #Fun<couch_db_updater.15.104121193>},
                     {btree,<0.175.0>,
                         {1039786,[]},
                         #Fun<couch_btree.0.23070627>,
                         #Fun<couch_btree.1.117278773>,
                         #Fun<couch_btree.2.112258129>,nil},
                     13637063,
                     <<"database2">>,
                     "/data/couchdb/data/database2.couch",
                     [],[],nil,
                     {user_ctx,null,[],undefined},
                     #Ref<0.0.30.133811>,1000,
                     [before_header,after_header,on_file_open],
                     false}}]}}

[Thu, 08 Dec 2011 20:17:22 GMT] [error] [<0.178.0>] {error_report,<0.31.0>,
                      {<0.178.0>,crash_report,
                       [[{initial_call,{couch_db_updater,init,['Argument__1']}},
                         {pid,<0.178.0>},
                         {registered_name,[]},
                         {error_info,
                          {exit,
                           {timeout,
                            {gen_server,call,
                             [<0.177.0>,
                              {db_updated,
                               {db,<0.177.0>,<0.178.0>,nil,
                                <<"1323371411352029">>,<0.175.0>,<0.179.0>,
                                {db_header,5,13636863,0,
                                 {6776455960,{13636861,0}},
                                 {6776479023,13636861},
                                 {1039786,[]},
                                 0,nil,nil,1000},
                                13636863,
                                {btree,<0.175.0>,
                                 {6776557909,{13637061,0}},
                                 #Fun<couch_db_updater.10.19222179>,
                                 #Fun<couch_db_updater.11.21515767>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.12.93888648>},
                                {btree,<0.175.0>,
                                 {6776580448,13637061},
                                 #Fun<couch_db_updater.13.40165027>,
                                 #Fun<couch_db_updater.14.82810239>,
                                 #Fun<couch_btree.5.112258129>,
                                 #Fun<couch_db_updater.15.104121193>},
                                {btree,<0.175.0>,
                                 {1039786,[]},
                                 #Fun<couch_btree.0.23070627>,
                                 #Fun<couch_btree.1.117278773>,
                                 #Fun<couch_btree.2.112258129>,nil},
                                13637063,
                                <<"database2">>,
                                "/data/couchdb/data/database2.couch",
                                [],[],nil,
                                {user_ctx,null,[],undefined},
                                #Ref<0.0.30.133811>,1000,
                                [before_header,after_header,on_file_open],
                                false}}]}},
                           [{gen_server,terminate,6},
                            {proc_lib,init_p_do_apply,3}]}},
                         {ancestors,[<0.177.0>,<0.174.0>]},
                         {messages,
                          [{'EXIT',<0.177.0>,shutdown},delayed_commit]},
                         {links,[]},
                         {dictionary,[]},
                         {trap_exit,true},
                         {status,running},
                         {heap_size,121393},
                         {stack_size,24},
                         {reductions,83311172}],
                        []]}}


