uschtwill opened a new issue #859: Various error messages and breakage
URL: https://github.com/apache/couchdb/issues/859
 
 
   I do realise this is (probably) not a bug report, but I had to write down 
all of this somewhere, right?
   
   And it would be nice to document explanations for the various error messages and their root cause(s) somewhere (assuming someone is able to provide them), so that other people who hit these errors in the future have more to go on than I do now.
   
   Please let me know if I can be more precise, add additional information or 
do anything else to improve this inquiry.
   
   ## Context
   We are using CouchDB to store documents which are read by Logstash and its [couchdb_changes plugin](https://www.elastic.co/guide/en/logstash/current/plugins-inputs-couchdb_changes.html), which then modifies them and pushes them downstream for further processing. 
   
   We have 3 single-node instances, which all exhibit the behaviour described below: prod and staging hold around 10M documents each, and a dev instance just a couple of thousand. All instances run as Docker containers on a single Docker host which is also running a lot of other containers (80+). 
   
   ## Current Behavior
   We are getting the following errors, which I don't know how to interpret. I have yet to reproduce them properly and predictably, and I am not sure what all of the consequences are or what the root cause is. So I am pretty much lost.
   
   The error that concerns me the most is:
   ```
   [error] 2017-10-02T15:14:56.440179Z nonode@nohost <0.23835.5135> 1c9aabe372 
req_err(527034634) timeout : nil
     
[{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]
   [error] 2017-10-02T15:14:56.440653Z nonode@nohost <0.23835.5135> 1c9aabe372 
Response abnormally terminated: 
{timeout,nil,[{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]}
   ```
   This one makes Logstash reset the file it uses to track which update sequence it is at. That is actually how I noticed the problem: Logstash jumped back a couple of million updates. Ouch.
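   
   For context, a minimal sketch of the plugin input (host, database name and path are placeholders, not our actual values); the plugin persists its position in the file pointed to by `sequence_path`, and that is the file that gets reset:
   ```
   input {
     couchdb_changes {
       host => "couchdb.example.com"   # placeholder host
       db   => "docs"                  # placeholder database name
       # file in which the plugin persists the last processed update_seq;
       # this is what gets reset when the changes feed dies
       sequence_path => "/var/lib/logstash/.couchdb_seq"
     }
   }
   ```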
   
   Another error that regularly pops up around the former one (in batches of up to 10 or so) is:
   ```
   [error] 2017-10-02T12:58:21.722899Z nonode@nohost <0.18748.5131> 1c9aabe372 
rexi_server exit:timeout 
[{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]
   ```
   ... or as a shorter variation:
   ```
   [error] 2017-10-01T09:55:45.098141Z nonode@nohost <0.26324.4873> 1bc0992f85 
rexi_server exit:timeout 
[{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{couch_mrview,query_view,5,[{file,"src/couch_mrview.erl"},{line,255}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
   ```
   This one also occurs in batches:
   ```
   [error] 2017-10-02T13:50:21.375009Z nonode@nohost <0.17003.82> -------- OS 
Process Error <0.26804.77> :: {os_process_error,"OS process timed out."}
   ```
   and so, occasionally, does this one:
   ```
   [error] 2017-10-02T13:48:12.959135Z nonode@nohost emulator -------- Error in 
process <0.10836.82> with exit value: {{nocatch,{os_process_error,"OS process 
timed 
out."}},[{couch_os_process,prompt,2,[{file,"src/couch_os_process.erl"},{line,59}]},{couch_query_servers,map_doc_raw,2,[{file,"src/couch_query_servers.erl"},{line,67}]},{couch_mrview_updater...
 
     [<<"couch_mrview_util:get_view/4 L56">>,<<"couch_mrview:query_view/6 
L244">>,<<"rexi_server:init_p/3 L139">>]
   ```
   In the logs of the dev env I also found:
   ```
   [error] 2017-10-02T13:49:15.968875Z nonode@nohost <0.11410.82> a95cb34482 
req_err(476198327) unknown_error : function_clause
     [<<"couch_stats_httpd:transform_stats/2 
L34">>,<<"chttpd_misc:handle_node_req/1 L289">>,<<"chttpd:process_request/1 
L293">>,<<"chttpd:handle_request_int/1 L229">>,<<"mochiweb_http:headers/6 
L122">>,<<"proc_lib:init_p_do_apply/3 L237">>]
   ```
   and:
   ```
   [error] 2017-10-02T13:48:59.581334Z nonode@nohost <0.11410.82> 3ee8c49d37 
req_err(476198327) unknown_error : function_clause
   ```
   or crash reports like this:
   ```
   [error] 2017-10-02T13:48:54.516307Z nonode@nohost <0.11488.82> -------- 
gen_server couch_compaction_daemon terminated with reason: 
{compaction_loop_died,{timeout,{gen_server,call,[<0.262.0>,start_compact]}}}
     last msg: 
{'EXIT',<0.11672.82>,{timeout,{gen_server,call,[<0.262.0>,start_compact]}}}
     state: {state,<0.11672.82>,[<<"shards/80000000-9fffffff/db.1506787408">>]}
   [error] 2017-10-02T13:48:05.838229Z nonode@nohost <0.11650.82> -------- 
CRASH REPORT Process  (<0.11650.82>) with 0 neighbors exited with reason: 
{timeout,{gen_server,call,[couch_httpd_vhost,get_state]}} at 
gen_server:call/2(line:182) <= couch_httpd_vhost:dispatch_host/1(line:102) <= 
chttpd:handle_request/1(line:142) <= mochiweb_http:headers/6(line:122) <= 
proc_lib:init_p_do_apply/3(line:237); initial_call: 
{mochiweb_acceptor,init,['Argument__1','Argument__2',...]}, ancestors: 
[chttpd,chttpd_sup,<0.355.0>], messages: 
[{#Ref<0.0.71.43291>,{vhosts_state,[],["_utils","_...",...],...}}], links: 
[<0.358.0>,#Port<0.64807>], dictionary: [{couch_rewrite_count,0}], trap_exit: 
false, status: running, heap_size: 1598, stack_size: 27, reductions: 1126
   ```
   
   ## Possible Solution
   When we first got this error (the very top one) a while ago, I set 
   ```
   [fabric]
   request_timeout = infinity
   ```
   for all of the instances. The error messages (the first two from above) stopped, and I thought I had solved the problem. But yesterday things started breaking again: the resets happened on staging and prod, and all mayhem broke loose on the environment I was currently developing against.
   
   From what I have read so far, my plan right now would be to significantly increase `os_process_timeout` and `os_process_limit` (they're still at their default values, 5000 and 25), although I am not entirely sure what is happening in the first place. There seem to be so many things breaking at the same time (across environments) that I suspect increasing those settings might make the symptoms go away without addressing the root cause. From my googling I suspect something to do with ulimits/file descriptors/open files, but I am a bit out of my depth there.
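   
   If I do go down that route, I assume the change would look roughly like this in `local.ini` (section names as per the docs; the values are guesses on my part, not recommendations):
   ```
   [couchdb]
   ; default is 5000 (ms); give couchjs more time to answer each prompt
   os_process_timeout = 30000
   
   [query_server_config]
   ; default is 25; allow more concurrent couchjs processes
   os_process_limit = 100
   ```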
   
   And from inside one of the containers everything looks fine:
   ```
   root@e439ab2acd96:/opt/couchdb# ulimit -a
   core file size          (blocks, -c) 0
   data seg size           (kbytes, -d) unlimited
   scheduling priority             (-e) 0
   file size               (blocks, -f) unlimited
   pending signals                 (-i) 128197
   max locked memory       (kbytes, -l) 64
   max memory size         (kbytes, -m) unlimited
   open files                      (-n) 524288
   pipe size            (512 bytes, -p) 8
   POSIX message queues     (bytes, -q) 819200
   real-time priority              (-r) 0
   stack size              (kbytes, -s) 8192
   cpu time               (seconds, -t) unlimited
   max user processes              (-u) 524288
   virtual memory          (kbytes, -v) unlimited
   file locks                      (-x) unlimited
   root@e439ab2acd96:/opt/couchdb# cat /proc/sys/fs/file-nr
   35808   0       3274312
   ```
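   
   To see how close the node actually gets to that limit, I can count the descriptors held by the Erlang VM inside the container (assuming the VM process shows up as `beam.smp`):
   ```
   root@e439ab2acd96:/opt/couchdb# ls /proc/$(pidof beam.smp)/fd | wc -l
   ```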
   
   On the other hand, I just realised that on the host the limit for open files seems ridiculously low, although that hasn't been an issue so far. Maybe that's it already? Anyway, I would like to not dirty the crime scene, so I will leave it as is for now.
   ```
   root@vps:~# ulimit -n
   1024
   ```
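   
   If that limit does turn out to be the culprit, my understanding is that it could be raised for the CouchDB containers alone (without touching the host-wide defaults) via Docker's `--ulimit` flag, roughly:
   ```
   # soft:hard limit for open files, applied to this container only;
   # "..." stands for the rest of our usual run options
   docker run --ulimit nofile=65536:65536 ... klaemo/couchdb:2.0.0
   ```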
   
   ANY pointers are welcome...
   
   ## Your Environment
   * Version used: Docker klaemo/couchdb:2.0.0
   * Operating System and version: Ubuntu 14.04
   * Dockerhost: 32GB RAM, 12 Core, 500GB SSD VPS
   
 