We run a CouchDB 2.3.1 cluster spread across three geographically separated zones under reasonably high load. During a recent WAN outage, CouchDB started logging the errors below and stopped responding to basic HTTP 'GET /' health check requests. Couch was still logging the GETs, just not responding to them.

[error] 2020-08-05T17:53:31.047147Z [email protected] <0.4507.4745> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-ffffffff/account/c081d16d81ef4c59a25c9b26243dabcc.1591504717">>
[error] 2020-08-05T17:53:31.047195Z [email protected] <0.4507.4745> -------- Error checking security objects for account/c0/81/d16d81ef4c59a25c9b26243d77ab :: {error,timeout}
[error] 2020-08-05T17:53:44.576147Z [email protected] <0.16829.4647> -------- fabric_worker_timeout get_all_security,'[email protected]',<<"shards/00000000-ffffffff/accounts.1591508198">>
[error] 2020-08-05T17:53:44.576187Z [email protected] <0.16829.4647> -------- Error checking security objects for accounts :: {error,timeout}
[error] 2020-08-05T17:54:19.009381Z [email protected] <0.257.0> -------- gen_server couch_compaction_daemon terminated with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}}
[error] 2020-08-05T17:54:19.009543Z [email protected] <0.257.0> -------- CRASH REPORT Process couch_compaction_daemon (<0.257.0>) with 0 neighbors exited with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} at gen_server:terminate/7(line:812) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_compaction_daemon,init,['Argument__1']}, ancestors: [couch_secondary_services,couch_sup,<0.211.0>], messages: [], links: [<0.220.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 27, reductions: 3722
[error] 2020-08-05T17:54:19.009943Z [email protected] <0.220.0> -------- Supervisor couch_secondary_services had child compaction_daemon started with couch_compaction_daemon:start_link() at <0.257.0> exit with reason {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} in context child_terminated
[error] 2020-08-05T17:54:24.010298Z [email protected] <0.7494.4646> -------- gen_server couch_compaction_daemon terminated with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}}

These were followed by lots of fabric_worker_timeout log entries for any Couch queries.

We're not sure what to make of the security object errors. We also don't have autocompaction enabled on these CouchDB servers during these hours, and according to the logs all compaction had completed by 07:00, so the couch_compaction_daemon errors were as much of a surprise as the non-responses to the GETs.

Our current cluster settings are:

[cluster]
q=1
r=1
w=1
placement = db-zone-a:1,db-zone-b:1,db-zone-c:1

We're trying to figure out why the local zone basically froze on us. Everything recovered right away once the network connections came back.

Any ideas?

Thanks in advance,
Howard Hart
Ooma, Inc.
