We run a CouchDB 2.3.1 cluster spanning three geographically separated zones 
under reasonably high load. During a recent WAN outage, CouchDB started 
logging the errors below and stopped responding to basic HTTP 'GET /' health 
check requests. It was still logging the GETs, just not answering them.
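
To make "logged but not answered" concrete, here is a minimal Python sketch of a health probe that separates a hung request from an outright failure. It is an illustration only, not the actual health checker in use here: the function name `probe`, the 2-second default timeout, and the host and port in the usage note are assumptions.

```python
import http.client
import socket


def probe(host, port, timeout=2.0):
    """Send one 'GET /' and classify the outcome as 'ok', 'timeout', or 'error'."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return "ok" if resp.status == 200 else "error"
    except (socket.timeout, TimeoutError):
        # The server accepted the connection but never sent a response.
        return "timeout"
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, or a malformed response.
        return "error"
    finally:
        conn.close()
```

Calling something like `probe("127.0.0.1", 5984)` against a node (5984 is CouchDB's default port; the address is a placeholder) would return "timeout" in the situation described above, where the request is accepted and logged but no reply ever arrives.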


[error] 2020-08-05T17:53:31.047147Z [email protected] <0.4507.4745> 
-------- fabric_worker_timeout 
get_all_security,'[email protected]',<<"shards/00000000-ffffffff/account/c081d16d81ef4c59a25c9b26243dabcc.1591504717">>
[error] 2020-08-05T17:53:31.047195Z [email protected] <0.4507.4745> 
-------- Error checking security objects for 
account/c0/81/d16d81ef4c59a25c9b26243d77ab :: {error,timeout}
[error] 2020-08-05T17:53:44.576147Z [email protected] <0.16829.4647> 
-------- fabric_worker_timeout 
get_all_security,'[email protected]',<<"shards/00000000-ffffffff/accounts.1591508198">>
[error] 2020-08-05T17:53:44.576187Z [email protected] <0.16829.4647> 
-------- Error checking security objects for accounts :: {error,timeout}

[error] 2020-08-05T17:54:19.009381Z [email protected] <0.257.0> -------- 
gen_server couch_compaction_daemon terminated with reason: 
{compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}}
[error] 2020-08-05T17:54:19.009543Z [email protected] <0.257.0> -------- 
CRASH REPORT Process couch_compaction_daemon (<0.257.0>) with 0 neighbors 
exited with reason: 
{compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} at 
gen_server:terminate/7(line:812) <= proc_lib:init_p_do_apply/3(line:247); 
initial_call: {couch_compaction_daemon,init,['Argument__1']}, ancestors: 
[couch_secondary_services,couch_sup,<0.211.0>], messages: [], links: 
[<0.220.0>], dictionary: [], trap_exit: true, status: running, heap_size: 1598, 
stack_size: 27, reductions: 3722
[error] 2020-08-05T17:54:19.009943Z [email protected] <0.220.0> -------- 
Supervisor couch_secondary_services had child compaction_daemon started with 
couch_compaction_daemon:start_link() at <0.257.0> exit with reason 
{compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} in 
context child_terminated
[error] 2020-08-05T17:54:24.010298Z [email protected] <0.7494.4646> 
-------- gen_server couch_compaction_daemon terminated with reason: 
{compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}}

These were followed by many fabric_worker_timeout log entries for every CouchDB 
query.

We're not sure what to make of the security object errors. The 
couch_compaction_daemon errors were also a surprise, as were the unanswered 
GETs: autocompaction is not enabled on these CouchDB servers during those 
hours, and according to the logs all compaction had completed by 07:00.

Our current cluster settings are:

[cluster]
q=1
r=1
w=1
placement = db-zone-a:1,db-zone-b:1,db-zone-c:1
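
As a sanity check on how that placement line reads, here is a minimal Python sketch that splits it into per-zone copy counts. `parse_placement` is a hypothetical helper, not part of CouchDB, and the reading that the per-zone counts add up to the total number of copies per shard is an assumption.

```python
def parse_placement(value):
    """Parse 'zone:count,zone:count' into a dict mapping zone name to copies."""
    zones = {}
    for item in value.split(","):
        # Split on the last ':' so the zone name is everything before the count.
        zone, _, count = item.strip().rpartition(":")
        zones[zone] = int(count)
    return zones


placement = parse_placement("db-zone-a:1,db-zone-b:1,db-zone-c:1")
total_copies = sum(placement.values())
print(placement, total_copies)
```

Under that reading, each shard has one copy in each of the three zones, so with q=1 every database is a single shard range stored once per zone.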

We're trying to figure out why the local zone basically froze on us. It all 
recovered right away after the network connections came back.

Any ideas?

Thanks in advance,
Howard Hart
Ooma, Inc.
