After a couple more weeks of testing, as a hopefully helpful follow-up for anyone who finds this thread later: shutting down the replication has only increased the time between the random CouchDB hangs, it has not eliminated them. The characteristics are the same as before: everything works perfectly, without any sign of errors, until the moment the node starts freezing, again with nothing logged. If anybody has suggestions about what I can try, they are welcome... (For reference, the replication we keep turning on and off is the plain continuous setup sketched at the very bottom of this message.)
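For what it's worth, the freeze is easy to see from the outside with a simple polling loop against the /_up endpoint; a minimal sketch follows (host, port, timeout and log file are placeholders, not our exact monitoring):

    # Poll the node once a minute: a healthy node answers /_up almost
    # instantly, a frozen one makes curl run into the 10-second timeout.
    while true; do
      if ! curl -sf -m 10 http://127.0.0.1:5984/_up > /dev/null; then
        echo "$(date) couchdb did not answer /_up within 10s" >> couch_hang.log
      fi
      sleep 60
    done
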
On Fri, May 3, 2019 at 13:58, f z <fuzzyillo...@gmail.com> wrote:

> As a possibly helpful follow-up to the previous problem, for people who may
> find this post in the future: after many tests, it seems that the problem I
> mentioned is related to the replication we set up, as I suspected in the
> original post.
> If the replication is turned off, the crash does not occur, at least across
> weeks of testing, while if it is turned on it does (replication works
> perfectly until the moment the database freezes, and after a reboot it
> starts working fine again).
> While re-reading the documentation we found a section we had missed earlier:
> CouchDB can have issues with continuous replication when it sits behind an
> Apache reverse proxy, and that is the case here, so maybe that is the
> culprit, although it is not clear how a caching problem could lead to the
> database freezing without any error messages.
>
> We will now test with replication enabled and a different reverse proxy in
> front of the database, to confirm whether that is the cause.
>
> On Tue, Apr 23, 2019 at 12:43, f z <fuzzyillo...@gmail.com> wrote:
>
>> Hello, we are developing a new application based on a CouchDB database,
>> and we have run into some strange problems.
>> We are using CouchDB 2.3.0 in a Docker installation based on the official
>> images.
>> During the development cycle everything worked fine with simulated
>> workloads comparable to the final usage.
>> When we installed the application on two early production machines, with
>> a workload much lighter than the development one and on more powerful
>> dedicated servers, both CouchDB instances started behaving erratically
>> during the night, while no user was connected, one of them freezing
>> completely: nothing is logged from the moment it froze, the process still
>> seems to be running, but any call made with curl to CouchDB returns,
>> after a lengthy timeout, an error like the following:
>>
>> {"error":"case_clause","reason":"{timeout,{[{{shard,<<\"shards/00000000-1fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [0,536870911],\n #Ref<0.0.2097153.224211>,[]},\n nil},\n
>> {{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [536870912,1073741823],\n #Ref<0.0.2097153.224212>,[]},\n 27988},\n
>> {{shard,<<\"shards/40000000-5fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [1073741824,1610612735],\n #Ref<0.0.2097153.224213>,[]},\n nil},\n
>> {{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [1610612736,2147483647],\n #Ref<0.0.2097153.224214>,[]},\n 28553},\n
>> {{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [2147483648,2684354559],\n #Ref<0.0.2097153.224215>,[]},\n 27969},\n
>> {{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [2684354560,3221225471],\n #Ref<0.0.2097153.224216>,[]},\n 28502},\n
>> {{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [3221225472,3758096383],\n #Ref<0.0.2097153.224217>,[]},\n 28633},\n
>> {{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [3758096384,4294967295],\n #Ref<0.0.2097153.224218>,[]},\n 28010}],\n
>> [{{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [3758096384,4294967295],\n #Ref<0.0.2097153.224218>,[]},\n 0},\n
>> {{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [3221225472,3758096383],\n #Ref<0.0.2097153.224217>,[]},\n 0},\n
>> {{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [2684354560,3221225471],\n #Ref<0.0.2097153.224216>,[]},\n 0},\n
>> {{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [2147483648,2684354559],\n #Ref<0.0.2097153.224215>,[]},\n 0},\n
>> {{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [1610612736,2147483647],\n #Ref<0.0.2097153.224214>,[]},\n 0},\n
>> {{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n nonode@nohost,<<\"mastercom\">>,\n [536870912,1073741823],\n #Ref<0.0.2097153.224212>,[]},\n 0}],\n
>> [[{db_name,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,371},\n {doc_del_count,0},\n {update_seq,28010},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1282636},{external,109712},{file,1298647}]}},\n {disk_size,1298647},\n {data_size,1282636},\n {other,{[{data_size,109712}]}},\n
>> {instance_start_time,<<\"1555521000894744\">>},\n {disk_format_version,7},\n {committed_update_seq,28010},\n {compacted_seq,28010},\n {uuid,<<\"c3f60d2791bf5c3de01063af95f255b1\">>}],\n
>> [{db_name,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,379},\n {doc_del_count,1},\n {update_seq,28633},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1314436},{external,117751},{file,1331415}]}},\n {disk_size,1331415},\n {data_size,1314436},\n {other,{[{data_size,117751}]}},\n
>> {instance_start_time,<<\"1555521000894769\">>},\n {disk_format_version,7},\n {committed_update_seq,28633},\n {compacted_seq,28633},\n {uuid,<<\"740de6d1e5535ba64e43bc3ead349e84\">>}],\n
>> [{db_name,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,378},\n {doc_del_count,0},\n {update_seq,28502},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1308584},{external,116630},{file,1413335}]}},\n {disk_size,1413335},\n {data_size,1308584},\n {other,{[{data_size,116630}]}},\n
>> {instance_start_time,<<\"1555521000894704\">>},\n {disk_format_version,7},\n {committed_update_seq,28502},\n {compacted_seq,28502},\n {uuid,<<\"d6acfc8cdf0a172df3d8db22df1a4b68\">>}],\n
>> [{db_name,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,373},\n {doc_del_count,0},\n {update_seq,27969},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1290174},{external,125829},{file,1335511}]}},\n {disk_size,1335511},\n {data_size,1290174},\n {other,{[{data_size,125829}]}},\n
>> {instance_start_time,<<\"1555521000894811\">>},\n {disk_format_version,7},\n {committed_update_seq,27969},\n {compacted_seq,27969},\n {uuid,<<\"6735a0b95e5ebdc3c5dfe08b4c81a108\">>}],\n
>> [{db_name,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,379},\n {doc_del_count,0},\n {update_seq,28553},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1312728},{external,123640},{file,1327319}]}},\n {disk_size,1327319},\n {data_size,1312728},\n {other,{[{data_size,123640}]}},\n
>> {instance_start_time,<<\"1555521000894765\">>},\n {disk_format_version,7},\n {committed_update_seq,28553},\n {compacted_seq,28553},\n {uuid,<<\"4086cc00f08dea80fd962808cf9e37bf\">>}],\n
>> [{db_name,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>},\n {engine,couch_bt_engine},\n {doc_count,372},\n {doc_del_count,0},\n {update_seq,27988},\n {purge_seq,0},\n {compact_running,false},\n
>> {sizes,{[{active,1287520},{external,121646},{file,1302743}]}},\n {disk_size,1302743},\n {data_size,1287520},\n {other,{[{data_size,121646}]}},\n
>> {instance_start_time,<<\"1555521000894678\">>},\n {disk_format_version,7},\n {committed_update_seq,27988},\n {compacted_seq,27988},\n {uuid,<<\"c1327baaacb6ac82b71d2c540648e82d\">>}],\n
>> {cluster,[{q,8},{n,1},{w,1},{r,1}]}]}}","ref":3845389673}
>>
>> On the other server, one of the databases kept answering requests
>> regularly, while the other hung with messages like the one above. Logs
>> were normal for the answering database and non-existent for the hung one.
>> While we were trying to access some general-level stats, the still-working
>> one froze too.
>> Only a hard restart of the Docker containers got them working again. No
>> data was apparently corrupted or otherwise problematic.
>> The day after, it happened again, although this time both were completely
>> frozen when we found them.
>> The entire database held only about 2k fairly small documents, with no
>> views and only a couple of indexes for the Mango queries. The database
>> that hung on its own on the second server the first time was completely
>> empty.
>> The main difference between the development environment and the crashing
>> one is that the latter also has an unfiltered bidirectional replication
>> set up, in case that is related to the problem; but up to the moment of
>> the crash the replication works perfectly, without errors or any strange
>> messages.
>>
>> I realize this may be insufficient information to understand the cause,
>> but the problem is that it happens unpredictably, it never happened in
>> months in the development environment, the parameters we checked are all
>> the same, and I have no idea where to look for signs of what the trouble
>> may be.
>>
>> What can I check? Which parameters might be responsible for such a
>> problem?
>>
>> Thanks in advance to everybody...
>
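For context on the replication mentioned in the quoted post (and turned on and off in the tests above): it is a plain continuous, unfiltered pair of replications, one _replicator document per direction, roughly like the sketch below. Hosts, credentials and document IDs are placeholders; mastercom is the database name that appears in the error above.

    # One _replicator document per direction; no filter, just continuous mode.
    curl -X PUT http://admin:password@server-a:5984/_replicator/mastercom_a_to_b \
      -H 'Content-Type: application/json' \
      -d '{"source":"http://admin:password@server-a:5984/mastercom",
           "target":"http://admin:password@server-b:5984/mastercom",
           "continuous":true}'

    curl -X PUT http://admin:password@server-b:5984/_replicator/mastercom_b_to_a \
      -H 'Content-Type: application/json' \
      -d '{"source":"http://admin:password@server-b:5984/mastercom",
           "target":"http://admin:password@server-a:5984/mastercom",
           "continuous":true}'

Each server both pushes and pulls, which is what "bidirectional" refers to above.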