Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/
*24/7 Observation for your CouchDB Instances: https://opservatory.app

> On 4. Apr 2022, at 18:21, Roberto Iglesias <[email protected]> wrote:
>
> Hello.
>
> About a year ago, we had two CouchDB 2.3.1 instances running inside Docker
> containers and pull-replicating from each other. This way, we could read
> from and write to either of these servers, although we generally chose one
> as the "active" server and wrote to it. The second server would act as a
> spare or backup.
>
> At this point (1y ago) we decided to migrate from CouchDB version 2.3.1 to
> 3.1.1. Instead of upgrading our existing databases, we added two extra
> instances and configured pull replications on all of them until we reached
> the following scenario:
>
> 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
>
> where <===> represents two pull replications, one configured on each side,
> i.e. 2.3.1-A pulls from 2.3.1-B and vice versa.
>
> If a write is made at 2.3.1-A, it has to pass through all servers until it
> reaches 3.1.1-B.
>
> All of them have an exclusive HDD which is not shared with any other
> service.
>
> We have not had a single problem with 2.3.1.
>
> After pointing our services to 3.1.1-A, Read I/O wait times gradually
> increased over weeks until they reached peaks of 600ms (totally
> unworkable). So we stopped making write requests (HTTP POST) to it and
> pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes,
> but only via the replication protocol, as explained before.
>
> On the 3.1.1-A server, disk stats decreased to acceptable values, so a few
> weeks later we pointed applications back to it in order to confirm whether
> the problem was related to write requests sent from our application or
> not. Read I/O times did not increase this time. Instead, 3.1.1-B (which
> had handled application traffic for a few weeks) started to show the same
> behaviour, despite no longer handling requests from applications.
> It feels like some fragmentation is occurring, but the filesystem (ext4)
> shows none.
>
> Some changes we've made since the problem started:
> • Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
> • Upgraded Ubuntu from 18.04 to 20.04
> • Deleted the _global_changes database from couchdb3.1.1-A
>
> More info:
> • CouchDB is using docker local-persist
>   (https://github.com/MatchbookLab/local-persist) volumes.
> • Disks are WD Purple for the 2.3.1 CouchDBs and WD Black for the 3.1.1
>   CouchDBs.
> • We have only one database of 88GiB and 2 views: one of 22GB and a
>   little one of 30MB (highly updated).
> • docker stats shows that couchdb3.1.1 uses a lot of memory compared to
>   2.3.1:
>   • 2.5GiB for couchdb3.1.1-A (not receiving direct write requests)
>   • 5.0GiB for couchdb3.1.1-B (receiving both read and write requests)
>   • 900MiB for 2.3.1-A
>   • 800MiB for 2.3.1-B
> • Database compaction is run at night. The problem only occurs during the
>   day, when most of the writes are made.

Did you account for the completely rewritten compaction daemon (smoosh) that
has a different configuration from the one in 2.x?

https://docs.couchdb.org/en/stable/maintenance/compaction.html#compact-auto

Otherwise you might see compaction going on at all times (what we recommend,
usually), rather than what you expect: just at night.

And in general, at this point, we strongly recommend running on SSDs for the
obvious speed benefits :)

And finally: which Erlang version are you running? There are a few odd ones
out there that might affect what you’re doing.

Best
Jan
—

> • Most of the config is default.
> • A latency graph from munin monitoring is attached (at the peak, there
>   is an outage of the server caused by a kernel upgrade that went wrong)
>
> Any help is appreciated.
>
> --
>
> Roberto E. Iglesias
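[Editor's note: the smoosh daemon Jan mentions is configured via named channels, each with its own triggers and an optional activity window. A hedged sketch of a `local.ini` fragment that keeps only the ratio-based channels and confines them to a nightly window; the channel selection and times shown are illustrative placeholders, not the poster's configuration — see the linked documentation for the full set of channels and options.]

```ini
[smoosh]
; illustrative: run only the ratio-based channels (the defaults include more)
db_channels = ratio_dbs
view_channels = ratio_views

[smoosh.ratio_dbs]
min_priority = 2.0
; hypothetical nightly window, local server time
from = 01:00
to = 05:00
strict_window = true

[smoosh.ratio_views]
min_priority = 2.0
from = 01:00
to = 05:00
strict_window = true
```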
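[Editor's note: the pull replications described in the thread are configured by storing one document per direction in each instance's `_replicator` database, with the remote instance as `source` and the local database as `target`. A minimal sketch of building such a document in Python; the host name `couchdb-2.3.1-B` and database name `mydb` are placeholders, not names from the thread.]

```python
import json


def pull_replication_doc(source_url, target_db, continuous=True):
    """Build a _replicator document for a pull replication.

    For a *pull* replication the document lives on the pulling
    instance: `source` is the remote instance's database URL and
    `target` is the local database name.
    """
    return {
        "source": source_url,
        "target": target_db,
        "continuous": continuous,
    }


# One such document per direction, e.g. on 2.3.1-A pulling from 2.3.1-B:
doc = pull_replication_doc("http://couchdb-2.3.1-B:5984/mydb", "mydb")
print(json.dumps(doc, indent=2))
```

The reverse direction is a second document stored on the other instance with `source` and `target` swapped accordingly.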
