Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

24/7 Observation for your CouchDB Instances:
https://opservatory.app

> On 4. Apr 2022, at 18:21, Roberto Iglesias <[email protected]> wrote:
> 
> Hello.
> 
> About 1 year ago, we had two CouchDB 2.3.1 instances running inside Docker 
> containers, pull-replicating from each other. This way, we could read from 
> and write to either of these servers, although we generally chose one as the 
> "active" server and wrote to it. The second server acted as a spare or 
> backup.
> 
> At this point (1y ago) we decided to migrate from CouchDB version 2.3.1 to 
> 3.1.1. Instead of upgrading our existing databases, we added two extra 
> instances and configured pull replications on all of them until we reached 
> the following scenario:
> 
> 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
> 
> where <===> represents two pull replications, one configured on each side, 
> i.e. 2.3.1-A pulls from 2.3.1-B and vice versa.
> 
> If a write is made at 2.3.1-A, it has to make it through all servers until it 
> reaches 3.1.1-B.
> 
> All of them have an exclusive HDD which is not shared with any other service.
> 
> We have not had a single problem with 2.3.1.
> 
> After pointing our services to 3.1.1-A, its Read I/O wait times gradually 
> increased over weeks until they reached peaks of 600ms (totally 
> unworkable). So we stopped making write requests (HTTP POST) to it and 
> pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes, but 
> only via the replication protocol, as I explained before.
> 
> At 3.1.1-A, disk stats decreased to acceptable values, so a few weeks 
> later we pointed applications back to it in order to confirm whether the 
> problem was related to write requests sent from our application or not. Read 
> I/O times did not increase this time. Instead, 3.1.1-B (which had handled 
> application traffic for a few weeks) started to show the same behaviour, 
> despite no longer handling requests from applications.
> 
> It feels like some fragmentation is occurring, but the filesystem (ext4) 
> shows none.
> 
> Some changes we've made since the problem started:
>       • Upgraded kernel from 4.15.0-55-generic to 5.4.0-88-generic
>       • Upgraded Ubuntu from 18.04 to 20.04
>       • Deleted the _global_changes database from 3.1.1-A
> 
> More info:
>       • CouchDB is using docker local-persist 
> (https://github.com/MatchbookLab/local-persist) volumes.
>       • Disks are WD Purple for the 2.3.1 couchdbs and WD Black for the 
> 3.1.1 couchdbs.
>       • We have only one database of 88GiB and 2 views: one of 22GB and a 
> small one of 30MB (frequently updated)
>       • docker stats shows that CouchDB 3.1.1 uses a lot of memory compared 
> to 2.3.1:
>       • 2.5GiB for couchdb3.1.1-A (not receiving direct write requests)
>       • 5.0GiB for couchdb3.1.1-B (receiving both read and write requests)
>       • 900MiB for 2.3.1-A
>       • 800MiB for 2.3.1-B
>       • Database compaction is run at night. The problem only occurs during 
> the day, when most of the writes are made.

Did you account for the completely rewritten compaction daemon (smoosh) that 
has a different configuration from the one in 2.x?

https://docs.couchdb.org/en/stable/maintenance/compaction.html#compact-auto

Otherwise you might see compaction going on at all times (which is what we 
usually recommend), rather than what you expect: just at night.

And in general, at this point, we strongly recommend running on SSDs for the 
obvious speed benefits :)

And finally: which Erlang version are you running? There are a few odd ones out 
there that might affect what you’re doing.
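
One quick way to check, assuming a 3.x node on the default port and your 
actual admin credentials substituted in, is the _versions endpoint:

```shell
# Ask the local node which Erlang/OTP it was built with (CouchDB 3.x).
# Replace admin:password with your real credentials.
curl -s http://admin:password@localhost:5984/_node/_local/_versions
```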

Best
Jan
—
>       • Most of the config is default.
>       • Latency graph from munin monitoring attached (at the peak, there is 
> an outage of the server caused by a kernel upgrade that went wrong)
> 
> Any help is appreciated.
> 
> -- 
> 
> Roberto E. Iglesias
