Hi Hoël, thanks for sharing.
To me, it doesn't seem to be the same problem, as we use a single HDD (WD
Black) for this purpose and it is not throttled by anything. In fact, I/O
usage increases over time until it reaches unworkable values. To summarize,
CouchDB is not sharing I/O resources with any other service.
I'd like to provide more information, so here is our CouchDB configuration
(as returned by the _node/_local/_config endpoint):
{
"smoosh.slack_dbs": {
"from": "00:00",
"min_priority": "5368709120",
"strict_window": "true",
"to": "08:00"
},
"uuids": {
"algorithm": "sequential",
"max_count": "1000"
},
"cluster": {
"n": "3",
"q": "2"
},
"cors": {
"credentials": "false"
},
"chttpd": {
"backlog": "512",
"bind_address": "any",
"max_db_number_for_dbs_info_req": "100",
"port": "5984",
"prefer_minimal": "Cache-Control, Content-Length, Content-Range, Content-Type, ETag, Server, Transfer-Encoding, Vary",
"require_valid_user": "false",
"server_options": "[{recbuf, undefined}]",
"socket_options": "[{sndbuf, 262144}, {nodelay, true}]"
},
"attachments": {
"compressible_types": "text/*, application/javascript, application/json, application/xml",
"compression_level": "8"
},
"admins": {
"admin": "-pbkdf2-f0aacb520f0bfc4aa90c30045fac25acd431c4ac,7adbc3635ee33dec5e2f38ada9e6bf55,10"
},
"query_server_config": {
"os_process_limit": "100",
"reduce_limit": "true"
},
"vendor": {
"name": "The Apache Software Foundation"
},
"smoosh.ratio_views": {
"from": "00:00",
"min_priority": "5.0",
"strict_window": "true",
"to": "08:00"
},
"feature_flags": {
"partitioned||*": "true"
},
"replicator": {
"connection_timeout": "30000",
"http_connections": "20",
"interval": "60000",
"max_churn": "20",
"max_jobs": "500",
"retries_per_request": "5",
"socket_options": "[{keepalive, true}, {nodelay, false}]",
"ssl_certificate_max_depth": "3",
"startup_jitter": "5000",
"verify_ssl_certificates": "false",
"worker_batch_size": "500",
"worker_processes": "1"
},
"ssl": {
"port": "6984"
},
"smoosh.slack_views": {
"from": "00:00",
"min_priority": "3758096384",
"strict_window": "true",
"to": "08:00"
},
"log": {
"file": "/opt/couchdb/var/log/couch.log",
"level": "info",
"write_buffer": "1048576",
"write_delay": "5000",
"writer": "file"
},
"indexers": {
"couch_mrview": "true"
},
"couch_peruser": {
"database_prefix": "userdb-",
"delete_dbs": "false",
"enable": "false"
},
"httpd": {
"allow_jsonp": "false",
"authentication_handlers": "{couch_httpd_auth, cookie_authentication_handler}, {couch_httpd_auth, default_authentication_handler}",
"bind_address": "any",
"enable_cors": "false",
"enable_xframe_options": "false",
"max_http_request_size": "4294967296",
"port": "5986",
"secure_rewrites": "true",
"socket_options": "[{sndbuf, 262144}]"
},
"ioq.bypass": {
"compaction": "false",
"os_process": "true",
"read": "true",
"shard_sync": "false",
"view_update": "true",
"write": "true"
},
"ioq": {
"concurrency": "10",
"ratio": "0.01"
},
"smoosh.ratio_dbs": {
"from": "00:00",
"min_priority": "5.0",
"strict_window": "true",
"to": "08:00"
},
"csp": {
"enable": "true"
},
"couch_httpd_auth": {
"allow_persistent_cookies": "true",
"auth_cache_size": "50",
"authentication_db": "_users",
"authentication_redirect": "/_utils/session.html",
"iterations": "10",
"require_valid_user": "false",
"secret": "d6fbdfb5c21b756c94abd8fb0be54d17",
"timeout": "600"
},
"couchdb_engines": {
"couch": "couch_bt_engine"
},
"couchdb": {
"attachment_stream_buffer_size": "4096",
"changes_doc_ids_optimization_threshold": "100",
"database_dir": "./data",
"default_engine": "couch",
"default_security": "admin_only",
"file_compression": "snappy",
"max_dbs_open": "500",
"max_document_size": "8000000",
"os_process_timeout": "5000",
"users_db_security_editable": "false",
"uuid": "5b2a3feb760f43fd811abf568eb1f6b2",
"view_index_dir": "./data"
}
}
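One part of that dump that seems worth highlighting for this I/O discussion is the "ioq.bypass" section. As I understand the ioq subsystem, any class set to "true" skips the IO queue entirely, so with our settings only compaction and shard_sync are actually subject to the ioq concurrency/ratio limits. A tiny sketch to make that explicit (the dict is just the "ioq.bypass" section copied from the dump above; the helper function is my own, not part of CouchDB):

```python
# Sketch: summarize which IO classes bypass CouchDB's IO queue (ioq),
# based on the "ioq.bypass" section of the config dump above.
# Classes set to "true" skip the queue; only the remaining classes
# are throttled by ioq's concurrency/ratio settings.

IOQ_BYPASS = {
    "compaction": "false",
    "os_process": "true",
    "read": "true",
    "shard_sync": "false",
    "view_update": "true",
    "write": "true",
}

def split_bypass(bypass):
    """Return (bypassed, queued) IO class names, each sorted."""
    bypassed = sorted(k for k, v in bypass.items() if v == "true")
    queued = sorted(k for k, v in bypass.items() if v == "false")
    return bypassed, queued

if __name__ == "__main__":
    bypassed, queued = split_bypass(IOQ_BYPASS)
    print("bypassing ioq:", ", ".join(bypassed))
    print("queued by ioq:", ", ".join(queued))
```

So if ioq were the culprit, I would only expect it to affect compaction and shard sync traffic in our setup.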
Also, I've noticed that attachments are not delivered through the mailing
list, so here is a link to the image I attached to my first email:
https://ibb.co/L9mvZRB
Thanks again!
On Tue, Apr 5, 2022 at 4:36 AM Hoël Iris <[email protected]> wrote:
> My configuration is different (a lot of small DBs) but I had disk I/O
> performance issues too when upgrading from CouchDB 2 to CouchDB 3.
> Maybe it's related, maybe it's not.
> I use AWS; the solution for me was to increase the AWS disk IOPS.
>
> See the full discussion here:
> https://github.com/apache/couchdb/discussions/3217
>
>
> On Mon, Apr 4, 2022 at 6:22 PM Roberto Iglesias <[email protected]>
> wrote:
>
> > Hello.
> >
> > About a year ago, we had two CouchDB 2.3.1 instances running inside
> > Docker containers, pull-replicating from each other. This way, we could
> > read from and write to either server, although we generally chose one as
> > the "active" server and wrote to it. The second server acted as a spare
> > or backup.
> >
> > At that point (a year ago) we decided to migrate from CouchDB 2.3.1 to
> > 3.1.1. Instead of upgrading our existing databases, we added two extra
> > instances and configured pull replications on all of them until we got
> > the following scenario:
> > 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
> >
> > where <===> represents two pull replications, one configured on each
> side.
> > i.e: 2.3.1-A pulls from 2.3.1-B and vice versa.
> >
> > If a write is made at 2.3.1-A, it has to travel through all the servers
> > until it reaches 3.1.1-B.
> >
> > All of them have an exclusive HDD which is not shared with any other
> > service.
> >
> > We have never had a single problem with 2.3.1.
> >
> > After pointing our services to 3.1.1-*A*, read I/O wait times gradually
> > increased over weeks until they reached peaks of 600 ms (totally
> > unworkable). So we stopped sending write requests (HTTP POST) to it and
> > pointed all applications to 3.1.1-*B*. 3.1.1-*A* was still receiving
> > writes, but only through the replication protocol, as explained before.
> >
> > On the 3.1.1-*A* server, disk stats decreased to acceptable values, so a
> > few weeks later we pointed the applications back to it in order to
> > confirm whether the problem was related to the write requests sent from
> > our applications. Read I/O times did not increase this time. Instead,
> > 3.1.1-*B* (which had handled application traffic for a few weeks)
> > started to show the same behaviour, despite no longer handling requests
> > from the applications.
> >
> > It feels like some fragmentation is occurring, but the filesystem (ext4)
> > shows none.
> >
> > Some changes we've made since the problem started:
> >
> > - Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
> > - Upgraded Ubuntu from 18.04 to 20.04
> > - Deleted the _global_changes database from 3.1.1-A
> >
> >
> > More info:
> >
> > - CouchDB is using Docker local-persist
> > (https://github.com/MatchbookLab/local-persist) volumes.
> > - Disks are WD Purple for the 2.3.1 CouchDBs and WD Black for the
> > 3.1.1 CouchDBs.
> > - We have only one database of 88 GiB and two views: one of 22 GB and a
> > little one of 30 MB (updated very frequently).
> > - docker stats shows that CouchDB 3.1.1 uses a lot of memory compared
> > to 2.3.1:
> >   - 2.5 GiB for 3.1.1-A (not receiving direct write requests)
> >   - 5.0 GiB for 3.1.1-B (receiving both read and write requests)
> >   - 900 MiB for 2.3.1-A
> >   - 800 MiB for 2.3.1-B
> > - Database compaction runs at night. The problem only occurs during the
> > day, when most of the writes are made.
> > - Most of the config is default.
> > - A latency graph from Munin monitoring is attached (at the peak, there
> > is an outage of the server caused by a kernel upgrade that went wrong).
> >
> >
> > Any help is appreciated.
> >
> > --
> >
> > *Roberto E. Iglesias*
> >
>