Hi Jérôme and Adam,

That's funny, because I'm investigating the exact same problem these days.

We have two CouchDB setups:
- a one-node server (q=2 n=1) with 5000 databases
- a 3-node cluster (q=2 n=3) with 50000 databases
... and we are experiencing the problem on both setups. We've been having
this problem for at least 3-4 months.

We've monitored:

- The number of open files: it's relatively low (both the system's total
  and the fds opened by beam.smp).
  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png

- The usage of RAM, total used and used by beam.smp:
  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
  It continuously grows, with regular spikes, until CouchDB gets killed by
  an OOM. After restart, the RAM usage is nice and low, with no spikes.

- /_node/_local/_system metrics, before and after restart. Values that
  significantly differ (before / after restart) are listed here:
  - uptime (obviously ;-))
  - memory.processes : + 3732 %
  - memory.processes_used : + 3735 %
  - memory.binary : + 17700 %
  - context_switches : + 17376 %
  - reductions : + 867832 %
  - garbage_collection_count : + 448248 %
  - words_reclaimed : + 112755 %
  - io_input : + 44226 %
  - io_output : + 157951 %

Before CouchDB restart:

{
  "uptime": 2712973,
  "memory": {
    "other": 7250289,
    "atom": 512625,
    "atom_used": 510002,
    "processes": 1877591424,
    "processes_used": 1877504920,
    "binary": 177468848,
    "code": 9653286,
    "ets": 16012736
  },
  "run_queue": 0,
  "ets_table_count": 102,
  "context_switches": 1621495509,
  "reductions": 968705947589,
  "garbage_collection_count": 331826928,
  "words_reclaimed": 269964293572,
  "io_input": 8812455,
  "io_output": 20733066,
  ...

After CouchDB restart:

{
  "uptime": 206,
  "memory": {
    "other": 6907493,
    "atom": 512625,
    "atom_used": 497769,
    "processes": 49001944,
    "processes_used": 48963168,
    "binary": 997032,
    "code": 9233842,
    "ets": 4779576
  },
  "run_queue": 0,
  "ets_table_count": 102,
  "context_switches": 1015486,
  "reductions": 111610788,
  "garbage_collection_count": 74011,
  "words_reclaimed": 239214127,
  "io_input": 19881,
  "io_output": 13118,
  ...

Adrien

On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <jerome.a...@anakeen.com> wrote:

> Ok, so I'll set up a cron job to journalize (every minute?) the output
> from "/_node/_local/_system" and wait for the next OOM kill.
>
> Any property from "_system" to look for in particular?
>
> Here is a link to the memory usage graph:
> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
>
> The memory usage varies, but the general trend is to go up with some
> regularity over a week until we reach OOM. When "beam.smp" is killed,
> it's reported as consuming 15 GB (as seen in the kernel's OOM trace in
> syslog).
>
> Thanks,
> Jérôme
>
> On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <kocol...@apache.org> wrote:
>
> > Hi Jérôme,
> >
> > Thanks for a well-written and detailed report (though the mailing list
> > strips attachments). The _system endpoint provides a lot of useful data
> > for debugging these kinds of situations; do you have a snapshot of the
> > output when the system was consuming a lot of memory?
> >
> > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> >
> > Adam
> >
> > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com>
> > > wrote:
> > >
> > > Hi,
> > >
> > > I'm having a hard time figuring out the high memory usage of a
> > > CouchDB server.
> > >
> > > What I'm observing is that the memory consumption of the "beam.smp"
> > > process gradually rises until it triggers the kernel's OOM
> > > (Out-Of-Memory) killer, which kills the "beam.smp" process.
> > >
> > > It also seems that many databases are not compacted: I've made a
> > > script to iterate over the databases to compute the fragmentation
> > > factor, and it seems I have around 2100 databases with a frag > 70%.
> > >
> > > We have a single CouchDB v2.1.1 server (configured with q=8 n=1) and
> > > around 2770 databases.
> > >
> > > The server initially had 4 GB of RAM, and we are now at 16 GB w/ 8
> > > vCPU, and it still regularly reaches OOM. From the monitoring I see
> > > that with 16 GB the OOM is triggered almost once per week (c.f.
> > > attached graph).
> > >
> > > The memory usage seems to increase gradually until it reaches OOM.
> > >
> > > The Couch server is mostly used by web clients with the PouchDB JS
> > > API.
> > >
> > > We have ~1300 distinct users and, by monitoring the netstat/TCP
> > > established connections, I guess we have around 100 users (maximum)
> > > at any given time. From what I understand of the application's logic,
> > > each user accesses 2 private databases (read/write) + 1 common
> > > database (read-only).
> > >
> > > On-disk usage of CouchDB's data directory is around 40 GB.
> > >
> > > Any ideas on what could cause such behavior (increasing memory usage
> > > over the course of a week)? Or how to find out what is happening
> > > behind the scenes?
> > >
> > > Regards,
> > > Jérôme
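
PS: in case it helps to reproduce the before/after comparison above, here is
a minimal sketch of how two "_system" snapshots could be diffed. It assumes
Python 3 and that the snapshots have already been saved as JSON files (e.g.
by a cron'd curl on /_node/_local/_system); file names are placeholders. The
growth is computed as (before - after) / after, which lines up with the
figures above (e.g. memory.processes: (1877591424 - 49001944) / 49001944 is
roughly +3732 %).

#!/usr/bin/env python3
"""Diff two saved /_node/_local/_system snapshots (sketch only)."""
import json
import sys


def flatten(d, prefix=""):
    """Flatten nested dicts: {"memory": {"binary": 1}} -> {"memory.binary": 1}."""
    out = {}
    for key, value in d.items():
        name = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        elif isinstance(value, (int, float)):
            out[name] = value
    return out


def main(before_path, after_path):
    # Load both snapshots and keep only the numeric metrics.
    with open(before_path) as f:
        before = flatten(json.load(f))
    with open(after_path) as f:
        after = flatten(json.load(f))

    # Relative growth of the "before" value over the "after" value, in percent.
    deltas = []
    for name in sorted(set(before) & set(after)):
        if after[name]:  # skip divide-by-zero
            pct = 100.0 * (before[name] - after[name]) / after[name]
            deltas.append((pct, name))

    # Print the metrics that grew the most first.
    for pct, name in sorted(deltas, reverse=True):
        print("%s : %+.0f %%" % (name, pct))


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])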