Hey guys. I bet it's a process mailbox leaking memory. I'm very interested in debugging issues like this too.

I'd suggest getting an Erlang shell and running the commands from this thread to see the top memory-consuming processes: https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html
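In case that link goes stale: from memory, the gist is to attach a remsh shell to the node and sort processes by memory. A rough, untested sketch (if the node happens to have the recon library available, recon:proc_count(memory, 10) does the same thing more conveniently):

    %% Attach a shell to the running node first, e.g. (node name and
    %% cookie are placeholders, use your own):
    %%   erl -name debug@127.0.0.1 -remsh couchdb@127.0.0.1 -setcookie <cookie>

    %% Top 10 processes by memory usage, largest first:
    lists:sublist(
      lists:reverse(
        lists:keysort(2,
          [{P, M} || P <- erlang:processes(),
                     {memory, M} <- [erlang:process_info(P, memory)]])),
      10).

Once you have the top pids, calling erlang:process_info(Pid) on each one shows the registered name, current function, and message queue length.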
One issue I will be reporting soon: if one of your nodes is down for some amount of time, it seems like every database independently tries and retries to query the missing node and fails, printing a lot of log messages for each db, which can overwhelm the logger process. If you have a lot of DBs this makes the problem worse, but for some reason it doesn't happen right away.

On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé <adrien.ve...@tolteck.com> wrote:

> Hi Jérôme and Adam,
>
> That's funny, because I'm investigating the exact same problem these days.
> We have two CouchDB setups:
> - a one-node server (q=2 n=1) with 5000 databases
> - a 3-node cluster (q=2 n=3) with 50000 databases
>
> ... and we are experiencing the problem on both setups. We've been having
> this problem for at least 3-4 months.
>
> We've monitored:
>
> - The number of open files: it's relatively low (both the system's total
>   and the fds opened by beam.smp).
>   https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png
>
> - The usage of RAM, both in total and by beam.smp:
>   https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
>   It grows continuously, with regular spikes, until CouchDB is killed by
>   an OOM. After a restart, the RAM usage is nice and low, with no spikes.
>
> - /_node/_local/_system metrics, before and after restart. Values that
>   significantly differ (before / after restart) are listed here:
>   - uptime (obviously ;-))
>   - memory.processes: +3732 %
>   - memory.processes_used: +3735 %
>   - memory.binary: +17700 %
>   - context_switches: +17376 %
>   - reductions: +867832 %
>   - garbage_collection_count: +448248 %
>   - words_reclaimed: +112755 %
>   - io_input: +44226 %
>   - io_output: +157951 %
>
> Before CouchDB restart:
>
> {
>   "uptime": 2712973,
>   "memory": {
>     "other": 7250289,
>     "atom": 512625,
>     "atom_used": 510002,
>     "processes": 1877591424,
>     "processes_used": 1877504920,
>     "binary": 177468848,
>     "code": 9653286,
>     "ets": 16012736
>   },
>   "run_queue": 0,
>   "ets_table_count": 102,
>   "context_switches": 1621495509,
>   "reductions": 968705947589,
>   "garbage_collection_count": 331826928,
>   "words_reclaimed": 269964293572,
>   "io_input": 8812455,
>   "io_output": 20733066,
>   ...
>
> After CouchDB restart:
>
> {
>   "uptime": 206,
>   "memory": {
>     "other": 6907493,
>     "atom": 512625,
>     "atom_used": 497769,
>     "processes": 49001944,
>     "processes_used": 48963168,
>     "binary": 997032,
>     "code": 9233842,
>     "ets": 4779576
>   },
>   "run_queue": 0,
>   "ets_table_count": 102,
>   "context_switches": 1015486,
>   "reductions": 111610788,
>   "garbage_collection_count": 74011,
>   "words_reclaimed": 239214127,
>   "io_input": 19881,
>   "io_output": 13118,
>   ...
>
> Adrien
>
> On Fri, Jun 14, 2019 at 3:11 PM, Jérôme Augé <jerome.a...@anakeen.com>
> wrote:
>
> > Ok, so I'll set up a cron job to log the output from
> > "/_node/_local/_system" (every minute?) and wait for the next OOM kill.
> >
> > Any property from "_system" to look for in particular?
> >
> > Here is a link to the memory usage graph:
> > https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
> >
> > The memory usage varies, but the general trend is to go up with some
> > regularity over a week until we reach OOM. When "beam.smp" is killed,
> > it's reported as consuming 15 GB (as seen in the kernel's OOM trace in
> > syslog).
> >
> > Thanks,
> > Jérôme
> > On Fri, Jun 14, 2019 at 1:48 PM, Adam Kocoloski <kocol...@apache.org>
> > wrote:
> >
> > > Hi Jérôme,
> > >
> > > Thanks for a well-written and detailed report (though the mailing
> > > list strips attachments). The _system endpoint provides a lot of
> > > useful data for debugging these kinds of situations; do you have a
> > > snapshot of the output from when the system was consuming a lot of
> > > memory?
> > >
> > > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> > >
> > > Adam
> > >
> > > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com>
> > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'm having a hard time figuring out the high memory usage of a
> > > > CouchDB server.
> > > >
> > > > What I'm observing is that the memory consumption of the "beam.smp"
> > > > process gradually rises until it triggers the kernel's OOM
> > > > (Out-Of-Memory) killer, which kills the "beam.smp" process.
> > > >
> > > > It also seems that many databases are not compacted: I've made a
> > > > script that iterates over the databases to compute the
> > > > fragmentation factor, and it seems I have around 2100 databases
> > > > with a frag > 70%.
> > > >
> > > > We have a single CouchDB v2.1.1 server (configured with q=8 n=1)
> > > > and around 2770 databases.
> > > >
> > > > The server initially had 4 GB of RAM, and we are now at 16 GB with
> > > > 8 vCPUs, and it still regularly reaches OOM. From the monitoring I
> > > > see that with 16 GB the OOM is triggered almost once per week
> > > > (c.f. attached graph).
> > > >
> > > > The memory usage seems to increase gradually until it reaches OOM.
> > > >
> > > > The CouchDB server is mostly used by web clients via the PouchDB
> > > > JS API.
> > > >
> > > > We have ~1300 distinct users, and by monitoring the established TCP
> > > > connections with netstat I guess we have around 100 users (maximum)
> > > > at any given time. From what I understand of the application's
> > > > logic, each user accesses 2 private databases (read/write) + 1
> > > > common database (read-only).
> > > >
> > > > On-disk usage of CouchDB's data directory is around 40 GB.
> > > >
> > > > Any ideas on what could cause such behavior (memory usage
> > > > increasing over the course of a week)? Or how to find out what is
> > > > happening behind the scenes?
> > > >
> > > > Regards,
> > > > Jérôme
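P.S. Jérôme, if you want to test my mailbox theory directly, the same remsh shell can list processes with oversized message queues. Another untested sketch (the 10000-message threshold is an arbitrary number I picked, adjust as needed):

    %% Processes with more than 10000 queued messages, with their
    %% registered name (if any) to help identify the culprit:
    [{P, erlang:process_info(P, registered_name), N}
     || P <- erlang:processes(),
        {message_queue_len, N} <- [erlang:process_info(P, message_queue_len)],
        N > 10000].

If a logger-related process shows up there, that would match the "logs overwhelming the logger process" issue I described above; a mailbox that never drains would also fit the huge memory.processes value Adrien measured before restart.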