Hi Jérôme and Adam,

That's funny, because I'm investigating the exact same problem these days.
We have two CouchDB setups:
- a one-node server (q=2 n=1) with 5000 databases
- a 3-node cluster (q=2 n=3) with 50000 databases

... and we are experiencing the problem on both setups. We've been having
this problem for at least 3-4 months.

We've monitored:

- The number of open files: it's relatively low (both the system's total and
the fds opened by beam.smp).
  https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png

- The usage of RAM, total used and used by beam.smp
  https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png
  It grows continuously, with regular spikes, until CouchDB gets killed by the
OOM killer. After a restart, RAM usage is low again, with no spikes.

- /_node/_local/_system metrics, before and after restart. Values that
significantly differ (before / after restart) are listed here:
  - uptime (obviously ;-))
  - memory.processes : + 3732 %
  - memory.processes_used : + 3735 %
  - memory.binary : + 17700 %
  - context_switches : + 159577 %
  - reductions : + 867832 %
  - garbage_collection_count : + 448248 %
  - words_reclaimed : + 112755 %
  - io_input : + 44226 %
  - io_output : + 157951 %

Before CouchDB restart:
{
  "uptime":2712973,
  "memory":{
    "other":7250289,
    "atom":512625,
    "atom_used":510002,
    "processes":1877591424,
    "processes_used":1877504920,
    "binary":177468848,
    "code":9653286,
    "ets":16012736
  },
  "run_queue":0,
  "ets_table_count":102,
  "context_switches":1621495509,
  "reductions":968705947589,
  "garbage_collection_count":331826928,
  "words_reclaimed":269964293572,
  "io_input":8812455,
  "io_output":20733066,
  ...

After CouchDB restart:
{
  "uptime":206,
  "memory":{
    "other":6907493,
    "atom":512625,
    "atom_used":497769,
    "processes":49001944,
    "processes_used":48963168,
    "binary":997032,
    "code":9233842,
    "ets":4779576
  },
  "run_queue":0,
  "ets_table_count":102,
  "context_switches":1015486,
  "reductions":111610788,
  "garbage_collection_count":74011,
  "words_reclaimed":239214127,
  "io_input":19881,
  "io_output":13118,
  ...
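
For reference, the percentages above can be reproduced with something like the
following (a minimal sketch; before.json and after.json are saved outputs of
GET /_node/_local/_system, and the file names are placeholders):

import json

def flatten(d, prefix=""):
    # Flatten nested dicts into {"memory.processes": value, ...}
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out.update(flatten(value, prefix + key + "."))
        elif isinstance(value, (int, float)):
            out[prefix + key] = value
    return out

with open("before.json") as f:   # snapshot taken just before the restart
    before = flatten(json.load(f))
with open("after.json") as f:    # snapshot taken just after the restart
    after = flatten(json.load(f))

for name, old in sorted(after.items()):
    new = before.get(name)
    if new is not None and old:
        print("%s : %+d %%" % (name, round(100.0 * (new - old) / old)))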

Adrien

On Fri, Jun 14, 2019 at 15:11, Jérôme Augé <jerome.a...@anakeen.com> wrote:

> Ok, so I'll set up a cron job to log (every minute?) the output of
> "/_node/_local/_system" and wait for the next OOM kill.
>
> Any property from "_system" to look for in particular?
>
> Here is a link to the memory usage graph:
> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png
>
> The memory usage varies, but the general trend is to go up with some
> regularity over a week until we reach OOM. When "beam.smp" is killed, it's
> reported as consuming 15 GB (as seen in the kernel's OOM trace in syslog).
>
> Thanks,
> Jérôme
>
> On Fri, Jun 14, 2019 at 13:48, Adam Kocoloski <kocol...@apache.org> wrote:
>
> > Hi Jérôme,
> >
> > Thanks for a well-written and detailed report (though the mailing list
> > strips attachments). The _system endpoint provides a lot of useful data for
> > debugging these kinds of situations; do you have a snapshot of the output
> > when the system was consuming a lot of memory?
> >
> >
> >
> > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system
> >
> > Adam
> >
> > > On Jun 14, 2019, at 5:44 AM, Jérôme Augé <jerome.a...@anakeen.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > I'm having a hard time figuring out the high memory usage of a CouchDB
> > > server.
> > >
> > > What I'm observing is that the memory consumption of the "beam.smp"
> > > process gradually rises until it triggers the kernel's OOM (Out-Of-Memory)
> > > killer, which kills the "beam.smp" process.
> > >
> > > It also seems that many databases are not compacted: I've made a script
> > > to iterate over the databases and compute the fragmentation factor, and it
> > > seems I have around 2100 databases with a frag > 70%.
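> > >
> > > The check is roughly the following (a sketch; the host, the credentials
> > > and the exact frag formula are assumptions on my side):
> > >
> > > import base64, json, urllib.parse, urllib.request
> > >
> > > BASE = "http://127.0.0.1:5984"
> > > AUTH = base64.b64encode(b"admin:password").decode()  # placeholders
> > >
> > > def get(path):
> > >     req = urllib.request.Request(BASE + path,
> > >                                  headers={"Authorization": "Basic " + AUTH})
> > >     with urllib.request.urlopen(req) as resp:
> > >         return json.load(resp)
> > >
> > > # A database counts as fragmented when most of its file is not live data.
> > > fragmented = 0
> > > for db in get("/_all_dbs"):
> > >     sizes = get("/" + urllib.parse.quote(db, safe="")).get("sizes", {})
> > >     file_size, active = sizes.get("file", 0), sizes.get("active", 0)
> > >     if file_size and (file_size - active) / file_size > 0.70:
> > >         fragmented += 1
> > > print(fragmented, "databases with frag > 70%")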
> > >
> > > We have a single CouchDB v2.1.1 server (configured with q=8 n=1) and
> > > around 2770 databases.
> > >
> > > The server initially had 4 GB of RAM, and we are now at 16 GB with 8
> > > vCPUs, and it still regularly reaches OOM. From the monitoring I see that
> > > with 16 GB the OOM is triggered almost once per week (cf. attached graph).
> > >
> > > The memory usage seems to increase gradually until it reaches OOM.
> > >
> > > The Couch server is mostly used by web clients with the PouchDB JS API.
> > >
> > > We have ~1300 distinct users, and by monitoring the established TCP
> > > connections (netstat) I guess we have around 100 users (maximum) at any
> > > given time. From what I understand of the application's logic, each user
> > > accesses 2 private databases (read/write) + 1 common database (read-only).
> > >
> > > On-disk usage of CouchDB's data directory is around 40 GB.
> > >
> > > Any ideas on what could cause such behavior (increasing memory usage
> > > over the course of a week)? Or how to find out what is happening behind
> > > the scenes?
> > >
> > > Regards,
> > > Jérôme
> >
>
