Vladimir, I would love to have your debugging skills and feedback on this, thanks for offering! Unfortunately this only happens on real servers, after weeks of continuous real-life requests. In the past, we tried to reproduce it on test servers, but the memory leak doesn't happen if CouchDB is not very active (or it does happen, but too slowly to be noticeable). And these servers contain protected data that our rules don't allow us to share.
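For anyone who wants to try reproducing this on a disposable test node, here is a rough sketch of the kind of sustained traffic involved: many small databases receiving continuous small writes, which mimics the production workload described further down the thread. The URL, credentials and database names below are hypothetical placeholders, and there is no guarantee this exact pattern triggers the leak.

--8<--
%% Sketch only: adjust URL/credentials, run from an Erlang shell (erl) on a test machine.
inets:start().
Auth = {"Authorization", "Basic " ++ base64:encode_to_string("admin:secret")}.
Base = "http://127.0.0.1:5984".
[begin
     Db = Base ++ "/leaktest-" ++ integer_to_list(N),
     httpc:request(put, {Db, [Auth], "application/json", ""}, [], []),
     [httpc:request(put, {Db ++ "/doc-" ++ integer_to_list(I),
                          [Auth], "application/json", "{\"v\":1}"}, [], [])
      || I <- lists:seq(1, 20)]
 end || N <- lists:seq(1, 5000)].
%% Run several copies of the last expression in parallel (e.g. wrapped in spawn/1)
%% to keep the node busy for days rather than minutes.
-->8--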
I also searched for the crash dump (sudo find / -name '*.dump'; sudo find / -name 'erl_crash*') but couldn't find it; do you know where it could be located? We already have swap on these machines. Next time the system comes close to the OOM point, I will try to see whether they use swap or not. Le mer. 26 juin 2019 à 12:43, Vladimir Ralev <vladimir.ra...@gmail.com> a écrit : > Ouch. I have an idea, can you add a bunch of swap on one of those machines, > say 20gigs, this should allow the machines to work for a little longer in > slow mode instead of running out of memory, which will buy you time to run > more diagnostics after the incident occurs. This will probably reduce the > response times a lot though and might break your apps. > > Also can you upload that erl_crash.dump file that the crash generated? > > PS I would love to get shell access to a system like that, if you can > reproduce the issue on a test machine and give me access I should be able > to come up with something. Free of charge. > > On Wed, Jun 26, 2019 at 1:17 PM Adrien Vergé <adrien.ve...@tolteck.com> > wrote: > > > Hi all, > > > > Here is more feedback, since one of our CouchDB servers crashed last > night. > > > > - Setup: CouchDB 2.3.1 on a 3-node cluster (q=2 n=3) with ~50k small > > databases. > > > > - Only one of the 3 nodes crashed. Others should crash in a few days > (they > > usually crash and restart every ~3 weeks). > > > > - update_lru_on_read = false > > > > - The extra memory consumption comes from the beam.smp process (see graph > > below). > > > > - The crash is an OOM, see the last log lines before restart: > > > > eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type > > "old_heap"). > > Crash dump is being written to: erl_crash.dump... > > [os_mon] memory supervisor port (memsup): Erlang has closed > > [os_mon] cpu supervisor port (cpu_sup): Erlang has closed > > > > - Over the last weeks, beam.smp memory usage kept increasing and increasing. > > See > > the graph I made at https://framapic.org/jnHAyVEKq98k/kXCQv3pyUdz0.png > > > > - /_node/_local/_system metrics look normal. The difference between an > > "about > > to crash" node and a "freshly restarted and lots of free RAM" node is > in > > uptime, memory.processes_used, memory.binary, context_switches, > > reductions, > > garbage_collection_count, io_input... as previously discussed with > Adam. > > > > - This command gives exactly the same output on an "about to crash" node > > as > > on a "freshly restarted and lots of free RAM" node: > > > > MQSizes2 = lists:map(fun(A) -> {_,B} = case > > process_info(A,total_heap_size) > > of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} end, > > {B,A} > > end, processes()). > > > > Le jeu. 20 juin 2019 à 17:08, Jérôme Augé <jerome.a...@anakeen.com> a > > écrit : > > > > > We are going to plan an upgrade from 2.1.1 to 2.3.1 in the coming > weeks. > > > > > > I have a side question concerning CouchDB's upgrades: is the database > > > binary compatible between v2.1.1 and v2.3.1? In case we ever need > to > > > downgrade back to 2.1.1, can the binary data be kept? > > > > > > Regards, > > > Jérôme > > > > > > Le mer. 19 juin 2019 à 08:59, Jérôme Augé <jerome.a...@anakeen.com> a > > > écrit : > > > > > > > Thanks Adam for your explanations! > > > > > > > > The "update_lru_on_read" is already set to false on this instance (I > > had > > > > already seen the comments on these pull-requests).
> > > > > > > > We are effectively running an "old" 2.1.1 version, and we have > advised > > > the > > > > client that an upgrade might be needed to sort out (or further > > > investigate) > > > > these problems. > > > > > > > > Thanks again, > > > > Jérôme > > > > > > > > > > > > > > > > Le mar. 18 juin 2019 à 18:59, Adam Kocoloski <kocol...@apache.org> a > > > > écrit : > > > > > > > >> Hi Jérôme, definitely useful. > > > >> > > > >> The “run_queue” is the number of Erlang processes in a runnable > state > > > >> that are not currently executing on a scheduler. When that value is > > > greater > > > >> than zero it means the node is hitting some compute limitations. > > Seeing > > > a > > > >> small positive value from time to time is no problem. > > > >> > > > >> Your last six snapshots show a message queue backlog in > couch_server. > > > >> That could be what caused the node to OOM. The couch_server process > > is a > > > >> singleton and if it accumulates a large message backlog there are > > > limited > > > >> backpressure or scaling mechanisms to help it recover. I noticed > > you’re > > > >> running 2.1.1; there were a couple of important enhancements to > reduce > > > the > > > >> message flow through couch_server in more recent releases: > > > >> > > > >> 2.2.0: https://github.com/apache/couchdb/pull/1118 < > > > >> https://github.com/apache/couchdb/pull/1118> > > > >> 2.3.1: https://github.com/apache/couchdb/pull/1593 < > > > >> https://github.com/apache/couchdb/pull/1593> > > > >> > > > >> The change in 2.2.0 is just a change in the default configuration; > you > > > >> can try applying it to your server by setting: > > > >> > > > >> [couchdb] > > > >> update_lru_on_read = false > > > >> > > > >> The changes in 2.3.1 offer additional benefits for couch_server > > message > > > >> throughput but you’ll need to upgrade to get them. > > > >> > > > >> Cheers, Adam > > > >> > > > >> P.S. II don’t know what’s going on with the negative memory.other > > value > > > >> there, it’s not intentionally meaningful :) > > > >> > > > >> > > > >> > On Jun 18, 2019, at 11:30 AM, Jérôme Augé < > jerome.a...@anakeen.com> > > > >> wrote: > > > >> > > > > >> > "beam.smp" just got killed by OOM, but I was not in front of the > > > >> machine to > > > >> > perform this command... > > > >> > > > > >> > However, here is the CouchDB log of "/_node/_local/_system" for > the > > 30 > > > >> > minutes preceding the OOM: > > > >> > - > > > >> > > > > >> > > > > > > https://gist.github.com/eguaj/1fba3eda4667a999fa691ff1902f04fc#file-log-couchdb-system-2019-06-18-log > > > >> > > > > >> > I guess the spike that triggers the OOM is so quick (< 1min) that > it > > > >> does > > > >> > not gets logged (I log every minute). > > > >> > > > > >> > Is there anything that can be used/deduced from the last line > logged > > > at > > > >> > 2019-06-18T16:00:14+0200? > > > >> > > > > >> > At 15:55:25, the "run_queue" is at 36: what does it means? Number > of > > > >> active > > > >> > concurrent requests? > > > >> > > > > >> > From 15:56 to 16:00 the "memory"."other" value is a negative > value: > > > >> does it > > > >> > means something special? or just an integer overflow? > > > >> > > > > >> > > > > >> > > > > >> > Le lun. 
17 juin 2019 à 14:09, Vladimir Ralev < > > > vladimir.ra...@gmail.com> > > > >> a > > > >> > écrit : > > > >> > > > > >> >> Alright, I think the issue will be more visible towards the OOM > > > point, > > > >> >> however for now since you have the system live with a leak, it > will > > > be > > > >> >> useful to repeat the same steps, but replace > > > >> >> "message_queue_len" with "total_heap_size" then with "heap_size" > > then > > > >> with > > > >> >> "stack_size" and then with "reductions". > > > >> >> > > > >> >> For example: > > > >> >> > > > >> >> MQSizes2 = lists:map(fun(A) -> {_,B} = case > > > >> process_info(A,total_heap_size) > > > >> >> of {XV,XB} -> {XV, XB}; _ERR -> io:format("~p",[_ERR]),{ok, 0} > end, > > > >> {B,A} > > > >> >> end, processes()). > > > >> >> > > > >> >> Then same with the other params. > > > >> >> > > > >> >> That can shed some light, otherwise someone will need to monitor > > > >> process > > > >> >> count and go into them by age and memory patterns. > > > >> >> > > > >> >> On Mon, Jun 17, 2019 at 2:55 PM Jérôme Augé < > > jerome.a...@anakeen.com > > > > > > > >> >> wrote: > > > >> >> > > > >> >>> The 2G consumption is from Adrien's system. > > > >> >>> > > > >> >>> On mine, since I setup the logging of "/_node/_local/_system" > > > output : > > > >> >>> - on june 14th max memory.processes was 2.6 GB > > > >> >>> - on june 15th max memory.processes was 4.7 GB > > > >> >>> - on june 16th max memory.processes was 7.0 GB > > > >> >>> - today (june 17th) max memory.processes was 8.0 GB (and with an > > > >> >>> interactive top I see spikes at 12 GB) > > > >> >>> > > > >> >>> The memory.processes seems to be steadily increasing over the > > days, > > > >> and > > > >> >> I'm > > > >> >>> soon expecting the out-of-memory condition to be triggered in a > > > >> couple of > > > >> >>> days. > > > >> >>> > > > >> >>> Le lun. 17 juin 2019 à 11:53, Vladimir Ralev < > > > >> vladimir.ra...@gmail.com> > > > >> >> a > > > >> >>> écrit : > > > >> >>> > > > >> >>>> Nothing to see here, the message queue stat from Adam's advice > is > > > >> >>> accurate. > > > >> >>>> Note that you should run this only when there is already an > > > >> >> unreasonable > > > >> >>>> amount memory leaked/consumed. > > > >> >>>> > > > >> >>>> But now I realise you had "processes":1877591424 before restart > > > from > > > >> >> the > > > >> >>>> stats above which is less than 2G. Are you using only 2 gigs of > > > RAM? > > > >> I > > > >> >>> got > > > >> >>>> confused by the initial comment and I thought you had 15GB RAM. > > If > > > >> you > > > >> >>> are > > > >> >>>> only using 2 gigs of RAM, it's probably not enough for your > > > workload. 
> > > >> >>>> > > > >> >>>> On Mon, Jun 17, 2019 at 12:15 PM Jérôme Augé < > > > >> jerome.a...@anakeen.com> > > > >> >>>> wrote: > > > >> >>>> > > > >> >>>>> That command seems to work, and here is the output: > > > >> >>>>> > > > >> >>>>> --8<-- > > > >> >>>>> # /opt/couchdb/bin/remsh < debug.2.remsh > > > >> >>>>> Eshell V7.3 (abort with ^G) > > > >> >>>>> (remsh22574@127.0.0.1)1> [{0,<0.0.0>}, > > > >> >>>>> {0,<0.3.0>}, > > > >> >>>>> {0,<0.6.0>}, > > > >> >>>>> {0,<0.7.0>}, > > > >> >>>>> {0,<0.9.0>}, > > > >> >>>>> {0,<0.10.0>}, > > > >> >>>>> {0,<0.11.0>}, > > > >> >>>>> {0,<0.12.0>}, > > > >> >>>>> {0,<0.14.0>}, > > > >> >>>>> {0,<0.15.0>}, > > > >> >>>>> {0,<0.16.0>}, > > > >> >>>>> {0,<0.17.0>}, > > > >> >>>>> {0,<0.18.0>}, > > > >> >>>>> {0,<0.19.0>}, > > > >> >>>>> {0,<0.20.0>}, > > > >> >>>>> {0,<0.21.0>}, > > > >> >>>>> {0,<0.22.0>}, > > > >> >>>>> {0,<0.23.0>}, > > > >> >>>>> {0,<0.24.0>}, > > > >> >>>>> {0,<0.25.0>}, > > > >> >>>>> {0,<0.26.0>}, > > > >> >>>>> {0,<0.27.0>}, > > > >> >>>>> {0,<0.28.0>}, > > > >> >>>>> {0,<0.29.0>}, > > > >> >>>>> {0,<0.31.0>}, > > > >> >>>>> {0,<0.32.0>}, > > > >> >>>>> {0,<0.33.0>}, > > > >> >>>>> {0,...}, > > > >> >>>>> {...}] > > > >> >>>>> (remsh22574@127.0.0.1)2> {0,<0.38.0>} > > > >> >>>>> (remsh22574@127.0.0.1)3> > > > [{current_function,{erl_eval,do_apply,6}}, > > > >> >>>>> {initial_call,{erlang,apply,2}}, > > > >> >>>>> {status,running}, > > > >> >>>>> {message_queue_len,0}, > > > >> >>>>> {messages,[]}, > > > >> >>>>> {links,[<0.32.0>]}, > > > >> >>>>> {dictionary,[]}, > > > >> >>>>> {trap_exit,false}, > > > >> >>>>> {error_handler,error_handler}, > > > >> >>>>> {priority,normal}, > > > >> >>>>> {group_leader,<0.31.0>}, > > > >> >>>>> {total_heap_size,5172}, > > > >> >>>>> {heap_size,2586}, > > > >> >>>>> {stack_size,24}, > > > >> >>>>> {reductions,24496}, > > > >> >>>>> {garbage_collection,[{min_bin_vheap_size,46422}, > > > >> >>>>> {min_heap_size,233}, > > > >> >>>>> {fullsweep_after,65535}, > > > >> >>>>> {minor_gcs,1}]}, > > > >> >>>>> {suspending,[]}] > > > >> >>>>> (remsh22574@127.0.0.1)4> *** Terminating erlang (' > > > >> >> remsh22574@127.0.0.1 > > > >> >>> ') > > > >> >>>>> -->8-- > > > >> >>>>> > > > >> >>>>> What should I be looking for in this output? > > > >> >>>>> > > > >> >>>>> Le ven. 14 juin 2019 à 17:30, Vladimir Ralev < > > > >> >> vladimir.ra...@gmail.com > > > >> >>>> > > > >> >>>> a > > > >> >>>>> écrit : > > > >> >>>>> > > > >> >>>>>> That means your couch is creating and destroying processes > too > > > >> >>>> rapidly. I > > > >> >>>>>> haven't seen this, however I think Adam's message_queues stat > > > above > > > >> >>>> does > > > >> >>>>>> the same thing. I didn't notice you can get it from there. > > > >> >>>>>> > > > >> >>>>>> Either way it will be useful if you can get the shell to > work: > > > >> >>>>>> Try this command instead for the first, the rest will be the > > > same: > > > >> >>>>>> > > > >> >>>>>> MQSizes2 = lists:map(fun(A) -> {_,B} = case > > > >> >>>>>> process_info(A,message_queue_len) of {XV,XB} -> {XV, XB}; > _ERR > > -> > > > >> >>>>>> io:format("~p",[_ERR]),{ok, 0} end, {B,A} end, processes()). 
> > > >> >>>>>> > > > >> >>>>>> On Fri, Jun 14, 2019 at 5:52 PM Jérôme Augé < > > > >> >> jerome.a...@anakeen.com > > > >> >>>> > > > >> >>>>>> wrote: > > > >> >>>>>> > > > >> >>>>>>> I tried the following, but it seems to fail on the first > > > command: > > > >> >>>>>>> > > > >> >>>>>>> --8<-- > > > >> >>>>>>> # /opt/couchdb/bin/remsh > > > >> >>>>>>> Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8] > > > >> >>>>>>> [async-threads:10] [hipe] [kernel-poll:false] > > > >> >>>>>>> > > > >> >>>>>>> Eshell V7.3 (abort with ^G) > > > >> >>>>>>> (couchdb@127.0.0.1)1> MQSizes2 = lists:map(fun(A) -> {_,B} > = > > > >> >>>>>>> process_info(A,message_queue_len), {B,A} end, processes()). > > > >> >>>>>>> ** exception error: no match of right hand side value > > undefined > > > >> >>>>>>> -->8-- > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> Le ven. 14 juin 2019 à 16:08, Vladimir Ralev < > > > >> >>>> vladimir.ra...@gmail.com > > > >> >>>>>> > > > >> >>>>>> a > > > >> >>>>>>> écrit : > > > >> >>>>>>> > > > >> >>>>>>>> Hey guys. I bet it's a mailbox leaking memory. I am very > > > >> >>> interested > > > >> >>>>> in > > > >> >>>>>>>> debugging issues like this too. > > > >> >>>>>>>> > > > >> >>>>>>>> I can suggest to get an erlang shell and run these commands > > to > > > >> >>> see > > > >> >>>>> the > > > >> >>>>>>> top > > > >> >>>>>>>> memory consuming processes > > > >> >>>>>>>> > > > >> >>> > > https://www.mail-archive.com/user@couchdb.apache.org/msg29365.html > > > >> >>>>>>>> > > > >> >>>>>>>> One issue I will be reporting soon is if one of your nodes > is > > > >> >>> down > > > >> >>>>> for > > > >> >>>>>>> some > > > >> >>>>>>>> amount of time, it seems like all databases independently > try > > > >> >> and > > > >> >>>>> retry > > > >> >>>>>>> to > > > >> >>>>>>>> query the missing node and fail, resulting in printing a > lot > > of > > > >> >>>> logs > > > >> >>>>>> for > > > >> >>>>>>>> each db which can overwhelm the logger process. If you > have a > > > >> >> lot > > > >> >>>> of > > > >> >>>>>> DBs > > > >> >>>>>>>> this makes the problem worse, but it doesn't happen right > > away > > > >> >>> for > > > >> >>>>> some > > > >> >>>>>>>> reason. > > > >> >>>>>>>> > > > >> >>>>>>>> On Fri, Jun 14, 2019 at 4:25 PM Adrien Vergé < > > > >> >>>>> adrien.ve...@tolteck.com > > > >> >>>>>>> > > > >> >>>>>>>> wrote: > > > >> >>>>>>>> > > > >> >>>>>>>>> Hi Jérôme and Adam, > > > >> >>>>>>>>> > > > >> >>>>>>>>> That's funny, because I'm investigating the exact same > > > >> >> problem > > > >> >>>>> these > > > >> >>>>>>>> days. > > > >> >>>>>>>>> We have a two CouchDB setups: > > > >> >>>>>>>>> - a one-node server (q=2 n=1) with 5000 databases > > > >> >>>>>>>>> - a 3-node cluster (q=2 n=3) with 50000 databases > > > >> >>>>>>>>> > > > >> >>>>>>>>> ... and we are experiencing the problem on both setups. > > We've > > > >> >>>> been > > > >> >>>>>>> having > > > >> >>>>>>>>> this problem for at least 3-4 months. > > > >> >>>>>>>>> > > > >> >>>>>>>>> We've monitored: > > > >> >>>>>>>>> > > > >> >>>>>>>>> - The number of open files: it's relatively low (both the > > > >> >>>> system's > > > >> >>>>>>> total > > > >> >>>>>>>>> and or fds opened by beam.smp). 
> > > >> >>>>>>>>> https://framapic.org/wQUf4fLhNIm7/oa2VHZyyoPp9.png > > > >> >>>>>>>>> > > > >> >>>>>>>>> - The usage of RAM, total used and used by beam.smp > > > >> >>>>>>>>> https://framapic.org/DBWIhX8ZS8FU/MxbS3BmO0WpX.png > > > >> >>>>>>>>> It continuously grows, with regular spikes, until killing > > > >> >>>> CouchDB > > > >> >>>>>>> with > > > >> >>>>>>>> an > > > >> >>>>>>>>> OOM. After restart, the RAM usage is nice and low, and no > > > >> >>> spikes. > > > >> >>>>>>>>> > > > >> >>>>>>>>> - /_node/_local/_system metrics, before and after restart. > > > >> >>> Values > > > >> >>>>>> that > > > >> >>>>>>>>> significantly differ (before / after restart) are listed > > > >> >> here: > > > >> >>>>>>>>> - uptime (obviously ;-)) > > > >> >>>>>>>>> - memory.processes : + 3732 % > > > >> >>>>>>>>> - memory.processes_used : + 3735 % > > > >> >>>>>>>>> - memory.binary : + 17700 % > > > >> >>>>>>>>> - context_switches : + 17376 % > > > >> >>>>>>>>> - reductions : + 867832 % > > > >> >>>>>>>>> - garbage_collection_count : + 448248 % > > > >> >>>>>>>>> - words_reclaimed : + 112755 % > > > >> >>>>>>>>> - io_input : + 44226 % > > > >> >>>>>>>>> - io_output : + 157951 % > > > >> >>>>>>>>> > > > >> >>>>>>>>> Before CouchDB restart: > > > >> >>>>>>>>> { > > > >> >>>>>>>>> "uptime":2712973, > > > >> >>>>>>>>> "memory":{ > > > >> >>>>>>>>> "other":7250289, > > > >> >>>>>>>>> "atom":512625, > > > >> >>>>>>>>> "atom_used":510002, > > > >> >>>>>>>>> "processes":1877591424, > > > >> >>>>>>>>> "processes_used":1877504920, > > > >> >>>>>>>>> "binary":177468848, > > > >> >>>>>>>>> "code":9653286, > > > >> >>>>>>>>> "ets":16012736 > > > >> >>>>>>>>> }, > > > >> >>>>>>>>> "run_queue":0, > > > >> >>>>>>>>> "ets_table_count":102, > > > >> >>>>>>>>> "context_switches":1621495509, > > > >> >>>>>>>>> "reductions":968705947589, > > > >> >>>>>>>>> "garbage_collection_count":331826928, > > > >> >>>>>>>>> "words_reclaimed":269964293572, > > > >> >>>>>>>>> "io_input":8812455, > > > >> >>>>>>>>> "io_output":20733066, > > > >> >>>>>>>>> ... > > > >> >>>>>>>>> > > > >> >>>>>>>>> After CouchDB restart: > > > >> >>>>>>>>> { > > > >> >>>>>>>>> "uptime":206, > > > >> >>>>>>>>> "memory":{ > > > >> >>>>>>>>> "other":6907493, > > > >> >>>>>>>>> "atom":512625, > > > >> >>>>>>>>> "atom_used":497769, > > > >> >>>>>>>>> "processes":49001944, > > > >> >>>>>>>>> "processes_used":48963168, > > > >> >>>>>>>>> "binary":997032, > > > >> >>>>>>>>> "code":9233842, > > > >> >>>>>>>>> "ets":4779576 > > > >> >>>>>>>>> }, > > > >> >>>>>>>>> "run_queue":0, > > > >> >>>>>>>>> "ets_table_count":102, > > > >> >>>>>>>>> "context_switches":1015486, > > > >> >>>>>>>>> "reductions":111610788, > > > >> >>>>>>>>> "garbage_collection_count":74011, > > > >> >>>>>>>>> "words_reclaimed":239214127, > > > >> >>>>>>>>> "io_input":19881, > > > >> >>>>>>>>> "io_output":13118, > > > >> >>>>>>>>> ... > > > >> >>>>>>>>> > > > >> >>>>>>>>> Adrien > > > >> >>>>>>>>> > > > >> >>>>>>>>> Le ven. 14 juin 2019 à 15:11, Jérôme Augé < > > > >> >>>> jerome.a...@anakeen.com > > > >> >>>>>> > > > >> >>>>>> a > > > >> >>>>>>>>> écrit : > > > >> >>>>>>>>> > > > >> >>>>>>>>>> Ok, so I'll setup a cron job to journalize (every > minute?) > > > >> >>> the > > > >> >>>>>> output > > > >> >>>>>>>>> from > > > >> >>>>>>>>>> "/_node/_local/_system" and wait for the next OOM kill. > > > >> >>>>>>>>>> > > > >> >>>>>>>>>> Any property from "_system" to look for in particular? 
> > > >> >>>>>>>>>> > > > >> >>>>>>>>>> Here is a link to the memory usage graph: > > > >> >>>>>>>>>> https://framapic.org/IzcD4Y404hlr/06rm0Ji4TpKu.png > > > >> >>>>>>>>>> > > > >> >>>>>>>>>> The memory usage varies, but the general trend is to go > up > > > >> >>> with > > > >> >>>>>> some > > > >> >>>>>>>>>> regularity over a week until we reach OOM. When > "beam.smp" > > > >> >> is > > > >> >>>>>> killed, > > > >> >>>>>>>>> it's > > > >> >>>>>>>>>> reported as consuming 15 GB (as seen in the kernel's OOM > > > >> >>> trace > > > >> >>>> in > > > >> >>>>>>>>> syslog). > > > >> >>>>>>>>>> > > > >> >>>>>>>>>> Thanks, > > > >> >>>>>>>>>> Jérôme > > > >> >>>>>>>>>> > > > >> >>>>>>>>>> Le ven. 14 juin 2019 à 13:48, Adam Kocoloski < > > > >> >>>>> kocol...@apache.org> > > > >> >>>>>> a > > > >> >>>>>>>>>> écrit : > > > >> >>>>>>>>>> > > > >> >>>>>>>>>>> Hi Jérôme, > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>>> Thanks for a well-written and detailed report (though > the > > > >> >>>>> mailing > > > >> >>>>>>>> list > > > >> >>>>>>>>>>> strips attachments). The _system endpoint provides a lot > > > >> >> of > > > >> >>>>>> useful > > > >> >>>>>>>> data > > > >> >>>>>>>>>> for > > > >> >>>>>>>>>>> debugging these kinds of situations; do you have a > > > >> >> snapshot > > > >> >>>> of > > > >> >>>>>> the > > > >> >>>>>>>>> output > > > >> >>>>>>>>>>> when the system was consuming a lot of memory? > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>> > > > >> >>>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>> > > > >> >>>>>> > > > >> >>>>> > > > >> >>>> > > > >> >>> > > > >> >> > > > >> > > > > > > http://docs.couchdb.org/en/stable/api/server/common.html#node-node-name-system > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>>> Adam > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>>>> On Jun 14, 2019, at 5:44 AM, Jérôme Augé < > > > >> >>>>>>> jerome.a...@anakeen.com> > > > >> >>>>>>>>>>> wrote: > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> Hi, > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> I'm having a hard time figuring out the high memory > > > >> >> usage > > > >> >>>> of > > > >> >>>>> a > > > >> >>>>>>>>> CouchDB > > > >> >>>>>>>>>>> server. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> What I'm observing is that the memory consumption from > > > >> >>> the > > > >> >>>>>>>> "beam.smp" > > > >> >>>>>>>>>>> process gradually rises until it triggers the kernel's > > > >> >> OOM > > > >> >>>>>>>>>> (Out-Of-Memory) > > > >> >>>>>>>>>>> which kill the "beam.smp" process. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> It also seems that many databases are not compacted: > > > >> >> I've > > > >> >>>>> made > > > >> >>>>>> a > > > >> >>>>>>>>> script > > > >> >>>>>>>>>>> to iterate over the databases to compute de > fragmentation > > > >> >>>>> factor, > > > >> >>>>>>> and > > > >> >>>>>>>>> it > > > >> >>>>>>>>>>> seems I have around 2100 databases with a frag > 70%. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> We have a single CouchDB v2.1.1server (configured with > > > >> >>> q=8 > > > >> >>>>> n=1) > > > >> >>>>>>> and > > > >> >>>>>>>>>>> around 2770 databases. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> The server initially had 4 GB of RAM, and we are now > > > >> >> with > > > >> >>>> 16 > > > >> >>>>> GB > > > >> >>>>>>> w/ > > > >> >>>>>>>> 8 > > > >> >>>>>>>>>>> vCPU, and it still regularly reaches OOM. From the > > > >> >>>> monitoring I > > > >> >>>>>> see > > > >> >>>>>>>>> that > > > >> >>>>>>>>>>> with 16 GB the OOM is almost triggered once per week > > > >> >> (c.f. 
> > > >> >>>>>> attached > > > >> >>>>>>>>>> graph). > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> The memory usage seems to increase gradually until it > > > >> >>>> reaches > > > >> >>>>>>> OOM. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> The Couch server is mostly used by web clients with the > > > >> >>>>> PouchDB > > > >> >>>>>>> JS > > > >> >>>>>>>>> API. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> We have ~1300 distinct users and by monitoring the > > > >> >>>>> netstat/TCP > > > >> >>>>>>>>>>> established connections I guess we have around 100 > > > >> >>> (maximum) > > > >> >>>>>> users > > > >> >>>>>>> at > > > >> >>>>>>>>> any > > > >> >>>>>>>>>>> given time. From what I understanding of the > > > >> >> application's > > > >> >>>>> logic, > > > >> >>>>>>>> each > > > >> >>>>>>>>>> user > > > >> >>>>>>>>>>> access 2 private databases (read/write) + 1 common > > > >> >> database > > > >> >>>>>>>>> (read-only). > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> On-disk usage of CouchDB's data directory is around 40 > > > >> >>> GB. > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> Any ideas on what could cause such behavior (increasing > > > >> >>>>> memory > > > >> >>>>>>>> usage > > > >> >>>>>>>>>>> over the course of a week)? Or how to find what is > > > >> >>> happening > > > >> >>>>>> behind > > > >> >>>>>>>> the > > > >> >>>>>>>>>>> scene? > > > >> >>>>>>>>>>>> > > > >> >>>>>>>>>>>> Regards, > > > >> >>>>>>>>>>>> Jérôme > > > >> >>>>>>>>>>> > > > >> >>>>>>>>>> > > > >> >>>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>> > > > >> >>>>>> > > > >> >>>>> > > > >> >>>> > > > >> >>> > > > >> >> > > > >> > > > >> > > > > > >
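One more note on the process_info one-liners quoted above: sorting the result and keeping only the biggest entries is usually easier to read than the full list. Below is a sketch, run from /opt/couchdb/bin/remsh, with the field and the count as arbitrary examples; keep in mind that total_heap_size, heap_size and stack_size are reported in words, not bytes (multiply by erlang:system_info(wordsize), which is 8 on a 64-bit VM).

--8<--
%% Top N processes by a given process_info field, largest first.
Top = fun(Field, N) ->
    Info = [{Value, Pid, process_info(Pid, registered_name)}
            || Pid <- processes(),
               {_, Value} <- [process_info(Pid, Field)]],  %% skips processes that died meanwhile
    lists:sublist(lists:reverse(lists:sort(Info)), N)
end.
Top(total_heap_size, 10).
%% Also useful: Top(message_queue_len, 10), Top(reductions, 10).
-->8--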