Interesting, I read somewhere that having a view per ddoc is more efficient. Thanks for clarifying!

On Wed, Dec 6, 2017 at 1:31 AM Jan Lehnardt <[email protected]> wrote:
On 5. Dec 2017, at 21:13, Geoffrey Cox <[email protected]> wrote:

Hey Adam,

Attached is my local.ini and the design doc with the view JS. Please see my responses below. Thanks for the help!

On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <[email protected]> wrote:

Hi Geoff, a couple of additional questions:

1) Are you making these view requests with stale=ok or stale=update_after?

GC: I am not using the stale parameter.

2) What are you using for N and Q in the [cluster] configuration settings?

GC: As per the attached local.ini, I specified n=2 and am using the default q=8.

3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?

GC: As per the attached local.ini, I have *not* specified this option.

4) Do you have any other JS code besides the view definitions?

GC: By JS code I think you mean JS code "in" CouchDB, and if that is the case then my only JS code is very simple views like those in the attached views.json. (I know that I really need to break out the views so that there is one view per ddoc, but I haven't quite gotten around to refactoring this, and I don't believe it is causing the CPU usage.)

Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all the views an app needs into a single ddoc.

For each ddoc, all docs in a database have to be serialised and shipped to couchjs, and the results shipped back; that is the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that work more efficient.

Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.

If you're not using stale requests, I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.

Adam

On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <[email protected]> wrote:

Thanks for the responses, any other thoughts?

FYI: I'm trying to work on a very focused test case that I can share with the dev team, but it is taking a little while to narrow down the exact cause.

On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]> wrote:

Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region deployments that you need to avoid.

B.

On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:

Heya Geoff,

a CouchDB cluster is designed to run within a single data center, i.e. with local-area networking latencies. A cluster spread across AWS Availability Zones won't work well, as you are seeing. If you want CouchDBs in both AZs, use regular replication and keep each cluster local to its AZ.

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
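A rough sketch of the layout Jan suggests just above (one cluster per AZ, linked by regular replication). This is only an illustration: the hostnames, database name, credentials and document _id are placeholders, not taken from this thread.

```js
// Hypothetical sketch: continuous replication between one CouchDB cluster per
// AZ, instead of stretching a single cluster across AZs.
const replicationDoc = {
  _id: "az1-to-az2-mydb",                       // placeholder name
  source: "https://couch-az1.example.com/mydb", // cluster in AZ 1
  target: "https://couch-az2.example.com/mydb", // cluster in AZ 2
  continuous: true                              // keep the two databases in sync
};

// Creating the document in the _replicator database starts the replication;
// deleting it stops it. Credentials are placeholders.
const auth = "Basic " + Buffer.from("admin:secret").toString("base64");
fetch("https://couch-az1.example.com/_replicator", {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: auth },
  body: JSON.stringify(replicationDoc)
}).then(res => res.json()).then(console.log);
```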
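Similarly, Jan's earlier point about grouping views can be illustrated with a small sketch of a single design document holding more than one view. The ddoc name, view names and map functions are hypothetical, not the ones in the attached views.json.

```js
// Hypothetical sketch: several views grouped in one design document.
const designDoc = {
  _id: "_design/app", // placeholder ddoc name
  views: {
    notifications_by_user: {
      map: "function (doc) { if (doc.type === 'notification') { emit(doc.userId, null); } }"
    },
    changes_by_date: {
      map: "function (doc) { if (doc.updatedAt) { emit(doc.updatedAt, null); } }"
    }
  }
};
// With this layout, each document in the database is serialised and sent to
// couchjs once for this ddoc, and both map functions are evaluated in that
// single pass; one ddoc per view would repeat that shipping work per view.
```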
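And Adam's point about shard selection refers to the stale query parameter on view requests. A minimal sketch of the two request styles, with placeholder database, ddoc and view names (auth omitted):

```js
// Hypothetical illustration of the two request styles Adam distinguishes.
const view = "http://localhost:5984/mydb/_design/app/_view/notifications_by_user";

// Default request: the index is brought up to date before responding, and any
// copy of each shard may be chosen to answer.
fetch(view).then(res => res.json()).then(console.log);

// stale=ok (or stale=update_after): the currently built index is returned, and
// the cluster prefers the "primary" shard copies, which is where the "zones"
// configuration Adam asks about comes into play.
fetch(`${view}?stale=ok`).then(res => res.json()).then(console.log);
```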
On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:

Hi,

I've spent days using trial and error to try and figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm getting stuck.

Here's my configuration:

1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
2. A haproxy server is load balancing traffic to the nodes using round robin

The problem:

1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. The issue is that on a single node the couchjs processes stack up and then start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart this node.
2. It is important to note that both nodes have couchjs processes, but only a single node has couchjs processes that are using 100% CPU.
3. I've even resorted to setting `os_process_limit = 10` and this just results in each couchjs process taking over 10% each! In other words, the couchjs processes eat up all the CPU no matter how many couchjs processes there are!
4. The CPU usage will eventually clear after all the processing is done, but as soon as there is more to process the workhorse node gets bogged down again.
5. If I restart the workhorse node, the other node then becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
6. The problem is that this design is not scalable, as only one node can be the workhorse node at any given time. Moreover, this causes specific instances to run out of CPU credits. Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes gets bogged down. Is it possible that the problem is that I have 2 nodes and really need at least 3? (I know a 2-node cluster is not very typical.)

Things I've checked:

1. Ensured that the load balancing is working, i.e. haproxy is indeed distributing traffic accordingly.
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force a more conservative usage of couchjs processes (see the config sketch after this list), but instead the couchjs processes just consume all the CPU.
3. I've tried simulating the issue locally with VMs and I cannot duplicate any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small and this somehow keeps the CPU usage to a minimum.
4. I've tried isolating the issue by creating short code snippets that intentionally try to spawn a lot of couchjs processes; they are spawned but don't consume 100% CPU.
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this doesn't seem to change anything.
6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:

   [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79> 4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
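The os_process_limit / os_process_soft_limit settings Geoff mentions in the list above belong to the [query_server_config] section of local.ini. As a hedged sketch only: assuming a CouchDB 2.x release that exposes the per-node configuration endpoint (/_node/{node-name}/_config), they could also be adjusted at runtime like this (node name, credentials and values are placeholders):

```js
// Hypothetical sketch of changing the couchjs process limits at runtime via
// the per-node config API. The same keys can be set in local.ini under
// [query_server_config].
const node = "couchdb@127.0.0.1"; // placeholder node name
const base = `http://localhost:5984/_node/${node}/_config/query_server_config`;
const auth = "Basic " + Buffer.from("admin:secret").toString("base64"); // placeholder credentials

// Each PUT takes the new value as a JSON-encoded string and returns the old one.
fetch(`${base}/os_process_limit`, { method: "PUT", headers: { Authorization: auth }, body: JSON.stringify("10") })
  .then(() => fetch(`${base}/os_process_soft_limit`, { method: "PUT", headers: { Authorization: auth }, body: JSON.stringify("5") }))
  .then(res => res.json())
  .then(previous => console.log("previous os_process_soft_limit:", previous));
```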
Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node?

Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit, as I think there is a root problem that needs to be addressed.)

I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load, or whether there is a bug in CouchDB.

Thank you for any help you can provide!

Geoff

<views.json>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
