Hey Adam,

Attached is my local.ini and the design doc with the view JS.
Please see my responses below. Thanks for the help!

On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <[email protected]> wrote:

> Hi Geoff, a couple of additional questions:
>
> 1) Are you making these view requests with stale=ok or stale=update_after?
>
> GC: I am not using the stale parameter.
>
> 2) What are you using for N and Q in the [cluster] configuration settings?
>
> GC: As per the attached local.ini, I specified n=2 and am using the
> default q=8.
>
> 3) Did you take advantage of the (barely-documented) "zones" attribute
> when defining cluster members?
>
> GC: As per the attached local.ini, I have *not* specified this option.
>
> 4) Do you have any other JS code besides the view definitions?
>
> GC: When you refer to JS code, I think you mean JS code "in" CouchDB;
> if that is the case, then my only JS code is very simple views like
> those in the attached views.json. (I know that I really need to break
> out the views so that there is one view per design doc, but I haven't
> quite gotten around to refactoring this, and I don't believe it is
> causing the CPU usage.)
>
> Regarding #1, the cluster will actually select shards differently
> depending on the use of those query parameters. When your request
> stipulates that you're OK with stale results, the cluster *will* select
> a "primary" copy in order to improve the consistency of repeated
> requests to the same view. The algorithm for choosing those primary
> copies is somewhat subtle, hence my question #3.
>
> If you're not using stale requests, I have a much harder time explaining
> why the 100% CPU issue would migrate from node to node like that.
>
> Adam
>
> > On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <[email protected]> wrote:
> >
> > Thanks for the responses, any other thoughts?
> >
> > FYI: I'm trying to work on a very focused test case that I can share
> > with the dev team, but it is taking a little while to narrow down the
> > exact cause.
> >
> > On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]>
> > wrote:
> >
> >> Sorry to contradict you, but Cloudant deploys clusters across Amazon
> >> AZs as standard. It's fast enough. It's cross-region that you need
> >> to avoid.
> >>
> >> B.
> >>
> >>> On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:
> >>>
> >>> Heya Geoff,
> >>>
> >>> a CouchDB cluster is designed to run in the same data center, i.e.
> >>> with local-area networking latencies. A cluster across AWS
> >>> Availability Zones won't work, as you are seeing. If you want
> >>> CouchDBs in both AZs, use regular replication and keep the clusters
> >>> local to the AZ.
> >>>
> >>> Best
> >>> Jan
> >>> --
> >>>
> >>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I've spent days of trial and error trying to figure out why I am
> >>>> getting a very high CPU load on only a single node in my cluster.
> >>>> I'm hoping someone has an idea of what is going on, as I'm getting
> >>>> stuck.
> >>>>
> >>>> Here's my configuration:
> >>>>
> >>>> 1. A 2-node cluster:
> >>>>    1. Each node is located in a different AWS availability zone
> >>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
> >>>> 2. A haproxy server is load balancing traffic to the nodes using
> >>>>    round robin
> >>>>
> >>>> The problem:
> >>>>
> >>>> 1. After users make changes via PouchDB, a backend runs a number of
> >>>>    routines that use views to calculate notifications. The issue is
> >>>>    that on a single node, the couchjs processes stack up and then
> >>>>    start to consume nearly all the available CPU.
> >>>>    This server then becomes the "workhorse" that always does *all*
> >>>>    the heavy-duty couchjs processing until I restart this node.
> >>>> 2. It is important to note that both nodes have couchjs processes,
> >>>>    but only a single node has couchjs processes that are using
> >>>>    100% CPU.
> >>>> 3. I've even resorted to setting `os_process_limit = 10`, and this
> >>>>    just results in each couchjs process taking over 10% each! In
> >>>>    other words, the couchjs processes eat up all the CPU no matter
> >>>>    how many couchjs processes there are!
> >>>> 4. The CPU usage will eventually clear after all the processing is
> >>>>    done, but then as soon as there is more to process, the
> >>>>    workhorse node will get bogged down again.
> >>>> 5. If I restart the workhorse node, the other node then becomes the
> >>>>    workhorse node. This is the only way to get the couchjs
> >>>>    processes to "move" to another node.
> >>>> 6. The problem is that this design is not scalable, as only one
> >>>>    node can be the workhorse node at any given time. Moreover, this
> >>>>    causes specific instances to run out of CPU credits. Shouldn't
> >>>>    the couchjs processes be spread out over all my nodes? From what
> >>>>    I can tell, if I add more nodes I'm still going to have the
> >>>>    issue where only one of the nodes is getting bogged down. Is it
> >>>>    possible that the problem is that I have 2 nodes and really need
> >>>>    at least 3? (I know a 2-node cluster is not very typical.)
> >>>>
> >>>> Things I've checked:
> >>>>
> >>>> 1. Ensured that the load balancing is working, i.e. haproxy is
> >>>>    indeed distributing traffic accordingly.
> >>>> 2. I've tried setting `os_process_limit = 10` and
> >>>>    `os_process_soft_limit = 5` to see if I could force more
> >>>>    conservative usage of couchjs processes, but instead the couchjs
> >>>>    processes just consume all the CPU.
> >>>> 3. I've tried simulating the issue locally with VMs and I cannot
> >>>>    duplicate any such load. My guess is that this is because the
> >>>>    nodes are located on the same box, so the hop distance between
> >>>>    nodes is very small, and this somehow keeps the CPU usage to a
> >>>>    minimum.
> >>>> 4. I've tried isolating the issue by creating short code snippets
> >>>>    that intentionally try to spawn a lot of couchjs processes; they
> >>>>    are spawned but don't consume 100% CPU.
> >>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and
> >>>>    this doesn't seem to change anything.
> >>>> 6. The only error entries in my CouchDB logs are like the
> >>>>    following, and I don't believe they are related to my issue:
> >>>>
> >>>>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
> >>>>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>)
> >>>>    mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not
> >>>>    allowed to access this db.">>}
> >>>>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
> >>>>    {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
> >>>>    {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> >>>>
> >>>> Does CouchDB have some logic built in that spawns a number of
> >>>> couchjs processes on a "primary" node? Will future view processing
> >>>> then always be routed to this "primary" node?
> >>>>
> >>>> Is there a way to better distribute these heavy-duty couchjs
> >>>> processes? Is it possible to limit their CPU consumption?
> >>>> (I'm hesitant to start down the path of using something like
> >>>> cpulimit, as I think there is a root problem that needs to be
> >>>> addressed.)
> >>>>
> >>>> I'm running out of ideas and hope that someone has some notion of
> >>>> what is causing this bizarre load, or whether there is a bug in
> >>>> CouchDB.
> >>>>
> >>>> Thank you for any help you can provide!
> >>>>
> >>>> Geoff
> >>>
> >>> --
> >>> Professional Support for Apache CouchDB:
> >>> https://neighbourhood.ie/couchdb-support/
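
For concreteness, the stale option in question #1 is just a query
parameter on the view request. A minimal sketch, assuming a hypothetical
database "mydb" with a design document "notifications" and a view
"by_user" (none of these names are from the actual attachment):

    # Default behavior: the view index is brought up to date before the
    # response is returned, and any shard copy may serve the request.
    curl 'http://localhost:5984/mydb/_design/notifications/_view/by_user'

    # stale=ok: answer immediately from the existing index; per Adam's
    # explanation above, the cluster then prefers a stable set of
    # "primary" shard copies for consistent repeated requests.
    curl 'http://localhost:5984/mydb/_design/notifications/_view/by_user?stale=ok'

    # stale=update_after: answer from the existing index, then kick off
    # an index update in the background.
    curl 'http://localhost:5984/mydb/_design/notifications/_view/by_user?stale=update_after'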
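
Likewise, the n, q, and couchjs limits discussed in questions #2 and the
os_process experiments all live in local.ini. A sketch reflecting only
the values stated in this thread (everything else in the file is omitted):

    [cluster]
    ; two replicas of every shard, one per node in this 2-node cluster
    n = 2
    ; q is left unset, so each database gets the default of 8 shards
    ;q = 8

    [query_server_config]
    ; hard cap on concurrent couchjs processes (the experiment above)
    os_process_limit = 10
    ; idle couchjs processes beyond this count are reclaimed
    os_process_soft_limit = 5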
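
For question #3, the "zones" feature is configured in two places: a
"zone" attribute on each node document in the _nodes database, plus a
placement rule in the [cluster] section. A hedged sketch (node names,
zone labels, and the _rev placeholder are illustrative; port 5986 is the
node-local interface in CouchDB 2.x):

    # Fetch the node document to learn its current _rev, then tag the
    # node with a zone:
    curl 'http://localhost:5986/_nodes/couchdb@node1.example.com'
    curl -X PUT 'http://localhost:5986/_nodes/couchdb@node1.example.com' \
         -d '{"_id": "couchdb@node1.example.com", "_rev": "<current rev>", "zone": "us-east-1a"}'

And in local.ini:

    [cluster]
    ; keep one replica of each shard in each zone
    placement = us-east-1a:1,us-east-1b:1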
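
Finally, for question #4: a "very simple view" of the kind described
above is just a map-only function in a design document, roughly like
this (doc fields, view, and design doc names are illustrative, not taken
from the real views.json):

    {
      "_id": "_design/notifications",
      "views": {
        "by_user": {
          "map": "function (doc) { if (doc.type === 'notification') { emit(doc.userId, null); } }"
        }
      }
    }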
views.json
Description: application/json
