Thanks for the responses. Any other thoughts? FYI: I'm working on a very
focused test case that I can share with the dev team, but it is taking a
little while to narrow down the exact cause.
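To help with reproduction in the meantime: the process limits discussed
below live in the [query_server_config] section of local.ini in CouchDB
2.x. Roughly what I have been testing with (the values are just the ones
from my experiments):

    [query_server_config]
    ; hard cap on the number of couchjs view server processes
    os_process_limit = 10
    ; idle processes above this count are eligible to be reaped
    os_process_soft_limit = 5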
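And for completeness, the haproxy layer is essentially the stock
round-robin setup, nothing exotic. A trimmed sketch of the backend (the
IPs are placeholders, not my real addresses):

    backend couchdb_cluster
        balance roundrobin
        option httpchk GET /_up
        server couch1 10.0.1.10:5984 check
        server couch2 10.0.2.10:5984 check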
On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]> wrote:

> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs
> as standard. It's fast enough. It's cross-region that you need to avoid.
>
> B.
>
> > On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:
> >
> > Heya Geoff,
> >
> > a CouchDB cluster is designed to run in the same data center / with
> > local-area networking latencies. A cluster across AWS Availability
> > Zones won't work, as you are seeing. If you want CouchDBs in both AZs,
> > use regular replication and keep the clusters local to the AZ.
> >
> > Best
> > Jan
> > --
> >
> >> On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I've spent days using trial and error to try to figure out why I am
> >> getting a very high CPU load on only a single node in my cluster. I'm
> >> hoping someone has an idea of what is going on, as I'm getting stuck.
> >>
> >> Here's my configuration:
> >>
> >> 1. A 2-node cluster:
> >>    1. Each node is located in a different AWS availability zone
> >>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
> >> 2. A haproxy server is load balancing traffic to the nodes using
> >>    round robin
> >>
> >> The problem:
> >>
> >> 1. After users make changes via PouchDB, a backend runs a number of
> >>    routines that use views to calculate notifications. The issue is
> >>    that on a single node, the couchjs processes stack up and then
> >>    start to consume nearly all the available CPU. This server then
> >>    becomes the "workhorse" that always does *all* the heavy-duty
> >>    couchjs processing until I restart this node.
> >> 2. It is important to note that both nodes have couchjs processes,
> >>    but only a single node has couchjs processes that are using 100%
> >>    CPU.
> >> 3. I've even resorted to setting `os_process_limit = 10` and this
> >>    just results in each couchjs process taking over 10% each! In
> >>    other words, the couchjs processes eat up all the CPU no matter
> >>    how many couchjs processes there are!
> >> 4. The CPU usage will eventually clear after all the processing is
> >>    done, but as soon as there is more to process the workhorse node
> >>    gets bogged down again.
> >> 5. If I restart the workhorse node, the other node then becomes the
> >>    workhorse. This is the only way to get the couchjs processes to
> >>    "move" to another node.
> >> 6. The problem is that this design is not scalable, as only one node
> >>    can be the workhorse at any given time. Moreover, this causes
> >>    specific instances to run out of CPU credits. Shouldn't the
> >>    couchjs processes be spread out over all my nodes? From what I can
> >>    tell, if I add more nodes I'm still going to have the issue where
> >>    only one node gets bogged down. Is it possible that the problem is
> >>    that I have 2 nodes and really need at least 3? (I know a 2-node
> >>    cluster is not very typical.)
> >>
> >> Things I've checked:
> >>
> >> 1. Ensured that the load balancing is working, i.e. haproxy is indeed
> >>    distributing traffic accordingly.
> >> 2. I've tried setting `os_process_limit = 10` and
> >>    `os_process_soft_limit = 5` to see if I could force more
> >>    conservative usage of couchjs processes, but instead the couchjs
> >>    processes just consume all the CPU.
> >> 3. I've tried simulating the issue locally with VMs and I cannot
> >>    duplicate any such load.
> >>    My guess is that this is because the nodes are located on the
> >>    same box, so the hop distance between nodes is very small and this
> >>    somehow keeps the CPU usage to a minimum.
> >> 4. I've tried isolating the issue with short code snippets that
> >>    intentionally spawn a lot of couchjs processes; they are spawned
> >>    but don't consume 100% CPU.
> >> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
> >>    doesn't seem to change anything.
> >> 6. The only error entries in my CouchDB logs are like the following,
> >>    and I don't believe they are related to my issue:
> >>
> >>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
> >>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa:
> >>    fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to
> >>    access this db.">>}
> >>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> >>
> >> Does CouchDB have some logic built in that spawns a number of couchjs
> >> processes on a "primary" node? Will future view processing then
> >> always be routed to this "primary" node?
> >>
> >> Is there a way to better distribute these heavy-duty couchjs
> >> processes? Is it possible to limit their CPU consumption? (I'm
> >> hesitant to start down the path of using something like cpulimit, as
> >> I think there is a root problem that needs to be addressed.)
> >>
> >> I'm running out of ideas and hope that someone has some notion of
> >> what is causing this bizarre load, or whether there is a bug in
> >> CouchDB.
> >>
> >> Thank you for any help you can provide!
> >>
> >> Geoff
> >
> > --
> > Professional Support for Apache CouchDB:
> > https://neighbourhood.ie/couchdb-support/
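P.S. Jan: if I do end up splitting this into one cluster per AZ, am I
right that the glue is just continuous replication per database between
the two? I'm picturing something like the following against the standard
_replicator database (hostnames and credentials are placeholders), plus a
mirror-image doc for the other direction:

    curl -X PUT http://admin:pass@localhost:5984/_replicator/az1-to-az2-mydb \
      -H 'Content-Type: application/json' \
      -d '{
        "source": "http://admin:pass@couch-az1.example.com:5984/mydb",
        "target": "http://admin:pass@couch-az2.example.com:5984/mydb",
        "continuous": true
      }'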
