Oops, sorry, I got AZs and regions mixed up!

Best
Jan
--

> On 5. Dec 2017, at 13:43, Robert Samuel Newson <[email protected]> wrote:
> 
> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as 
> standard. It's fast enough. It's cross-region that you need to avoid.
> 
> B.
> 
>> On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:
>> 
>> Heya Geoff,
>> 
>> a CouchDB cluster is designed to run in the same data center, with local-area 
>> networking latencies. A cluster across AWS Availability Zones won't work, as 
>> you're seeing. If you want CouchDBs in both AZs, use regular replication and 
>> keep each cluster local to its AZ.
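>> 
>> For example, a minimal sketch (hostnames, credentials and the "mydb" database 
>> name are placeholders for your setup) that creates a continuous replication 
>> from the AZ-1 cluster to the AZ-2 cluster via a _replicator document:
>> 
>>     curl -X PUT http://admin:pass@couch-az1:5984/_replicator/mydb-to-az2 \
>>          -H "Content-Type: application/json" \
>>          -d '{"source": "http://admin:pass@couch-az1:5984/mydb",
>>               "target": "http://admin:pass@couch-az2:5984/mydb",
>>               "continuous": true}'
>> 
>> Create the mirror-image replication on the AZ-2 cluster if changes need to 
>> flow both ways.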
>> 
>> Best
>> Jan
>> --
>> 
>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I've spent days of trial and error trying to figure out why I am
>>> getting a very high CPU load on only a single node in my cluster. I'm
>>> hoping someone has an idea of what is going on, as I'm stuck.
>>> 
>>> Here's my configuration:
>>> 
>>> 1. 2 node cluster:
>>>    1. Each node is located in a different AWS availability zone
>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB of memory)
>>> 2. An HAProxy server load-balances traffic to the nodes using round
>>> robin (a sketch of the relevant config is below)
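>>> 
>>> Roughly like this (node addresses are placeholders; the real config has more
>>> in it, but this is the load-balancing part):
>>> 
>>>     frontend couchdb_in
>>>         bind *:5984
>>>         mode http
>>>         default_backend couchdb_nodes
>>> 
>>>     backend couchdb_nodes
>>>         mode http
>>>         balance roundrobin
>>>         server couch-az1 10.0.1.10:5984 check
>>>         server couch-az2 10.0.2.10:5984 check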
>>> 
>>> The problem:
>>> 
>>> 1. After users make changes via PouchDB, a backend runs a number of
>>> routines that use views to calculate notifications. The issue is that on a
>>> single node, the couchjs processes stack up and then start to consume
>>> nearly all the available CPU. This server then becomes the "workhorse" that
>>> always does *all* the heavy duty couchjs processing until I restart this
>>> node.
>>> 2. It is important to note that both nodes have couchjs processes, but
>>> only a single node has couchjs processes that are using 100%
>>> CPU.
>>> 3. I've even resorted to setting `os_process_limit = 10`, and this just
>>> results in each couchjs process taking over 10% of the CPU! In other words,
>>> the couchjs processes eat up all the CPU no matter how many of them
>>> there are!
>>> 4. The CPU usage will eventually clear after all the processing is done,
>>> but then as soon as there is more to process the workhorse node will get
>>> bogged down again.
>>> 5. If I restart the workhorse node, the other node then becomes the
>>> workhorse node. This is the only way to get the couchjs processes to "move"
>>> to another node.
>>> 6. The problem is that this design is not scalable, as only one node can
>>> be the workhorse node at any given time. Moreover, this causes specific
>>> instances to run out of CPU credits. Shouldn't the couchjs processes be
>>> spread out over all my nodes? From what I can tell, if I add more nodes I'm
>>> still going to have the issue where only one of the nodes is getting bogged
>>> down. Is it possible that the problem is that I have 2 nodes and really I
>>> need at least 3 nodes? (I know a 2-node cluster is not very typical)
>>> 
>>> 
>>> Things I've checked:
>>> 
>>> 1. Ensured that the load balancing is working, i.e. haproxy is indeed
>>> distributing traffic accordingly
>>> 2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit
>>> = 5` to see if I could force more conservative use of couchjs
>>> processes, but the couchjs processes still consume all the available CPU
>>> (the exact config snippet is shown after this list).
>>> 3. I've tried simulating the issue locally with VMs and I cannot
>>> reproduce any such load. My guess is that this is because the nodes are
>>> located on the same box, so the hop distance between nodes is very small
>>> and this somehow keeps the CPU usage to a minimum.
>>> 4. I've tried isolating the issue by creating short code snippets that
>>> intentionally try to spawn a lot of couchjs processes; they are spawned
>>> but don't consume 100% CPU.
>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>>> doesn't seem to change anything.
>>> 6. The only error entries in my CouchDB logs are like the following and
>>> I don't believe they are related to my issue:
>>>    1.
>>> 
>>>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
>>>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa:
>>>    fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access
>>>    this db.">>}
>>>    
>>> [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
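>>> 
>>> For reference, the limits from point 2 above are set in the config's
>>> [query_server_config] section (I put them in local.ini on each node):
>>> 
>>>     [query_server_config]
>>>     os_process_limit = 10
>>>     os_process_soft_limit = 5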
>>> 
>>> Does CouchDB have some logic built in that spawns a number of couchjs
>>> processes on a "primary" node? Will future view processing then always be
>>> routed to this "primary" node?
>>> 
>>> Is there a way to better distribute these heavy duty couchjs processes? Is
>>> it possible to limit their CPU consumption? (I'm hesitant to start down the
>>> path of using something like cpulimit as I think there is a root problem
>>> that needs to be addressed)
>>> 
>>> I'm running out of ideas and hope that someone has some notion of what is
>>> causing this bizarre load or if there is a bug in CouchDB.
>>> 
>>> Thank you for any help you can provide!
>>> 
>>> Geoff
>> 
>> -- 
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>> 
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
