> On 4. Dec 2017, at 22:08, Sinan Gabel <[email protected]> wrote:
> 
> Hi,
> 
> I am also experiencing 100% CPU usage. I'm not sure why; it happens
> suddenly and continues until CouchDB is restarted.
> The CouchDB setup is also a single node (n=3, q=8) running
> v2.1.0-6c4def6 on Ubuntu 16.04 with 2 vCPUs and 4.5 GB of memory.

We need a lot more info about your setup, configuration, log files, etc.
to be able to comment.

Thanks!
Jan
--

> 
> On 4 December 2017 at 19:46, Geoffrey Cox <[email protected]> wrote:
> 
>> Hi,
>> 
>> I've spent days of trial and error trying to figure out why I am getting
>> a very high CPU load on only a single node in my cluster. I'm hoping
>> someone has an idea of what is going on, as I'm getting stuck.
>> 
>> Here's my configuration:
>> 
>>   1. 2 node cluster:
>>      1. Each node is located in a different AWS availability zone
>>      2. Each node is a t2 medium instance (2 CPU cores, 4 GB Mem)
>>   2. A haproxy server is load balancing traffic to the nodes using round
>>   robin (a rough sketch of the relevant haproxy config follows this list)
>> 
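>> The load-balancing part of the haproxy config looks roughly like this
>> (a sketch from memory; hostnames and ports are placeholders):
>> 
>>      frontend couchdb_in
>>          mode http
>>          bind *:5984
>>          default_backend couchdb_nodes
>> 
>>      backend couchdb_nodes
>>          mode http
>>          balance roundrobin
>>          # placeholder hostnames for the two AWS nodes
>>          server couch1 couch1.internal.example:5984 check
>>          server couch2 couch2.internal.example:5984 check
>> 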
>> The problem:
>> 
>>   1. After users make changes via PouchDB, a backend runs a number of
>>   routines that use views to calculate notifications. The issue is that
>> on a
>>   single node, the couchjs processes stack up and then start to consume
>>   nearly all the available CPU. This server then becomes the "workhorse"
>> that
>>   always does *all* the heavy duty couchjs processing until I restart this
>>   node.
>>   2. It is important to note that both nodes have couchjs processes, but
>>   only one node's couchjs processes are using 100% CPU.
>>   3. I've even resorted to setting `os_process_limit = 10` (see the
>>   local.ini sketch after this list) and this just results in each couchjs
>>   process taking over 10%! In other words, the couchjs processes just eat
>>   up all the CPU no matter how many of them there are!
>>   4. The CPU usage will eventually clear after all the processing is done,
>>   but then as soon as there is more to process the workhorse node will get
>>   bogged down again.
>>   5. If I restart the workhorse node, the other node then becomes the
>>   workhorse node. This is the only way to get the couchjs processes to
>> "move"
>>   to another node.
>>   6. The problem is that this design is not scalable, as only one node can
>>   be the workhorse node at any given time. Moreover, this causes the
>>   workhorse instance to run out of CPU credits. Shouldn't the couchjs
>>   processes be spread out over all my nodes? From what I can tell, if I add
>>   more nodes I'm still going to have the issue where only one of them gets
>>   bogged down. Is it possible that the problem is that I have 2 nodes and
>>   really need at least 3? (I know a 2-node cluster is not very typical)
>> 
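>> For reference, the limit from item 3 above was set in local.ini roughly
>> like this (a sketch; the soft limit is from a later experiment described
>> below):
>> 
>>      [query_server_config]
>>      ; experiments to cap couchjs processes, not a recommendation
>>      os_process_limit = 10
>>      os_process_soft_limit = 5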
>> 
>> Things I've checked:
>> 
>>   1. Ensured that the load balancing is working, i.e. haproxy is indeed
>>   distributing traffic accordingly
>>   2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit
>>   = 5` to see if I could force more conservative usage of couchjs
>>   processes, but the couchjs processes still consume all the available
>>   CPU.
>>   3. I've tried simulating the issue locally with VMs and I cannot
>>   duplicate any such load. My guess is that this is because the nodes are
>>   located on the same box so hop distance between nodes is very small and
>>   this somehow keeps the CPU usage to a minimum
>>   4. I've tried isolating the issue with short code snippets that
>>   intentionally spawn a lot of couchjs processes (roughly like the sketch
>>   after this list); the processes are spawned but don't consume 100% CPU.
>>   5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>>   doesn't seem to change anything
>>   6. The only error entries in my CouchDB logs are like the following and
>>   I don't believe they are related to my issue:
>>      1.
>> 
>>      [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
>>      4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa:
>>      fabric_rpc:open_shard/2
>>      throw:{forbidden,<<"You are not allowed to access this db.">>}
>>      [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
>>       {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
>>       {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>> 
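>> The snippets mentioned in item 4 were along these lines (a rough sketch;
>> database, design doc and view names are made up, and auth is omitted):
>> 
>>      # Write a batch of docs, then hammer a view so the indexer has to
>>      # spawn couchjs processes. Names and the URL are placeholders.
>>      import json
>>      import threading
>>      import urllib.request
>> 
>>      BASE = "http://localhost:5984"
>>      DB = "/mydb"
>>      VIEW = DB + "/_design/notifications/_view/by_user"
>> 
>>      def post_docs(n=500):
>>          docs = {"docs": [{"type": "note", "seq": i} for i in range(n)]}
>>          req = urllib.request.Request(
>>              BASE + DB + "/_bulk_docs",
>>              data=json.dumps(docs).encode(),
>>              headers={"Content-Type": "application/json"},
>>          )
>>          with urllib.request.urlopen(req) as resp:
>>              resp.read()
>> 
>>      def query_view():
>>          # A default view query waits for the index to catch up,
>>          # which is what drives the couchjs usage.
>>          with urllib.request.urlopen(BASE + VIEW) as resp:
>>              resp.read()
>> 
>>      for _ in range(5):
>>          post_docs()
>>          threads = [threading.Thread(target=query_view) for _ in range(10)]
>>          for t in threads:
>>              t.start()
>>          for t in threads:
>>              t.join()
>> 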
>> Does CouchDB have some logic built in that spawns a number of couchjs
>> processes on a "primary" node? Will future view processing then always be
>> routed to this "primary" node?
>> 
>> Is there a way to better distribute these heavy duty couchjs processes? Is
>> it possible to limit their CPU consumption? (I'm hesitant to start down the
>> path of using something like cpulimit as I think there is a root problem
>> that needs to be addressed)
>> 
>> I'm running out of ideas and hope that someone has some notion of what is
>> causing this bizarre load or if there is a bug in CouchDB.
>> 
>> Thank you for any help you can provide!
>> 
>> Geoff
>> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
