> On 4. Dec 2017, at 22:08, Sinan Gabel <[email protected]> wrote:
>
> Hi,
>
> I am also experiencing 100% CPU usage, not sure why, it happens suddenly
> and continues until couchdb is restarted.
> CouchDB version being used is also single-node (n:3, q:8) and v.
> 2.1.0-6c4def6 on Ubuntu 16.04 2 vCPU's and 4.5 GB memory.

We need a lot more info about your setup, configuration, log files etc. to comment.

Thanks!
Jan
--

>
> On 4 December 2017 at 19:46, Geoffrey Cox <[email protected]> wrote:
>
>> Hi,
>>
>> I've spent days using trial and error to try and figure out why I am
>> getting a very high CPU load on only a single node in my cluster. I'm
>> hoping someone has an idea of what is going on as I'm getting stuck.
>>
>> Here's my configuration:
>>
>> 1. 2 node cluster:
>>    1. Each node is located in a different AWS availability zone
>>    2. Each node is a t2 medium instance (2 CPU cores, 4 GB Mem)
>> 2. A haproxy server is load balancing traffic to the nodes using round
>>    robin
>>
>> The problem:
>>
>> 1. After users make changes via PouchDB, a backend runs a number of
>>    routines that use views to calculate notifications. The issue is that
>>    on a single node, the couchjs processes stack up and then start to
>>    consume nearly all the available CPU. This server then becomes the
>>    "workhorse" that always does *all* the heavy duty couchjs processing
>>    until I restart this node.
>> 2. It is important to note that both nodes have couchjs processes, but
>>    it is only a single node that has the couchjs processes that are using
>>    100% CPU
>> 3. I've even resorted to setting `os_process_limit = 10` and this just
>>    results in each couchjs process taking over 10% each! In other words,
>>    the couchjs processes just eat up all the CPU no matter how many
>>    couchjs process there are!
>> 4. The CPU usage will eventually clear after all the processing is done,
>>    but then as soon as there is more to process the workhorse node will
>>    get bogged down again.
>> 5. If I restart the workhorse node, the other node then becomes the
>>    workhorse node. This is the only way to get the couchjs processes to
>>    "move" to another node.
>> 6. The problem is that this design is not scalable as only one node can
>>    be the workhorse node at any given time.
>>    Moreover this causes specific instances to run out of CPU credits.
>>    Shouldn't the couchjs processes be spread out over all my nodes? From
>>    what I can tell, if I add more nodes I'm still going to have the issue
>>    where only one of the nodes is getting bogged down. Is it possible
>>    that the problem is that I have 2 nodes and really I need at least 3
>>    nodes? (I know a 2-node cluster is not very typical)
>>
>>
>> Things I've checked:
>>
>> 1. Ensured that the load balancing is working, i.e. haproxy is indeed
>>    distributing traffic accordingly
>> 2. I've tried setting `os_process_limit = 10` and
>>    `os_process_soft_limit = 5` to see if I could force a more
>>    conservative usage of couchjs processes, but instead the couchjs
>>    processes just consume all the CPU load.
>> 3. I've tried simulating the issue locally with VMs and I cannot
>>    duplicate any such load. My guess is that this is because the nodes
>>    are located on the same box so hop distance between nodes is very
>>    small and this somehow keeps the CPU usage to a minimum
>> 4. I've tried isolating the issue by creating short code snippets that
>>    intentionally try to spawn a lot of couchjs processes and they are
>>    spawned but don't consume 100% CPU
>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>>    doesn't seem to change anything
>> 6. The only error entries in my CouchDB logs are like the following and
>>    I don't believe they are related to my issue:
>>    1.
>>
>>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
>>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>)
>>    mfa: fabric_rpc:open_shard/2
>>    throw:{forbidden,<<"You are not allowed to access this db.">>}
>>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
>>    {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
>>    {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>
>> Does CouchDB have some logic built in that spawns a number of couchjs
>> processes on a "primary" node? Will future view processing then always be
>> routed to this "primary" node?
>>
>> Is there a way to better distribute these heavy duty couchjs processes? Is
>> it possible to limit their CPU consumption? (I'm hesitant to start down the
>> path of using something like cpulimit as I think there is a root problem
>> that needs to be addressed)
>>
>> I'm running out of ideas and hope that someone has some notion of what is
>> causing this bizarre load or if there is a bug in CouchDB.
>>
>> Thank you for any help you can provide!
>>
>> Geoff
>>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
