Thanks for the responses. Any other thoughts? FYI: I'm working on a very
focused test case that I can share with the dev team, but it is taking a
little while to narrow down the exact cause.
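To help with reproduction in the meantime: the process limits discussed
below live in the [query_server_config] section of local.ini in CouchDB
2.x. Roughly what I have been testing with (the values are just the ones
from my experiments):

    [query_server_config]
    ; hard cap on the number of couchjs view server processes
    os_process_limit = 10
    ; idle processes above this count are eligible to be reaped
    os_process_soft_limit = 5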
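And for completeness, the haproxy layer is essentially the stock
round-robin setup, nothing exotic. A trimmed sketch of the backend (the
IPs are placeholders, not my real addresses):

    backend couchdb_cluster
        balance roundrobin
        option httpchk GET /_up
        server couch1 10.0.1.10:5984 check
        server couch2 10.0.2.10:5984 check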
On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]> wrote:

> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs
> as standard. It's fast enough. It's cross-region that you need to avoid.
>
> B.
>
> > On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:
> >
> > Heya Geoff,
> >
> > a CouchDB cluster is designed to run in the same data center / with
> > local-area networking latencies. A cluster across AWS Availability
> > Zones won't work, as you are seeing. If you want CouchDBs in both AZs,
> > use regular replication and keep the clusters local to the AZ.
> >
> > Best
> > Jan
> > --
> >
> >> On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> I've spent days using trial and error to try to figure out why I am
> >> getting a very high CPU load on only a single node in my cluster. I'm
> >> hoping someone has an idea of what is going on, as I'm getting stuck.
> >>
> >> Here's my configuration:
> >>
> >> 1. A 2-node cluster:
> >>    1. Each node is located in a different AWS availability zone
> >>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
> >> 2. A haproxy server is load balancing traffic to the nodes using
> >>    round robin
> >>
> >> The problem:
> >>
> >> 1. After users make changes via PouchDB, a backend runs a number of
> >>    routines that use views to calculate notifications. The issue is
> >>    that on a single node, the couchjs processes stack up and then
> >>    start to consume nearly all the available CPU. This server then
> >>    becomes the "workhorse" that always does *all* the heavy-duty
> >>    couchjs processing until I restart this node.
> >> 2. It is important to note that both nodes have couchjs processes,
> >>    but only a single node has couchjs processes that are using 100%
> >>    CPU.
> >> 3. I've even resorted to setting `os_process_limit = 10` and this
> >>    just results in each couchjs process taking over 10% each! In
> >>    other words, the couchjs processes eat up all the CPU no matter
> >>    how many couchjs processes there are!
> >> 4. The CPU usage will eventually clear after all the processing is
> >>    done, but as soon as there is more to process the workhorse node
> >>    gets bogged down again.
> >> 5. If I restart the workhorse node, the other node then becomes the
> >>    workhorse. This is the only way to get the couchjs processes to
> >>    "move" to another node.
> >> 6. The problem is that this design is not scalable, as only one node
> >>    can be the workhorse at any given time. Moreover, this causes
> >>    specific instances to run out of CPU credits. Shouldn't the
> >>    couchjs processes be spread out over all my nodes? From what I can
> >>    tell, if I add more nodes I'm still going to have the issue where
> >>    only one node gets bogged down. Is it possible that the problem is
> >>    that I have 2 nodes and really need at least 3? (I know a 2-node
> >>    cluster is not very typical.)
> >>
> >> Things I've checked:
> >>
> >> 1. Ensured that the load balancing is working, i.e. haproxy is indeed
> >>    distributing traffic accordingly.
> >> 2. I've tried setting `os_process_limit = 10` and
> >>    `os_process_soft_limit = 5` to see if I could force more
> >>    conservative usage of couchjs processes, but instead the couchjs
> >>    processes just consume all the CPU.
> >> 3. I've tried simulating the issue locally with VMs and I cannot
> >>    duplicate any such load.
> >>    My guess is that this is because the nodes are located on the
> >>    same box, so the hop distance between nodes is very small and this
> >>    somehow keeps the CPU usage to a minimum.
> >> 4. I've tried isolating the issue with short code snippets that
> >>    intentionally spawn a lot of couchjs processes; they are spawned
> >>    but don't consume 100% CPU.
> >> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
> >>    doesn't seem to change anything.
> >> 6. The only error entries in my CouchDB logs are like the following,
> >>    and I don't believe they are related to my issue:
> >>
> >>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
> >>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa:
> >>    fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to
> >>    access this db.">>}
> >>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
> >>
> >> Does CouchDB have some logic built in that spawns a number of couchjs
> >> processes on a "primary" node? Will future view processing then
> >> always be routed to this "primary" node?
> >>
> >> Is there a way to better distribute these heavy-duty couchjs
> >> processes? Is it possible to limit their CPU consumption? (I'm
> >> hesitant to start down the path of using something like cpulimit, as
> >> I think there is a root problem that needs to be addressed.)
> >>
> >> I'm running out of ideas and hope that someone has some notion of
> >> what is causing this bizarre load, or whether there is a bug in
> >> CouchDB.
> >>
> >> Thank you for any help you can provide!
> >>
> >> Geoff
> >
> > --
> > Professional Support for Apache CouchDB:
> > https://neighbourhood.ie/couchdb-support/
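P.S. Jan: if I do end up splitting this into one cluster per AZ, am I
right that the glue is just continuous replication per database between
the two? I'm picturing something like the following against the standard
_replicator database (hostnames and credentials are placeholders), plus a
mirror-image doc for the other direction:

    curl -X PUT http://admin:pass@localhost:5984/_replicator/az1-to-az2-mydb \
      -H 'Content-Type: application/json' \
      -d '{
        "source": "http://admin:pass@couch-az1.example.com:5984/mydb",
        "target": "http://admin:pass@couch-az2.example.com:5984/mydb",
        "continuous": true
      }'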
