Hi Geoff, a few additional questions:

1) Are you making these view requests with stale=ok or stale=update_after?
2) What are you using for n and q in the [cluster] configuration settings?
3) Did you take advantage of the (barely-documented) “zones” attribute when
defining cluster members? (See the sketch below.)
4) Do you have any other JS code besides the view definitions?
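
For reference on #2 and #3, here’s roughly how I’d check and set those. This
is just a sketch against the node-local port (5986); the node name, zone
names, and placement string are placeholders for your own values:

    # Check n and q in this node's [cluster] config:
    curl http://localhost:5986/_config/cluster

    # Tag each node with a zone via the node-local _nodes db
    # (GET the doc first and reuse its _rev in the PUT):
    curl http://localhost:5986/_nodes/[email protected]
    curl -X PUT http://localhost:5986/_nodes/[email protected] \
         -d '{"_rev": "<rev-from-GET>", "zone": "us-east-1a"}'

    # Then pin how shard copies spread across zones, e.g. one copy per AZ:
    curl -X PUT http://localhost:5986/_config/cluster/placement \
         -d '"us-east-1a:1,us-east-1b:1"'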

Regarding #1, the cluster will actually select shards differently depending on
the use of those query parameters. When your request stipulates that you’re OK
with stale results, the cluster *will* select a “primary” copy in order to
improve the consistency of repeated requests to the same view. The algorithm
for choosing those primary copies is somewhat subtle, hence my question #3.
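
Concretely, the two flavors of request I mean look like this (the database,
design doc, and view names below are placeholders):

    # Fresh read: the index is updated before responding, and any copy
    # of a shard may serve the request:
    curl 'http://localhost:5984/mydb/_design/mydoc/_view/myview'

    # Stale reads: possibly out-of-date results, and the cluster prefers a
    # stable "primary" copy of each shard for consistency between reads:
    curl 'http://localhost:5984/mydb/_design/mydoc/_view/myview?stale=ok'
    curl 'http://localhost:5984/mydb/_design/mydoc/_view/myview?stale=update_after'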

If you’re not using stale requests, I have a much harder time explaining why
the 100% CPU issue would migrate from node to node like that.
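
One aside while you’re narrowing things down: the couchjs limits you mentioned
can be inspected and changed at runtime over the config API, so you don’t need
a restart to experiment. Again a sketch against the node-local port:

    # Read the current [query_server_config] section:
    curl http://localhost:5986/_config/query_server_config

    # Adjust the couchjs process cap on the fly (values are JSON strings):
    curl -X PUT http://localhost:5986/_config/query_server_config/os_process_limit \
         -d '"10"'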

Adam

> On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <[email protected]> wrote:
> 
> Thanks for the responses, any other thoughts?
> 
> FYI: I’m trying to work on a very focused test case that I can share with
> the Dev team, but it is taking a little while to narrow down the exact
> cause.
> On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]>
> wrote:
> 
>> Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs
>> as standard. It's fast enough. It's cross-region that you need to avoid.
>> 
>> B.
>> 
>>> On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:
>>> 
>>> Heya Geoff,
>>> 
>>> a CouchDB cluster is designed to run in the same data center / with
>>> local-area networking latencies. A cluster across AWS Availability Zones
>>> won’t work, as you’re seeing. If you want CouchDB in both AZs, use regular
>>> replication and keep the clusters local to each AZ.
>>> 
>>> Best
>>> Jan
>>> --
>>> 
>>>> On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I've spent days using trial and error trying to figure out why I am
>>>> getting a very high CPU load on only a single node in my cluster. I'm
>>>> hoping someone has an idea of what is going on as I'm getting stuck.
>>>> 
>>>> Here's my configuration:
>>>> 
>>>> 1. 2-node cluster:
>>>>    1. Each node is located in a different AWS availability zone
>>>>    2. Each node is a t2.medium instance (2 CPU cores, 4 GB RAM)
>>>> 2. An HAProxy server is load balancing traffic to the nodes using
>>>> round-robin
>>>> 
>>>> The problem:
>>>> 
>>>> 1. After users make changes via PouchDB, a backend runs a number of
>>>> routines that use views to calculate notifications. The issue is that on
>>>> a single node, the couchjs processes stack up and then start to consume
>>>> nearly all the available CPU. This server then becomes the "workhorse"
>>>> that always does *all* the heavy-duty couchjs processing until I restart
>>>> this node.
>>>> 2. It is important to note that both nodes have couchjs processes, but
>>>> it is only a single node whose couchjs processes are using 100% CPU.
>>>> 3. I've even resorted to setting `os_process_limit = 10` and this just
>>>> results in each couchjs process taking over 10% of the CPU! In other
>>>> words, the couchjs processes eat up all the CPU no matter how many of
>>>> them there are!
>>>> 4. The CPU usage will eventually clear after all the processing is done,
>>>> but then as soon as there is more to process, the workhorse node will get
>>>> bogged down again.
>>>> 5. If I restart the workhorse node, the other node then becomes the
>>>> workhorse node. This is the only way to get the couchjs processes to
>>>> "move" to another node.
>>>> 6. The problem is that this design is not scalable, as only one node can
>>>> be the workhorse node at any given time. Moreover, this causes specific
>>>> instances to run out of CPU credits. Shouldn't the couchjs processes be
>>>> spread out over all my nodes? From what I can tell, if I add more nodes
>>>> I'm still going to have the issue where only one of the nodes is getting
>>>> bogged down. Is it possible that the problem is that I have 2 nodes and
>>>> really I need at least 3? (I know a 2-node cluster is not very typical.)
>>>> 
>>>> 
>>>> Things I've checked:
>>>> 
>>>> 1. Ensured that the load balancing is working, i.e. HAProxy is indeed
>>>> distributing traffic accordingly
>>>> 2. I've tried setting `os_process_limit = 10` and
>>>> `os_process_soft_limit = 5` to see if I could force more conservative
>>>> usage of couchjs processes, but instead the couchjs processes just
>>>> consume all the CPU.
>>>> 3. I've tried simulating the issue locally with VMs and I cannot
>>>> duplicate any such load. My guess is that this is because the nodes are
>>>> located on the same box, so the hop distance between nodes is very small
>>>> and this somehow keeps the CPU usage to a minimum
>>>> 4. I've tried isolating the issue by creating short code snippets that
>>>> intentionally try to spawn a lot of couchjs processes; they are spawned
>>>> but don't consume 100% CPU
>>>> 5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
>>>> doesn't seem to change anything
>>>> 6. The only error entries in my CouchDB logs are like the following, and
>>>> I don't believe they are related to my issue:
>>>> 
>>>>    [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79>
>>>>    4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa:
>>>>    fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to
>>>>    access this db.">>}
>>>>    [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},
>>>>     {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},
>>>>     {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>> 
>>>> Does CouchDB have some logic built in that spawns a number of couchjs
>>>> processes on a "primary" node? Will future view processing then always
>>>> be routed to this "primary" node?
>>>> 
>>>> Is there a way to better distribute these heavy-duty couchjs processes?
>>>> Is it possible to limit their CPU consumption? (I'm hesitant to start
>>>> down the path of using something like cpulimit, as I think there is a
>>>> root problem that needs to be addressed.)
>>>> 
>>>> I'm running out of ideas and hope that someone has some notion of what
>>>> is causing this bizarre load, or can say whether there is a bug in
>>>> CouchDB.
>>>> 
>>>> Thank you for any help you can provide!
>>>> 
>>>> Geoff
>>> 
>>> --
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>> 
>> 
