Interesting, I read somewhere that having a view per ddoc is more efficient. Thanks for clarifying!

On Wed, Dec 6, 2017 at 1:31 AM Jan Lehnardt <[email protected]> wrote:
On 5. Dec 2017, at 21:13, Geoffrey Cox <[email protected]> wrote:

Hey Adam,

Attached is my local.ini and the design doc with the view JS. Please see my responses below. Thanks for the help!

On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <[email protected]> wrote:

Hi Geoff, a couple of additional questions:

1) Are you making these view requests with stale=ok or stale=update_after?

GC: I am not using the stale parameter.

2) What are you using for N and Q in the [cluster] configuration settings?

GC: As per the attached local.ini, I specified n=2 and am using the default q=8.

3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?

GC: As per the attached local.ini, I have *not* specified this option.

4) Do you have any other JS code besides the view definitions?

GC: By JS code I think you mean JS code "in" CouchDB, and if that is the case then my only JS code is very simple views like those in the attached views.json. (I know that I really need to break out the views so that there is one view per ddoc, but I haven't quite gotten around to refactoring this, and I don't believe it is causing the CPU usage.)

Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all the views an app needs into a single ddoc.

For each ddoc, all docs in a database have to be serialised and shipped to couchjs, and the results shipped back; that is the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that work more efficient.

Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.

If you're not using stale requests, I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.

Adam

On Dec 5, 2017, at 9:36 AM, Geoffrey Cox <[email protected]> wrote:

Thanks for the responses, any other thoughts?

FYI: I'm trying to work on a very focused test case that I can share with the dev team, but it is taking a little while to narrow down the exact cause.

On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <[email protected]> wrote:

Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region deployments that you need to avoid.

B.

On 5 Dec 2017, at 09:11, Jan Lehnardt <[email protected]> wrote:

Heya Geoff,

a CouchDB cluster is designed to run within a single data center, i.e. with local-area networking latencies. A cluster spread across AWS Availability Zones won't work well, as you are seeing. If you want CouchDBs in both AZs, use regular replication and keep each cluster local to its AZ.

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
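A rough sketch of the layout Jan suggests just above (one cluster per AZ, linked by regular replication). This is only an illustration: the hostnames, database name, credentials and document _id are placeholders, not taken from this thread.

```js
// Hypothetical sketch: continuous replication between one CouchDB cluster per
// AZ, instead of stretching a single cluster across AZs.
const replicationDoc = {
  _id: "az1-to-az2-mydb",                       // placeholder name
  source: "https://couch-az1.example.com/mydb", // cluster in AZ 1
  target: "https://couch-az2.example.com/mydb", // cluster in AZ 2
  continuous: true                              // keep the two databases in sync
};

// Creating the document in the _replicator database starts the replication;
// deleting it stops it. Credentials are placeholders.
const auth = "Basic " + Buffer.from("admin:secret").toString("base64");
fetch("https://couch-az1.example.com/_replicator", {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: auth },
  body: JSON.stringify(replicationDoc)
}).then(res => res.json()).then(console.log);
```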
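Similarly, Jan's earlier point about grouping views can be illustrated with a small sketch of a single design document holding more than one view. The ddoc name, view names and map functions are hypothetical, not the ones in the attached views.json.

```js
// Hypothetical sketch: several views grouped in one design document.
const designDoc = {
  _id: "_design/app", // placeholder ddoc name
  views: {
    notifications_by_user: {
      map: "function (doc) { if (doc.type === 'notification') { emit(doc.userId, null); } }"
    },
    changes_by_date: {
      map: "function (doc) { if (doc.updatedAt) { emit(doc.updatedAt, null); } }"
    }
  }
};
// With this layout, each document in the database is serialised and sent to
// couchjs once for this ddoc, and both map functions are evaluated in that
// single pass; one ddoc per view would repeat that shipping work per view.
```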
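And Adam's point about shard selection refers to the stale query parameter on view requests. A minimal sketch of the two request styles, with placeholder database, ddoc and view names (auth omitted):

```js
// Hypothetical illustration of the two request styles Adam distinguishes.
const view = "http://localhost:5984/mydb/_design/app/_view/notifications_by_user";

// Default request: the index is brought up to date before responding, and any
// copy of each shard may be chosen to answer.
fetch(view).then(res => res.json()).then(console.log);

// stale=ok (or stale=update_after): the currently built index is returned, and
// the cluster prefers the "primary" shard copies, which is where the "zones"
// configuration Adam asks about comes into play.
fetch(`${view}?stale=ok`).then(res => res.json()).then(console.log);
```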
On 4. Dec 2017, at 19:46, Geoffrey Cox <[email protected]> wrote:

Hi,

I've spent days using trial and error to try and figure out why I am getting a very high CPU load on only a single node in my cluster. I'm hoping someone has an idea of what is going on, as I'm getting stuck.

Here's my configuration:

1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
2. A haproxy server is load balancing traffic to the nodes using round robin

The problem:

1. After users make changes via PouchDB, a backend runs a number of routines that use views to calculate notifications. The issue is that on a single node the couchjs processes stack up and then start to consume nearly all the available CPU. This server then becomes the "workhorse" that always does *all* the heavy-duty couchjs processing until I restart this node.
2. It is important to note that both nodes have couchjs processes, but only a single node has couchjs processes that are using 100% CPU.
3. I've even resorted to setting `os_process_limit = 10` and this just results in each couchjs process taking over 10% each! In other words, the couchjs processes eat up all the CPU no matter how many couchjs processes there are!
4. The CPU usage will eventually clear after all the processing is done, but as soon as there is more to process the workhorse node gets bogged down again.
5. If I restart the workhorse node, the other node then becomes the workhorse node. This is the only way to get the couchjs processes to "move" to another node.
6. The problem is that this design is not scalable, as only one node can be the workhorse node at any given time. Moreover, this causes specific instances to run out of CPU credits. Shouldn't the couchjs processes be spread out over all my nodes? From what I can tell, if I add more nodes I'm still going to have the issue where only one of the nodes gets bogged down. Is it possible that the problem is that I have 2 nodes and really need at least 3? (I know a 2-node cluster is not very typical.)

Things I've checked:

1. Ensured that the load balancing is working, i.e. haproxy is indeed distributing traffic accordingly.
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit = 5` to see if I could force a more conservative usage of couchjs processes (see the config sketch after this list), but instead the couchjs processes just consume all the CPU.
3. I've tried simulating the issue locally with VMs and I cannot duplicate any such load. My guess is that this is because the nodes are located on the same box, so the hop distance between nodes is very small and this somehow keeps the CPU usage to a minimum.
4. I've tried isolating the issue by creating short code snippets that intentionally try to spawn a lot of couchjs processes; they are spawned but don't consume 100% CPU.
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this doesn't seem to change anything.
6. The only error entries in my CouchDB logs are like the following, and I don't believe they are related to my issue:

   [error] 2017-12-04T18:13:38.728970Z [email protected] <0.13974.79> 4b0b21c664 rexi_server: from: [email protected](<0.20638.79>) mfa: fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access this db.">>} [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
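The os_process_limit / os_process_soft_limit settings Geoff mentions in the list above belong to the [query_server_config] section of local.ini. As a hedged sketch only: assuming a CouchDB 2.x release that exposes the per-node configuration endpoint (/_node/{node-name}/_config), they could also be adjusted at runtime like this (node name, credentials and values are placeholders):

```js
// Hypothetical sketch of changing the couchjs process limits at runtime via
// the per-node config API. The same keys can be set in local.ini under
// [query_server_config].
const node = "couchdb@127.0.0.1"; // placeholder node name
const base = `http://localhost:5984/_node/${node}/_config/query_server_config`;
const auth = "Basic " + Buffer.from("admin:secret").toString("base64"); // placeholder credentials

// Each PUT takes the new value as a JSON-encoded string and returns the old one.
fetch(`${base}/os_process_limit`, { method: "PUT", headers: { Authorization: auth }, body: JSON.stringify("10") })
  .then(() => fetch(`${base}/os_process_soft_limit`, { method: "PUT", headers: { Authorization: auth }, body: JSON.stringify("5") }))
  .then(res => res.json())
  .then(previous => console.log("previous os_process_soft_limit:", previous));
```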
Does CouchDB have some logic built in that spawns a number of couchjs processes on a "primary" node? Will future view processing then always be routed to this "primary" node?

Is there a way to better distribute these heavy-duty couchjs processes? Is it possible to limit their CPU consumption? (I'm hesitant to start down the path of using something like cpulimit, as I think there is a root problem that needs to be addressed.)

I'm running out of ideas and hope that someone has some notion of what is causing this bizarre load, or whether there is a bug in CouchDB.

Thank you for any help you can provide!

Geoff

<views.json>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
