Re: The state of filtered replication

Stefan du Fresne Wed, 25 May 2016 03:41:08 -0700

Hi Pedro,

Thanks for your advice.


This is definitely something that is in the back of our minds, along with 
looking into couchdb clustering. Another similar option we’re considering is 
having filtered replication between those replicas and having them represent 
regions (our data permission structure is basically report <- person <- family 
<- region <- larger region <- still larger region). This would still involve 
filtered replication, but would cut down on irrelevant documents that users had 
to filter through. We’re still at the stage of trying to get the most out of 
one server, however. 
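
To make the regional idea a bit more concrete, here is a minimal TypeScript 
sketch of the membership check a per-region filter would need. The `ParentMap` 
shape, the document ids and `belongsToRegion` are all made up for illustration 
and are not our actual schema:

```typescript
// Hypothetical shape: each id maps to its parent in the hierarchy
// (report -> person -> family -> region -> larger region ...).
type ParentMap = Record<string, string | undefined>;

// Walk up from `docId` and report whether `regionId` is the doc itself
// or one of its ancestors. The `seen` set guards against accidental cycles.
function belongsToRegion(
  parents: ParentMap,
  docId: string,
  regionId: string
): boolean {
  const seen = new Set<string>();
  let current: string | undefined = docId;
  while (current !== undefined && !seen.has(current)) {
    if (current === regionId) return true;
    seen.add(current);
    current = parents[current];
  }
  return false;
}

// Illustrative data only.
const parents: ParentMap = {
  "report-1": "person-1",
  "person-1": "family-1",
  "family-1": "region-north",
  "region-north": "region-all",
  "report-2": "person-2",
  "person-2": "family-2",
  "family-2": "region-south",
  "region-south": "region-all",
};
```

A real CouchDB filter only sees the document and the request, so the ancestry 
would have to be stored on each document rather than looked up like this.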

On your example though, to be clear, assigning users to replicas is something 
that I have to manage myself, correct? Do you know if a particular user needs 
to stay on the same replica or if I could just dumbly direct them to any 
existing node? Naively I’d think that I could do the latter, but I’ve noticed 
one-way replication seems to involve passing some metadata back to the server 
(Pouch does this, though I’ve never really looked into what it’s sending or 
what Couch does with it), so it’s not clear how stateful this kind of thing is.
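
If that metadata does turn out to be tied to a particular node, the safe 
option would be to keep each user on one replica. A minimal TypeScript sketch 
of that "sticky" assignment, assuming replicas are addressed by URL; 
`pickReplica` and the hashing scheme are my own illustration, not anything 
Couch or Pouch provides:

```typescript
// Deterministically map a user to one replica so that repeat syncs
// from the same user always land on the same node.
function pickReplica(userId: string, replicas: string[]): string {
  if (replicas.length === 0) {
    throw new Error("no replicas configured");
  }
  // FNV-1a-style 32-bit hash over the UTF-16 code units of the user id.
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it unsigned 32-bit
  }
  const chosen = replicas[hash % replicas.length];
  if (chosen === undefined) {
    throw new Error("replica index out of range");
  }
  return chosen;
}
```

The catch is that the mapping changes for many users whenever the replica 
list changes, so adding a node would move some users to a replica they have 
never synced with.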

Cheers,
Stefan

> On 25 May 2016, at 09:51, Pedro Narciso García Revington 
> <[email protected]> wrote:
> 
> Because couchdb supports master master replication you can alter your
> schema to:
> 
> master couchdb → couchdb replica 1 → some clients
>                → couchdb replica 2 → some other clients
> 
> So you can distribute the load between the replicas.
> 
> 2016-05-25 10:34 GMT+02:00 Stefan du Fresne <[email protected]>:
> 
>> Hello all,
>> 
>> I work on an app that involves a large amount of CouchDB filtered
>> replication (every user has a filtered subset of the DB locally via
>> PouchDB). Currently filtered replication is our number 1 performance
>> bottleneck for rolling out to more users, and I'm trying to work out where
>> we can go from here.
>> 
>> Our current setup is one CouchDB database and N PouchDB installations,
>> which all two-way replicate, with the CouchDB->PouchDB replication being
>> filtered based on user permissions / relevance [1].
>> 
>> Our issue is that as we add users a) total document creation velocity
>> increases, and b) the proportion of documents that are relevant to any
>> particular user decreases. These two points cause replication-- both
>> initial onboarding and continual-- to take longer and longer.
>> 
>> At this stage we are being forced to manually limit the number of users we
>> onboard at any particular time to half a dozen or so, or risk CouchDB being
>> unresponsive [2]. As we'd want to be onboarding 50-100 at any particular
>> time due to how we're rolling out, you can imagine that this is pretty
>> painful.
>> 
>> I have already re-written the filter in Erlang, which halved its execution
>> time, which is awesome!
>> 
>> I also attempted to simplify the filter to increase performance. However,
>> filter speed seems more dependent on the physical size of your filter as
>> opposed to what code executes, which makes writing a simple filter that can
>> fall back to a complicated filter not terribly useful (see:
>> https://issues.apache.org/jira/browse/COUCHDB-3021)
>> 
>> If the above linked ticket is fixed (if it can be) this would make our
>> filter 3-4x faster again. However, this still wouldn't address the
>> fundamental issue that filtered replication is very CPU-intensive, and so
>> as noted above doesn't seem to scale terribly well.
>> 
>> Ideally then, I would like to remove filtered replication completely, but
>> there does not seem to be a good alternative right now.
>> 
>> Looking through the archives there was talk of adding view replication,
>> see:
>> https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E
>> but it doesn't look like this ever got resolved.
>> 
>> There is also often talk of databases per user being a good scaling
>> strategy, but we're basically doing that already (with PouchDB), and for
>> us documents aren't owned / viewed by just one person so this does not get
>> us away from filtered replication (e.g. a supervisor replicates her documents
>> as well as N sub-users' documents). There are potentially wild and crazy
>> schemes that involve many different databases where the equivalent of
>> filtering is expressed in replication relationships, but this would add a
>> massive amount of complexity to our app, and I’m not even convinced it
>> would work as there are lots of edge cases to consider.
>> 
>> Does anyone know of anything else I can try to increase replication
>> performance? Or to safeguard against many replicators unacceptably
>> degrading couchdb performance? Does Couch 2.0 address any of these concerns?
>> 
>> Thanks in advance,
>> - Stefan du Fresne
>> 
>> [1] security is handled by not exposing couch and going through a wrapper
>> service that validates couch requests, relevance is hierarchy based (i.e.
>> documents you or your subordinates are authors of are replicated to you)
>> [2] there are also administrators / configurers that access couchdb
>> directly
