The state of filtered replication

Stefan du Fresne Wed, 25 May 2016 01:35:44 -0700

Hello all,

I work on an app that involves a large amount of CouchDB filtered replication 
(every user has a filtered subset of the DB locally via PouchDB). Currently 
filtered replication is our number 1 performance bottleneck for rolling out to 
more users, and I'm trying to work out where we can go from here.


Our current setup is one CouchDB database and N PouchDB installations, which 
all two-way replicate, with the CouchDB->PouchDB replication being filtered 
based on user permissions / relevance [1].
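
To make the relevance rule concrete, here is a minimal sketch in Python of the kind of predicate such a filter evaluates for each document. It models only the logic from footnote [1]; it is not our actual filter. The `author` field, the `SUPERVISOR_OF` table and the user names are hypothetical, made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: each user maps to their direct supervisor
# (None for the top of the tree). Not taken from the real app.
SUPERVISOR_OF: Dict[str, Optional[str]] = {
    "alice": None,
    "bob": "alice",
    "carol": "bob",
    "dave": "alice",
}


def is_relevant(doc: dict, user: str) -> bool:
    """Return True if `doc` was authored by `user` or by one of their subordinates."""
    author = doc.get("author")  # assumed field name
    seen = set()  # guards against accidental cycles in the hierarchy
    while author is not None and author not in seen:
        if author == user:
            return True
        seen.add(author)
        author = SUPERVISOR_OF.get(author)  # walk up one level
    return False


print(is_relevant({"author": "carol"}, "alice"))  # carol -> bob -> alice: True
print(is_relevant({"author": "dave"}, "bob"))     # dave reports to alice only: False
```

A check like this has to run once per changed document for each replicating user, which is why its cost matters so much.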

Our issue is that as we add users a) total document creation velocity 
increases, and b) the proportion of documents that are relevant to any 
particular user decreases. Together these cause replication, both initial 
onboarding and continuous, to take longer and longer.
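
A rough back-of-the-envelope model of why this compounds, assuming every replicating user's filter is evaluated against every change in the database and that each user only cares about documents from a team of fixed size. The function names and the numbers are illustrative assumptions, not measurements from our deployment.

```python
def filter_calls(users: int, docs_per_user: int) -> int:
    """Filter evaluations needed for every user to catch up on every change."""
    total_changes = users * docs_per_user  # a) creation velocity grows with users
    return users * total_changes           # each user filters the whole feed


def relevant_fraction(users: int, team_size: int) -> float:
    """Share of all documents that one user actually wants (capped at 1.0)."""
    return min(1.0, team_size / users)     # b) relevance shrinks as users grow


for n in (10, 100):
    print(n, filter_calls(n, 100), relevant_fraction(n, 5))
```

Under these assumptions, ten times as many users means a hundred times as many filter evaluations, while the share of evaluations that return a useful document drops by a factor of ten.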

At this stage we are being forced to manually limit the number of users we 
onboard at any particular time to half a dozen or so, or risk CouchDB becoming 
unresponsive [2]. As we'd want to be onboarding 50-100 at any particular time 
due to how we're rolling out, you can imagine that this is pretty painful.

I have already re-written the filter in Erlang, which halved its execution 
time, which is awesome!

I also attempted to simplify the filter to increase performance. However, 
filter speed seems more dependent on the physical size of your filter as 
opposed to what code executes, which makes writing a simple filter that can 
fall back to a complicated filter not terribly useful (see: 
https://issues.apache.org/jira/browse/COUCHDB-3021).

If the above linked ticket is fixed (if it can be) this would make our filter 
3-4x faster again. However, this still wouldn't address the fundamental issue 
that filtered replication is very CPU-intensive, and so as noted above doesn't 
seem to scale terribly well.

Ideally, then, I would like to remove filtered replication completely, but 
there does not seem to be a good alternative right now.

Looking through the archives there was talk of adding view replication, see: 
https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E
but it doesn't look like this ever got resolved.

There is also often talk of databases per user being a good scaling strategy, 
but we're basically doing that already (with PouchDB), and for us documents 
aren't owned / viewed by just one person, so this does not get us away from 
filtered replication (e.g. a supervisor replicates her documents as well as N 
sub-users' documents). There are potentially wild and crazy schemes that 
involve many different databases where the equivalent of filtering is 
expressed in replication relationships, but this would add a massive amount of 
complexity to our app, and I’m not even convinced it would work as there are 
lots of edge cases to consider.
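
For illustration only, a minimal Python sketch of what "filtering expressed in replication relationships" could mean: instead of filtering on read, each document is routed on write to one database per person who should see it. The `userdb-` naming, the `author` field and the `SUPERVISOR_OF` table are hypothetical, and this ignores the edge cases mentioned above (hierarchy changes, deletions, conflicts).

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: user -> direct supervisor (None at the top).
SUPERVISOR_OF: Dict[str, Optional[str]] = {
    "alice": None,
    "bob": "alice",
    "carol": "bob",
}


def target_databases(doc: dict) -> List[str]:
    """List the per-user databases a document would need to be copied into."""
    targets: List[str] = []
    seen = set()  # guards against cycles in the hierarchy
    user = doc.get("author")  # assumed field name
    while user is not None and user not in seen:
        targets.append(f"userdb-{user}")  # assumed naming scheme
        seen.add(user)
        user = SUPERVISOR_OF.get(user)    # the author's supervisors see it too
    return targets


print(target_databases({"author": "carol"}))
# ['userdb-carol', 'userdb-bob', 'userdb-alice']
```

Even in this toy form, every document is stored once per level of the hierarchy above its author, which hints at the extra complexity and storage such a scheme would bring.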

Does anyone know of anything else I can try to increase replication 
performance? Or to safeguard against many replicators unacceptably degrading 
CouchDB performance? Does CouchDB 2.0 address any of these concerns?

Thanks in advance,
- Stefan du Fresne

[1] Security is handled by not exposing CouchDB and going through a wrapper 
service that validates CouchDB requests. Relevance is hierarchy-based (i.e. 
documents you or your subordinates are authors of are replicated to you).
[2] There are also administrators / configurers that access CouchDB directly.
