Hello,

If you don't care about the details or how I came to the conclusions I have, skip to the last paragraph to save some reading.

We are running CouchDB and are designing our system hardware to support around 4-6K databases per machine. These are roughly 16-core machines with 50 GB of RAM and terabyte disk arrays. We run a 3-machine cluster (Master, Failover and Backup) using a daisy-chain replication model (Master -> Failover -> Backup).

We originally hoped to use the continuous = true replication flag with a small _active_tasks watcher to make sure all replication tasks were always running (they do occasionally die). This worked great until we tested with 4K databases: after about 300-450 replication tasks, CouchDB stops responding to requests and eventually dies.
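For reference, the watcher we used was along these lines: compare the replication entries in GET /_active_tasks against the databases we expect to be replicating, and re-POST /_replicate for anything missing. This is a minimal sketch, not our production daemon; the hostnames and database names are placeholders.

```python
# Sketch: restart any continuous replications that have died, by diffing
# /_active_tasks against the desired database list. Hosts are hypothetical.
import json
import urllib.request

SOURCE = "http://master:5984"
TARGET = "http://failover:5984"

def missing_replications(dbs, active_tasks):
    """Return databases with no running replication task.

    `active_tasks` is the parsed JSON from GET /_active_tasks; replication
    entries carry "type": "replication" plus a source URL."""
    running = {t.get("source", "") for t in active_tasks
               if t.get("type") == "replication"}
    return [db for db in dbs
            if not any(src.rstrip("/").endswith("/" + db) for src in running)]

def restart(db):
    # POST /_replicate with continuous=true re-registers a dead task.
    body = json.dumps({"source": "%s/%s" % (SOURCE, db),
                       "target": "%s/%s" % (TARGET, db),
                       "continuous": True}).encode()
    req = urllib.request.Request("%s/_replicate" % SOURCE, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```

The watcher loop itself just fetches _active_tasks periodically and calls restart() for each result of missing_replications().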
So, since we are close to going live, we quickly decided to do manual replication: we have a daemon which replicates the databases itself instead of relying on continuous = true. The approach is very simple and works like this:

    for each database in "http://host:5984/_all_dbs"
        if matches "_filter_function_"
            replicate this database

However, with 4 thousand databases it takes several minutes before all 4 thousand calls are made, causing replication lag.

Our immediate thought was to build some kind of application-level list of modified databases and constantly slurp it in from our PHP web tier. Something like:

    for each request to couch in web cluster
        insert dbname in apc cache

    for each database in call_out_to_all_web_tier_for_touched_dbs()
        replicate this database

This puts a decent strain on our web tier and adds application logic we would prefer to avoid; under high load our web tier may be slow to respond, which is by design.

My final idea, which I have yet to put much time into as I just finished implementing the first solution, is adding some custom handlers to CouchDB. We use couchdb-lucene, so we are aware of the update hooks and external API. In theory, I could implement a simple [httpd_db_handlers] entry that flags databases as changed, then add a [httpd_global_handlers] entry such as _changed_dbs that returns the databases changed since SEQUENCE, or even since the last call to the handler. This approach might be the most native one available, but it would require some diving into CouchDB I haven't done, plus dev work and code management outside of our application, which could lead to maintenance issues as CouchDB changes.

So my next immediate thought is: why does replication need to be configured per database? Per-server replication would work much better for us, and could likely be implemented in CouchDB for optimal efficiency in high-database-count configurations.
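For concreteness, the manual sweep we shipped amounts to the following: fetch _all_dbs, filter, then fire a one-shot (non-continuous) _replicate per match. This is a rough sketch only; the hosts are placeholders and a glob pattern stands in for our actual "_filter_function_".

```python
# Sketch of the sweep daemon: one-shot replication of every database whose
# name matches a filter. Hostnames and the pattern are illustrative.
import fnmatch
import json
import urllib.request

SOURCE = "http://master:5984"
TARGET = "http://failover:5984"
PATTERN = "customer_*"   # stand-in for our real filter function

def dbs_to_replicate(all_dbs, pattern=PATTERN):
    # Filter the GET /_all_dbs result down to the databases we own.
    return [db for db in all_dbs if fnmatch.fnmatch(db, pattern)]

def sweep():
    with urllib.request.urlopen("%s/_all_dbs" % SOURCE) as resp:
        all_dbs = json.load(resp)
    for db in dbs_to_replicate(all_dbs):
        body = json.dumps({"source": "%s/%s" % (SOURCE, db),
                           "target": "%s/%s" % (TARGET, db)}).encode()
        req = urllib.request.Request("%s/_replicate" % SOURCE, data=body,
                                     headers={"Content-Type": "application/json"})
        # Without "continuous", this blocks until the replication finishes,
        # which is exactly where the minutes of lag come from at 4K databases.
        urllib.request.urlopen(req)
```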
Our replication design and automatic failover have grown very complicated; at this point I think it might be worth discussing alternatives within CouchDB. I am also open to hearing how other CouchDB users have implemented replication on high (if 4 thousand really counts as high) db-count machines.

I appreciate your time, and sorry this ended up long-winded, but I wanted to cover this in detail.

Kind Regards,
-Chris
