My largest, ~600GB database was awful to compact. Because much of it seldom changes, I sharded that database by account, yielding about 500 databases of various sizes. With a compaction daemon that only compacts a database when it grows, compaction is no longer a problem. However, I now appear to be suffering when it comes to replication.

Five hundred continuous "pull" replications have the destination database crying for mercy. Its four CPUs are continuously busy (load average ~4) and requests to the destination database occasionally time out.

The replication script starts a "pull" replication for each database, one at a time. The replication requests start out taking about 0.3 seconds per database, but toward the end of the list each request takes many seconds.
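For concreteness, the script does essentially the following (a simplified Python sketch with placeholder hostnames, assuming the stock /_replicate endpoint, not the actual script):

    # Simplified sketch: start a continuous pull replication per database,
    # one at a time, by POSTing to the destination's /_replicate endpoint.
    # SOURCE and TARGET are placeholders for our real hostnames.
    import requests

    SOURCE = "http://source-host:5984"   # placeholder: source server
    TARGET = "http://localhost:5984"     # placeholder: destination server

    def start_pull_replications(db_names):
        for db in db_names:
            resp = requests.post(
                f"{TARGET}/_replicate",
                json={
                    "source": f"{SOURCE}/{db}",  # pull: read from the remote source
                    "target": db,                # write to the local copy
                    "continuous": True,
                },
                timeout=60,
            )
            resp.raise_for_status()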

Shortly after the replication starts, before it's got past more than a few dozen databases, there is a brief flood of stack traces (or whatever Erlang calls them) in the destination couch log. I think there are fewer lines of error info than there are atoms in the sun, but only just. Is there a guide that can help me figure out which lines of that log you'd need to see?

The source database is not suffering: its load average is < 1 and it serves requests quickly.

Due to the number of databases, I've added "ulimit -n 32768" to the startup script.

We're running version 1.2.0ac052866-git on Linux 2.6.32. This version has the new replicator.

* Are we "doing it all wrong?"

* Can I expect the storm to abate once all of the replications are caught up?

* How can I tell which replications are "caught up?" I see that a GET to /_active_tasks tells me that some replication tasks are "Starting" and others have, e.g., "Processed source seq 17", but I don't know whether that's enough to tell what's caught up and what's not. Do I have to query the source database somehow to find out what source sequence is available?
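In case it clarifies what I'm after, here's roughly the per-database check I imagine doing (just a sketch; it assumes the status strings keep the "Processed source seq N" form and that GET on the source database returns an integer update_seq):

    # Sketch only: decide whether one database's pull replication has caught up
    # by comparing the "Processed source seq N" from /_active_tasks on the
    # destination with the source database's current update_seq.
    import re
    import requests

    SOURCE = "http://source-host:5984"   # placeholder: source server
    TARGET = "http://localhost:5984"     # placeholder: destination server

    def processed_seq_for(db, tasks):
        """Extract 'Processed source seq N' for the task replicating db, if any."""
        for task in tasks:
            if db in task.get("task", ""):
                m = re.search(r"Processed source seq (\d+)", task.get("status", ""))
                if m:
                    return int(m.group(1))
        return None  # still "Starting", or no matching task

    def is_caught_up(db):
        tasks = requests.get(f"{TARGET}/_active_tasks", timeout=30).json()
        processed = processed_seq_for(db, tasks)
        if processed is None:
            return False
        source_seq = requests.get(f"{SOURCE}/{db}", timeout=30).json()["update_seq"]
        return processed >= source_seq

Is that the right idea, or is there a better way?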

Best Regards,
Wayne Conrad
