My largest database, ~600 GB, was awful to compact. Because much of it
seldom changes, I sharded it by account, yielding about 500 databases of
various sizes. With a compaction daemon that only compacts a database
a database when it grows, compaction is no longer a problem. However, I
appear to be suffering now when it comes to replication.
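To show what I mean by that daemon, here's a simplified sketch (not the
actual code; the host, the hourly interval, and the in-memory size
tracking are placeholders):

    # Sketch: compact a database only when its file has grown.
    import json, time, urllib.request

    COUCH = "http://localhost:5984"  # placeholder

    def get_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    last_size = {}  # db name -> disk_size when we last compacted it

    while True:
        for db in get_json(COUCH + "/_all_dbs"):
            size = get_json("%s/%s" % (COUCH, db))["disk_size"]
            # Compact only if the file is bigger than it was last time.
            if size > last_size.get(db, 0):
                req = urllib.request.Request(
                    "%s/%s/_compact" % (COUCH, db), data=b"",
                    headers={"Content-Type": "application/json"})
                urllib.request.urlopen(req).close()
                last_size[db] = size
        time.sleep(3600)  # sweep once an hour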
Five hundred continuous "pull" replications have the destination server
crying for mercy. Its four CPUs are continuously busy (load average ~4)
and requests to the destination database occasionally time out.
The replication script starts a "pull" replication for each database,
one at a time. The replication requests start out taking about 0.3
seconds per database, but towards the end of the list each request is
taking many seconds.
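To show the shape of what the script does, here's a simplified sketch
(not the actual script; the host names are placeholders and the real
database list comes from elsewhere):

    # Sketch: start one continuous "pull" replication per database,
    # one at a time, timing each /_replicate request.
    import json, time, urllib.request

    SOURCE = "http://source-host:5984"  # placeholder
    TARGET = "http://localhost:5984"    # placeholder

    def get_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def post_json(url, body):
        req = urllib.request.Request(
            url, data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"})
        started = time.time()
        urllib.request.urlopen(req).close()
        return time.time() - started

    for db in get_json(SOURCE + "/_all_dbs"):
        elapsed = post_json(TARGET + "/_replicate", {
            "source": "%s/%s" % (SOURCE, db),
            "target": db,
            "continuous": True,
        })
        print("%s: started in %.1f s" % (db, elapsed))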
Shortly after the replication starts, before it has gotten past more than
a few dozen databases, there is a brief flood of stack traces (or whatever
Erlang calls them) in the destination couch log. I think there are fewer
lines of error info than there are atoms in the sun, but only just. Is
there a guide that can help me know which lines of that log you need to
see?
The source server is not suffering: its load average is < 1 and it
serves requests quickly.
Due to the number of databases, I've added "ulimit -n 32768" to the
startup script.
We're running version 1.2.0ac052866-git on Linux 2.6.32. This version
has the new replicator.
* Are we "doing it all wrong?"
* Can I expect the storm to abate once all of the replications are
caught up?
* How can I tell which replications are "caught up"? I see that a GET
to /_active_tasks tells me that some replication tasks are "Starting"
and others have, e.g., "Processed source seq 17", but I don't know
whether that is enough to tell what's caught up and what's not. Do I
have to query the source database somehow to find out what source
sequence is available? (A sketch of the comparison I have in mind
follows these questions.)
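For what it's worth, here is the kind of check I have in mind, purely as
a sketch: the host names, the _active_tasks field names ("type", "task",
"status"), and the way I match tasks to databases are all guesses based
on what I see in the output.

    # Sketch: compare each source database's update_seq with the
    # "Processed source seq N" reported for its replication task.
    import json, re, urllib.request

    SOURCE = "http://source-host:5984"  # placeholder
    TARGET = "http://localhost:5984"    # placeholder

    def get_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    tasks = get_json(TARGET + "/_active_tasks")

    for db in get_json(TARGET + "/_all_dbs"):
        if db.startswith("_"):
            continue  # skip system databases
        # Highest sequence currently available on the source side.
        available = get_json("%s/%s" % (SOURCE, db))["update_seq"]
        # Find the replication task that mentions this database; matching
        # by substring is a guess about the "task" field's format.
        status = next((str(t.get("status", "")) for t in tasks
                       if str(t.get("type", "")).lower() == "replication"
                       and db in str(t.get("task", ""))), "")
        m = re.search(r"Processed source seq (\d+)", status)
        processed = int(m.group(1)) if m else 0
        state = "caught up?" if processed >= available else "behind"
        print("%-40s available=%-8s processed=%-8s %s"
              % (db, available, processed, state))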
Best Regards,
Wayne Conrad