Hello,

I will describe my problem in general terms. If more details are needed, I will try to gather them from my production environments.

We have several CouchDB instances, each with a number of databases. Some of these databases are connected via replication. Some replications run through an SSH tunnel, others over a direct internet connection. The latency between CouchDB instances ranges from a few milliseconds to several hundred milliseconds.

My problem is that replications stop very often. This could be due to lost connectivity (sometimes the SSH tunnels fail and must be recreated), but that is not the only reason. Worse, the replications are not restarted automatically: they stay in the error state. The problem is so frequent that I run a replication monitor process every 5 minutes that looks for replications in error, and deletes and recreates their replication documents. This is the only method I have found to reliably restart the replications.

Is anybody else experiencing similar problems? Do you have any suggestions for making replications more robust against connectivity issues? Are there other ways to restart failed replications, apart from redefining them?

Thanks,
Daniel Gonzalez
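For reference, here is a minimal sketch of the kind of monitor described above. It is not the actual script, and it rests on several assumptions: the replications are defined as documents in the _replicator database, a failed replication is marked by the server with a "_replication_state" field equal to "error" (as in CouchDB 1.x; newer versions report state differently), the server is reachable at the hypothetical COUCH_URL, and no authentication is required. The helper names (is_failed, clean_doc, couch_request, restart_failed) are made up for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

# Hypothetical server address; adjust for the real environment.
COUCH_URL = "http://localhost:5984"


def is_failed(doc):
    """True if the replication document is marked as being in error.

    Assumes the server writes a "_replication_state" field into the
    document, as CouchDB 1.x does.
    """
    return doc.get("_replication_state") == "error"


def clean_doc(doc):
    """Return a copy of the document without server-managed fields.

    Every field starting with "_" is dropped except "_id", so the copy
    can be stored again as a fresh replication definition.
    """
    return {k: v for k, v in doc.items() if not k.startswith("_") or k == "_id"}


def couch_request(method, path, body=None):
    """Send a JSON request to CouchDB and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        COUCH_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def restart_failed():
    """Delete and recreate every replication document that is in error."""
    result = couch_request("GET", "/_replicator/_all_docs?include_docs=true")
    restarted = []
    for row in result["rows"]:
        doc = row["doc"]
        # Skip design documents and replications that are not in error.
        if doc["_id"].startswith("_design/") or not is_failed(doc):
            continue
        doc_path = "/_replicator/" + quote(doc["_id"], safe="")
        # Delete the current revision, then store a clean copy.
        couch_request("DELETE", doc_path + "?rev=" + quote(doc["_rev"], safe=""))
        couch_request("PUT", doc_path, clean_doc(doc))
        restarted.append(doc["_id"])
    return restarted
```

Calling restart_failed() from cron or a similar scheduler every 5 minutes would approximate the behaviour described. Credentials, error handling, and logging are left out to keep the sketch short.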
