On May 14, 2009, at 6:26 PM, Matt Goodall wrote:

Secondly, if the network connection fails in the middle of replication
(closing an ssh tunnel is a good way to test this ;-)) then it seems to
retry a few (10) times before the replicator process terminates. If
the network connection becomes available again (restart the ssh
tunnel) the replicator doesn't seem to notice. Also, I just noticed
that Futon still lists the replication on its status page.

That's correct, the replicator does try to ignore transient failures.

Hmm, it seemed to fail on transient failures here. After killing and
restarting my ssh tunnel I left the replication a while and it never
seemed to continue, and the only way to clear it from the status list
was to restart the couchdb server. I'll check again though.

Ok, I misread you earlier. It's possible that CouchDB or ibrowse is
trying to reuse a socket when it really should be opening a new one.
That would be a bug.

This one definitely seems like a bug. Killing and restarting my SSH
tunnel basically kills the replication; I can see no sign of it
resuming.

You get this in the log ...

<snip>

and then nothing.

Worst of all is that couch still thinks the replication is running and
refuses to start another one. Currently, the only solution is to
restart the couch server :-/.

Thanks again for catching this bug, Matt. The example you showed occurs when we record a checkpoint record, but there was also a similar problem with writing attachments to disk. I've committed a very simplistic fix for the problem; the replicator should now realize that these requests are never going to complete and commit seppuku. Not the most elegant solution, perhaps, but it's certainly better than restarting the server. The error message should take one of the following forms (still working on standardizing these error messages, of course):

{"error":"replication_link_failure", "reason":"{gen_server, call ...}"}
{"error":"internal_server_error", "reason":"replication_link_failure"}
{"error":"attachment_request_failed", "reason":"failed to replicate http://..."}
{"error":"attachment_request_failed", "reason":"ibrowse error on http://... : Reason"}
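
For anyone scripting around this in the meantime, a client could recognize the forms above roughly as follows. This is only a sketch in Python, not part of CouchDB: the function name is_replication_failure and the REPLICATION_FAILURES set are made up for illustration, and since the error names are still being standardized the set is an assumption that may need updating.

```python
import json

# Error names copied from the forms listed above. They were still being
# standardized when this was written, so treat this set as an assumption.
REPLICATION_FAILURES = {"replication_link_failure", "attachment_request_failed"}


def is_replication_failure(body):
    """Return True if a JSON error body matches one of the forms above.

    Hypothetical helper for illustration; not a CouchDB API.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        # Not JSON at all, so not one of the known error forms.
        return False
    if not isinstance(doc, dict):
        return False

    error = doc.get("error")
    reason = doc.get("reason")

    # First, third and fourth forms: the error field names the failure.
    if error in REPLICATION_FAILURES:
        return True

    # Second form: a generic internal_server_error whose reason names
    # the link failure.
    return error == "internal_server_error" and reason == "replication_link_failure"
```

A caller that gets True back would presumably want to start the replication again rather than restart the server.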

We'll work on a more fine-grained failure mode in the future.

Best, Adam
