Hi Matt, going to snip a bit here to keep the discussion manageable ...
On May 14, 2009, at 12:00 PM, Matt Goodall wrote:
When I tried things before writing my mail I was using two couchdb servers running from relatively recent versions of trunk. So 0.9 and a bit ;-). I didn't know about the ~10MB. I don't know if I reached that threshold, which may be why it seemed to start over each time. I'll try to retest with a lower threshold and more debugging to see what's really happening. Any help on where that hard-coded 10MB value is would be very helpful!
In line 205 of couch_rep.erl you should see
{NewBuffer, NewContext} = case couch_util:should_flush() of
should_flush() takes an argument which is a number of bytes. So changing that to
{NewBuffer, NewContext} = case couch_util:should_flush(1000) of
would cause the replicator to checkpoint after each kilobyte (on document boundaries, of course). You should see a line in the logfile on the machine initiating the replication like
"recording a checkpoint at source update_seq N"
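The threshold check itself is simple enough to sketch. Here's an illustration in Python (the real couch_util:should_flush/1 is Erlang and inspects the process's memory usage rather than taking an explicit byte count, so treat this as a simplified model of the decision, not the actual implementation):

```python
def should_flush(buffered_bytes, threshold=10 * 1024 * 1024):
    """Simplified model of couch_util:should_flush/1: signal a
    checkpoint once the replication buffer crosses the byte threshold.
    The default mirrors the hard-coded ~10MB value."""
    return buffered_bytes >= threshold

# With the default ~10MB threshold, a small buffer never triggers a flush:
print(should_flush(1000))        # False
# Lowering the threshold to 1000 makes it checkpoint after each kilobyte:
print(should_flush(1000, 1000))  # True
```

That's why a small replication may appear to restart from scratch after a failure: if the total transferred never reaches the threshold, no checkpoint is ever recorded.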
Others have commented that the 10MB threshold really needs to be configurable. E.g., set it to zero and you get per-document checkpoints, but your throughput will drop and the final DB size on the target will grow. Super easy to do, but no one's gotten around to it.

Presumably the threshold all depends on the quality of the network connection between the two endpoints, although having the default configurable is probably a good thing anyway.
I think a configurable default is an OK option, but what I'd really like to see is the checkpoint threshold added as an optional field to the JSON body sent in an individual POST to _replicate.
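To make the proposal concrete, the request body might look like this. To be clear, "checkpoint_threshold" is the field being proposed here; it does not exist in CouchDB today, and the name is just a placeholder:

```python
import json

# Hypothetical _replicate request body. "source" and "target" are the
# existing fields; "checkpoint_threshold" (bytes between checkpoints)
# is the proposed optional addition and NOT a real CouchDB option.
body = {
    "source": "http://remote.example.com:5984/mydb",
    "target": "mydb",
    "checkpoint_threshold": 1000,
}
print(json.dumps(body, sort_keys=True))
```

A per-request field would let a client replicating over a flaky link ask for frequent checkpoints while leaving the server-wide default alone.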
Secondly, if the network connection fails in the middle of replication (closing an ssh tunnel is a good way to test this ;-)) then it seems to retry a few (10) times before the replicator process terminates. If the network connection becomes available again (restart the ssh tunnel) the replicator doesn't seem to notice. Also, I just noticed that Futon still lists the replication on its status page.

That's correct, the replicator does try to ignore transient failures.

Hmm, it seemed to fail on transient failures here. After killing and restarting my ssh tunnel I left the replication a while and it never seemed to continue, and the only way to clear it from the status list was to restart the couchdb server. I'll check again though.
Ok, I misread you earlier. It's possible that CouchDB or ibrowse is trying to reuse a socket when it really should be opening a new one. That would be a bug.
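If that suspicion is right, the fix amounts to "build a fresh connection on every retry instead of reusing one that may have died with the tunnel." A sketch of that retry shape, in Python for illustration (retry_fresh is a hypothetical helper, not CouchDB or ibrowse code; connect stands in for whatever opens a new socket/worker):

```python
def retry_fresh(connect, retries=10):
    """Retry up to `retries` times, calling `connect` afresh on every
    attempt so each try gets a brand-new connection rather than a
    possibly-stale reused socket. `connect` is a zero-argument callable
    that returns a result or raises OSError on failure."""
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except OSError as exc:
            last_error = exc  # stale or refused connection; try again fresh
    raise last_error

# Usage: a connection that fails twice (tunnel down) then succeeds
# (tunnel restarted) recovers without operator intervention.
attempts = []
def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("connection refused")
    return "replication resumed"

print(retry_fresh(flaky_connect))  # replication resumed
```

The key property is that no state from a failed attempt leaks into the next one, which is exactly what reusing a dead socket would violate.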
If I'm correct, and I really hope I'm missing something, then couchdb's replication is probably not currently suitable for replicating anything but very small database differences over an unstable connection. Does anyone have any real experience in this sort of scenario?

Personally, I do not. I think the conclusion is a bit pessimistic, though.

Sorry, wasn't meaning to be pessimistic. Just trying to report honestly what I was seeing so it could be improved where possible.
Absolutely, that statement probably came off too confrontational. The more high-quality feedback like this we get the better off we'll be!

Cheers,
Adam
