Hi Matt, I'm going to snip a bit here to keep the discussion manageable ...

On May 14, 2009, at 12:00 PM, Matt Goodall wrote:

When I tried things before writing my mail I was using two CouchDB
servers running from relatively recent versions of trunk, so 0.9 and a
bit ;-).

I didn't know about the ~10MB threshold. I don't know if I ever reached
it, which may be why replication seemed to start over each time.
I'll try to retest with a lower threshold and more debugging to see
what's really happening. Any help on where that hard-coded 10MB value
lives would be very helpful!

On line 205 of couch_rep.erl you should see

    {NewBuffer, NewContext} = case couch_util:should_flush() of

should_flush() takes an argument specifying a number of bytes, so changing that line to

    {NewBuffer, NewContext} = case couch_util:should_flush(1000) of

would cause the replicator to checkpoint roughly every kilobyte (on document boundaries, of course). You should see a line like the following in the logfile on the machine initiating the replication:

"recording a checkpoint at source update_seq N"

Others have commented that the 10MB threshold really needs to be
configurable. E.g., set it to zero and you get per-document checkpoints, but your throughput will drop and the final DB size on the target will grow. It's super easy to do, but no one's gotten around to it.
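For whoever picks it up, a rough sketch of what I have in mind (the "replicator" section and "checkpoint_threshold" key are names I'm inventing here, and I'm assuming couch_config:get/3 with a string default) would be something like

    %% hypothetical helper in couch_rep.erl -- reads the threshold from the
    %% .ini config, falling back to a 10MB default when the key is missing
    checkpoint_threshold() ->
        list_to_integer(
            couch_config:get("replicator", "checkpoint_threshold", "10485760")).

and then the call above becomes

    {NewBuffer, NewContext} = case couch_util:should_flush(checkpoint_threshold()) of

Just a sketch, of course, not tested.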

Presumably the right threshold depends largely on the quality of the network
connection between the two endpoints, although having the default
configurable is probably a good thing anyway.

I think a configurable default is an OK option, but what I'd really like to see is the checkpoint threshold added as an optional field to the JSON body sent in an individual POST to _replicate.
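Something along these lines, say (where "checkpoint_threshold" is only a proposed field name, not something the replicator understands today):

    curl -X POST http://localhost:5984/_replicate \
        -H "Content-Type: application/json" \
        -d '{"source": "http://example.com:5984/foo", "target": "foo", "checkpoint_threshold": 1048576}'

That way a client on a flaky link could ask for frequent checkpoints without changing the server-wide default.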

Secondly, if the network connection fails in the middle of replication
(closing an ssh tunnel is a good way to test this ;-)) then it seems
to retry a few (10) times before the replicator process terminates. If
the network connection becomes available again (restart the ssh
tunnel) the replicator doesn't seem to notice. Also, I just noticed
that Futon still lists the replication on its status page.

That's correct, the replicator does try to ignore transient failures.

Hmm, it seemed to fail on transient failures here. After killing and
restarting my ssh tunnel I left the replication alone for a while and it
never seemed to continue, and the only way to clear it from the status
list was to restart the CouchDB server. I'll check again though.

Ok, I misread you earlier. It's possible that CouchDB or ibrowse is trying to reuse a socket when it really should be opening a new one. That would be a bug.

If I'm correct, and I really hope I'm missing something, then
CouchDB's replication is probably not currently suitable for
replicating anything but very small database differences over an
unstable connection. Does anyone have any real experience with this sort
of scenario?

Personally, I do not. I think the conclusion is a bit pessimistic, though.

Sorry, I wasn't meaning to be pessimistic, just trying to report
honestly what I was seeing so it can be improved where possible.

Absolutely, and my statement probably came off as too confrontational. The more high-quality feedback like this we get, the better off we'll be! Cheers,

Adam
