Hi Matt, going to snip a bit here to keep the discussion manageable ...
On May 14, 2009, at 12:00 PM, Matt Goodall wrote:
When I tried things before writing my mail I was using two couchdb servers running from relatively recent versions of trunk. So 0.9 and a bit ;-). I didn't know about the ~10MB. I don't know if I reached that threshold, which may be why it seemed to start over each time. I'll try to retest with a lower threshold and more debugging to see what's really happening. Any help on where that hard-coded 10MB value is would be very helpful!
In line 205 of couch_rep.erl you should see
{NewBuffer, NewContext} = case couch_util:should_flush() of
should_flush() takes an argument which is a number of bytes. So changing that to
{NewBuffer, NewContext} = case couch_util:should_flush(1000) of
would cause the replicator to checkpoint after each kilobyte (on document boundaries, of course). You should see a line in the logfile on the machine initiating the replication like
"recording a checkpoint at source update_seq N"
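The threshold check itself is simple enough to sketch. Here's an illustration in Python (the real couch_util:should_flush/1 is Erlang and inspects the process's memory usage rather than taking an explicit byte count, so treat this as a simplified model of the decision, not the actual implementation):

```python
def should_flush(buffered_bytes, threshold=10 * 1024 * 1024):
    """Simplified model of couch_util:should_flush/1: signal a
    checkpoint once the replication buffer crosses the byte threshold.
    The default mirrors the hard-coded ~10MB value."""
    return buffered_bytes >= threshold

# With the default ~10MB threshold, a small buffer never triggers a flush:
print(should_flush(1000))        # False
# Lowering the threshold to 1000 makes it checkpoint after each kilobyte:
print(should_flush(1000, 1000))  # True
```

That's why a small replication may appear to restart from scratch after a failure: if the total transferred never reaches the threshold, no checkpoint is ever recorded.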
Others have commented that the 10MB threshold really needs to be configurable. E.g., set it to zero and you get per-document checkpoints, but your throughput will drop and the final DB size on the target will grow. Super easy to do, but no one's gotten around to it.

Presumably the threshold all depends on the quality of the network connection between the two endpoints, although having the default configurable is probably a good thing anyway.
I think a configurable default is an OK option, but what I'd really like to see is the checkpoint threshold added as an optional field to the JSON body sent in an individual POST to _replicate.
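To make the proposal concrete, the request body might look like this. To be clear, "checkpoint_threshold" is the field being proposed here; it does not exist in CouchDB today, and the name is just a placeholder:

```python
import json

# Hypothetical _replicate request body. "source" and "target" are the
# existing fields; "checkpoint_threshold" (bytes between checkpoints)
# is the proposed optional addition and NOT a real CouchDB option.
body = {
    "source": "http://remote.example.com:5984/mydb",
    "target": "mydb",
    "checkpoint_threshold": 1000,
}
print(json.dumps(body, sort_keys=True))
```

A per-request field would let a client replicating over a flaky link ask for frequent checkpoints while leaving the server-wide default alone.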
Secondly, if the network connection fails in the middle of replication (closing an ssh tunnel is a good way to test this ;-)) then it seems to retry a few (10) times before the replicator process terminates. If the network connection becomes available again (restart the ssh tunnel) the replicator doesn't seem to notice. Also, I just noticed that Futon still lists the replication on its status page.

That's correct, the replicator does try to ignore transient failures.

Hmm, it seemed to fail on transient failures here. After killing and restarting my ssh tunnel I left the replication a while and it never seemed to continue, and the only way to clear it from the status list was to restart the couchdb server. I'll check again though.
Ok, I misread you earlier. It's possible that CouchDB or ibrowse is trying to reuse a socket when it really should be opening a new one. That would be a bug.
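If that suspicion is right, the fix amounts to "build a fresh connection on every retry instead of reusing one that may have died with the tunnel." A sketch of that retry shape, in Python for illustration (retry_fresh is a hypothetical helper, not CouchDB or ibrowse code; connect stands in for whatever opens a new socket/worker):

```python
def retry_fresh(connect, retries=10):
    """Retry up to `retries` times, calling `connect` afresh on every
    attempt so each try gets a brand-new connection rather than a
    possibly-stale reused socket. `connect` is a zero-argument callable
    that returns a result or raises OSError on failure."""
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except OSError as exc:
            last_error = exc  # stale or refused connection; try again fresh
    raise last_error

# Usage: a connection that fails twice (tunnel down) then succeeds
# (tunnel restarted) recovers without operator intervention.
attempts = []
def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("connection refused")
    return "replication resumed"

print(retry_fresh(flaky_connect))  # replication resumed
```

The key property is that no state from a failed attempt leaks into the next one, which is exactly what reusing a dead socket would violate.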
If I'm correct, and I really hope I'm missing something, then couchdb's replication is probably not currently suitable for replicating anything but very small database differences over an unstable connection. Does anyone have any real experience in this sort of scenario?

Personally, I do not. I think the conclusion is a bit pessimistic, though.

Sorry, wasn't meaning to be pessimistic. Just trying to report honestly what I was seeing so it could be improved where possible.
Absolutely, that statement probably came off too confrontational. The more high-quality feedback like this we get the better off we'll be!

Cheers,
Adam
