On 21 Oct 2009, at 15:08, Adam Kocoloski wrote:
On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:
Hi,
On Monday, 19 Oct 2009, at 10:04 -0400, Paul Davis wrote:
Also, you might try setting up the continuous replication instead of the update notifications, as that might be a bit more ironed out.
I have already considered that, though as long as there is no way to figure out whether a continuous replication is still up and running I cannot use it, because I have to restart it when a node fails and comes back up again later.
Hmm. Doesn't the _local doc for the continuous replication show whether it's still in progress? Oh, though it might not have a specific flag indicating that.
I changed the system to use continuous replication and check the _local doc to make sure it's still running. That way everything works fine and I cannot reproduce any hangs.
Though in the logs I now see lots of these messages:

[info] [<0.164.0>] A server has restarted sinced replication start. Not recording the new sequence number to ensure the replication is redone and documents reexamined.

I posted this in IRC yesterday and was told that this is nothing to worry about. So what exactly does it mean, and why is it logged at info level when it can be ignored?
If that message is nothing critical I would suggest logging it at debug level, as it shows up at every replication checkpoint on every node as soon as one of the other nodes has been offline.
So, what we're trying to do here is avoid skipping updates from the
source server. Consider the following sequence of events:
1) Save some docs to the source with delayed_commits=true
2) Replicate source -> target
3) Restart the source before a full commit, losing the updates that have already been replicated
4) Save more docs to the source, overwriting the previously used sequence numbers
If that happens, we don't want the replicator to skip the new docs
that have been saved in step 4. So if we detect that a server
restarted, we play it safe and don't checkpoint, so that the next
replication will re-examine the sequence. An analogous situation
could happen with the target losing updates that the replicator had
written (but not fully committed).
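(As an aside, here is a sketch in Python 3 of step 1 and of one way to close that window from the client side, assuming the standard CouchDB HTTP API: with delayed commits a write is acknowledged before it is flushed to disk, so a restart can drop it; POSTing to /db/_ensure_full_commit, or sending an X-Couch-Full-Commit header on the write, forces the flush. The database and document names below are placeholders.)

    import json
    import urllib.request

    COUCH = "http://127.0.0.1:5984"
    DB = "source_db"  # placeholder database name

    def couch(method, path, body=None, headers=None):
        """Tiny JSON helper around urllib for talking to CouchDB."""
        data = json.dumps(body).encode("utf-8") if body is not None else None
        hdrs = {"Content-Type": "application/json"}
        hdrs.update(headers or {})
        req = urllib.request.Request(COUCH + path, data=data,
                                     headers=hdrs, method=method)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Step 1 of the scenario: with delayed_commits=true the write below is
    # acknowledged before it is flushed to disk, so a restart of the source
    # can lose it (and later writes may reuse its sequence number).
    couch("PUT", "/%s/example-doc" % DB, {"value": 42})

    # One way to close the window: ask the server to flush everything
    # written so far before relying on it having survived.
    couch("POST", "/%s/_ensure_full_commit" % DB)

    # Or opt out of delayed commits for an individual write by sending the
    # X-Couch-Full-Commit header along with the request.
    couch("PUT", "/%s/another-doc" % DB, {"value": 43},
          headers={"X-Couch-Full-Commit": "true"})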
Skipping checkpointing altogether for the remainder of the
replication is an overly conservative position. In my opinion what
we should do when we detect this condition is restart the
replication immediately from the last known checkpoint. Then you'd
see one of these [info] level messages telling you that the
replicator is going to restart to double-check some sequence
numbers, and that's it.
Best, Adam
Adam, this mail is great Wiki material. Can you (or anyone) find a
place for it on the wiki for future reference?
Cheers
Jan