On 21 Oct 2009, at 15:08, Adam Kocoloski wrote:
On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:
Hi,
On Monday, 19 Oct 2009, at 10:04 -0400, Paul Davis wrote:
Also, you might try setting up the continuous replication instead of the update notifications, as that might be a bit more ironed out.
I have already considered that, though as long as there is no way to figure out whether a continuous replication is still up and running I cannot use it, because I have to restart it when a node fails and comes back up again later.
Hmm. Doesn't the _local doc for the continuous replication show whether it's still in progress? Oh, though it might not have a specific flag indicating that.
I changed the system to use continuous replication and check the _local doc to make sure it's still running. That way everything works fine and I cannot reproduce any hangs.
Though in the logs I now see lots of these messages:

[info] [<0.164.0>] A server has restarted sinced replication start. Not recording the new sequence number to ensure the replication is redone and documents reexamined.

I posted this in IRC yesterday and was told that this is nothing to worry about. So what exactly does it mean, and why is it logged at info level when it can be ignored?
If that message is nothing critical I would suggest logging it at debug level, as it shows up at every replication checkpoint on every node as soon as one of the other nodes has been offline.
So, what we're trying to do here is avoid skipping updates from the
source server. Consider the following sequence of events:
1) Save some docs to the source with delayed_commits=true
2) Replicate source -> target
3) Restart the source before a full commit, losing the updates that have already been replicated
4) Save more docs to the source, overwriting the previously used sequence numbers
If that happens, we don't want the replicator to skip the new docs
that have been saved in step 4. So if we detect that a server
restarted, we play it safe and don't checkpoint, so that the next
replication will re-examine the sequence. An analogous situation
could happen with the target losing updates that the replicator had
written (but not fully committed).
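(As an aside, here is a sketch in Python 3 of step 1 and of one way to close that window from the client side, assuming the standard CouchDB HTTP API: with delayed commits a write is acknowledged before it is flushed to disk, so a restart can drop it; POSTing to /db/_ensure_full_commit, or sending an X-Couch-Full-Commit header on the write, forces the flush. The database and document names below are placeholders.)

    import json
    import urllib.request

    COUCH = "http://127.0.0.1:5984"
    DB = "source_db"  # placeholder database name

    def couch(method, path, body=None, headers=None):
        """Tiny JSON helper around urllib for talking to CouchDB."""
        data = json.dumps(body).encode("utf-8") if body is not None else None
        hdrs = {"Content-Type": "application/json"}
        hdrs.update(headers or {})
        req = urllib.request.Request(COUCH + path, data=data,
                                     headers=hdrs, method=method)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Step 1 of the scenario: with delayed_commits=true the write below is
    # acknowledged before it is flushed to disk, so a restart of the source
    # can lose it (and later writes may reuse its sequence number).
    couch("PUT", "/%s/example-doc" % DB, {"value": 42})

    # One way to close the window: ask the server to flush everything
    # written so far before relying on it having survived.
    couch("POST", "/%s/_ensure_full_commit" % DB)

    # Or opt out of delayed commits for an individual write by sending the
    # X-Couch-Full-Commit header along with the request.
    couch("PUT", "/%s/another-doc" % DB, {"value": 43},
          headers={"X-Couch-Full-Commit": "true"})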
Skipping checkpointing altogether for the remainder of the
replication is an overly conservative position. In my opinion what
we should do when we detect this condition is restart the
replication immediately from the last known checkpoint. Then you'd
see one of these [info] level messages telling you that the
replicator is going to restart to double-check some sequence
numbers, and that's it.
Best, Adam
Adam, this mail is great Wiki material. Can you (or anyone) find a
place for it on the wiki for future reference?
Cheers
Jan