On Mon, Oct 19, 2009 at 9:48 AM, Simon Eisenmann <[email protected]> wrote:
> Hi Paul,
>
> thanks for your feedback!
>
> On Monday, 19.10.2009 at 09:40 -0400, Paul Davis wrote:
>> Are there any tracebacks in the logs that you can paste? I don't think
>> I've heard of replication getting wedged without some sort of
>> feedback.
>
> Unfortunately there was no error in the logs or on stderr. Also, any
> further replication request hangs as well (never completes). The
> last entry is always "recording a checkpoint at source update_seq ...".
>
> Please note that this is reproducible, meaning it happens every time,
> though the time frame varies.
>
>> Also, are you using continuous replication then? I do know that just
>> before the 0.10.0 release that Adam Kocoloski and Robert Newson spent
>> a good amount of time getting star (all nodes replicate continuously
>> to all others) kinks ironed out. Or maybe it was a ring. I dunno, but
>> there was work on something like that.
>
> I am not using continuous replication but an update notification process
> that triggers pull replication on the other nodes from the database that
> was changed. Your point regarding rings is interesting. In general that
> would explain it. Though in the case of a ring I would have multiple hanging
> replications at the same time, correct? It always starts with one
> direction hanging. The other way around usually works just fine until it
> hangs some time (hours) later.
>
> Also, I have tested this with a couple of SVN revisions before the 0.10.0
> release and things have improved a lot since the first tests. Though now I
> have much more data (database update sequence in the millions range).
>
> Best regards
> Simon
>
>>
>> Paul Davis
> --
> Simon Eisenmann
>
> [ mailto:[email protected] ]
>
> [ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
> [ T. +49.711.896656.68 | F.+49.711.89665610 ]
> [ http://www.struktur.de | mailto:[email protected] ]
>
Simon,

Hmmm, that sounds most odd. Is there any consistency in when it hangs?
Specifically, does it look like a poison doc is causing things to go
wonky, or some such? Do nodes fail in a specific order?

Also, you might try setting up continuous replication instead of the
update notifications, as that might be a bit more ironed out.

Another thing to check is whether it's just the task status that's wonky
rather than the actual replication. You can check the _local doc that's
created by replication to see if its update seq is changing while the
task statuses aren't.

Paul Davis
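The two suggestions above (continuous replication, and comparing the
replication checkpoint against the task status) could be sketched roughly
as follows. This is only a sketch under assumptions, not a tested recipe:
the helper names are made up for illustration, the checkpoint field is
assumed to be "source_last_seq", and the id of the _local checkpoint doc
has to be found separately since it is not given in this thread.

```python
import json
import urllib.request


def continuous_replication_body(source, target):
    """Build the JSON body for POST /_replicate with continuous mode on."""
    return json.dumps({"source": source, "target": target, "continuous": True})


def fetch_json(url):
    """GET a URL and decode the JSON response (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def checkpoint_seq(checkpoint_doc):
    """Read the recorded source sequence from a replication checkpoint doc.

    Assumption: the _local checkpoint doc stores it as "source_last_seq".
    """
    return checkpoint_doc.get("source_last_seq")


def diagnose(seq_before, seq_after, status_before, status_after):
    """Compare two samples taken some minutes apart.

    seq_*    : checkpoint sequence from the _local doc
    status_* : status text of the replication entry in /_active_tasks
    """
    seq_moved = seq_before != seq_after
    status_moved = status_before != status_after
    if seq_moved and not status_moved:
        return "replicating; only the task status looks stale"
    if seq_moved:
        return "replicating"
    return "no checkpoint progress; replication may really be hung"
```

To use it, sample the checkpoint doc on the target database and
/_active_tasks on the node running the replication twice, a few minutes
apart, and pass the four values to diagnose(). If the checkpoint
sequence advances while the status text stays put, the hang is probably
cosmetic.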
