Hello Slony-I community,
Hoping someone can advise on a strange and serious problem. We performed a
Slony failover yesterday and, for the first time ever, the FAILOVER operation
errored out. We recently expanded our cluster to 7 consumers fed by a single
provider. There are no load issues during normal operations. As the error
output below shows, though, our node 4 and node 5 consumers never received the
events they needed. Here’s where it gets weird: closer inspection shows that
the node 2->4 and node 2->5 path data went missing from the service at some
point. That seems to be the main issue, yet both node 4 and node 5 continued
to find and process node 2 SYNC events for a full week! The logs show this
continued across multiple restarts.
How can this happen? If missing path data stymies the failover, wouldn’t it
also prevent normal SYNC processing?
If a failover is started with incomplete path data, what’s the best
resolution? Can the missing path data be added quickly so the failover can
succeed?
Thanks in advance for any insights.
---- failover error ----
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE:
calling restart node 1
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55: 2017-06-26
18:33:02
executing preFailover(1,1) on 2
executing preFailover(1,1) on 3
executing preFailover(1,1) on 4
executing preFailover(1,1) on 5
executing preFailover(1,1) on 6
executing preFailover(1,1) on 7
executing preFailover(1,1) on 8
NOTICE: executing "_ams_cluster".failedNode2 on node 2
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 8 only on event 5000061654, node 4 only on event
5000061654, node 5 only on event 5000061655, node 3 only on event 5000061662,
node 6 only on event 5000061654, node 7 only on event 5000061656
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061657, node 5 only on event
5000061663, node 3 only on event 5000061663, node 6 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663, node 6 only on event 5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
/tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for
event (2,5000061664). node 4 only on event 5000061663, node 5 only on event
5000061663
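To reduce the failover output above to numbers, here is a minimal Python
sketch that reports how far each node is behind the awaited event in one
"waiting for event" message. It is only an illustration, not part of Slony or
our tooling: the function name lagging_nodes is hypothetical, and it assumes
the message text looks exactly like the output above, with wrapped lines
passed in together as one string.

```python
import re

# Fragments such as "node 4 only on event 5000061663".
LAG_RE = re.compile(r"node\s+(\d+)\s+only on event\s+(\d+)")
# The awaited event, e.g. "waiting for event (2,5000061664)".
WAIT_RE = re.compile(r"waiting for\s+event\s+\((\d+),\s*(\d+)\)")


def lagging_nodes(message):
    """Return (origin, awaited_seq, {node_id: events_behind}) or None.

    `message` is one complete "waiting for event" report; it may still
    contain the hard line wraps (and stray wrap backslashes) of a paste.
    """
    # Collapse line wraps and wrap markers into single spaces.
    text = " ".join(message.replace("\\", " ").split())
    wait = WAIT_RE.search(text)
    if wait is None:
        return None  # not a "waiting for event" message
    origin, awaited = int(wait.group(1)), int(wait.group(2))
    behind = {int(node): awaited - int(seq)
              for node, seq in LAG_RE.findall(text)}
    return origin, awaited, behind


if __name__ == "__main__":
    # The final message from the failover output, wraps included.
    sample = (
        "waiting for\n"
        "event (2,5000061664). node 4 only on event 5000061663, "
        "node 5 only on event\n"
        "5000061663"
    )
    print(lagging_nodes(sample))
```

Applied to the last message above, this shows node 4 and node 5 each one event
short of (2,5000061664), which matches what the repeated output suggests.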
---- node 4 log archive ----
bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
2017-06-15 15:14:00 UTC [5688] INFO localListenThread: got restart
notification
2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
pa_conninfo="dbname=ams
2017-06-15 15:53:00 UTC [8431] INFO localListenThread: got restart
notification
2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2 pa_client=4
pa_conninfo="dbname=ams
2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2 pa_client=4
pa_conninfo="dbname=ams
2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4
pa_conninfo="dbname=ams
2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
2017-06-19 15:11:45 UTC [2707] INFO localListenThread: got restart
notification
2017-06-20 18:40:15 UTC [31224] INFO localListenThread: got restart
notification
2017-06-21 14:31:42 UTC [6253] INFO localListenThread: got restart
notification
2017-06-21 14:35:26 UTC [32367] INFO localListenThread: got restart
notification
2017-06-26 18:21:25 UTC [9278] INFO localListenThread: got restart
notification
2017-06-26 18:33:04 UTC [28839] INFO localListenThread: got restart
notification
2017-06-26 18:33:30 UTC [1785] INFO localListenThread: got restart
notification
bos-mpt5c:odin-9353 ttignor$
---- node 5 log archive ----
bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
2017-06-15 15:13:56 UTC [20700] INFO localListenThread: got restart
notification
2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2 pa_client=5
pa_conninfo="dbname=ams
2017-06-15 15:53:01 UTC [20374] INFO localListenThread: got restart
notification
2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5
pa_conninfo="dbname=ams
2017-06-16 17:28:19 UTC [2859] INFO localListenThread: got restart
notification
2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2 pa_client=5
pa_conninfo="dbname=ams
2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
2017-06-19 15:11:40 UTC [10753] INFO localListenThread: got restart
notification
2017-06-20 18:40:11 UTC [450] INFO localListenThread: got restart notification
2017-06-21 14:31:41 UTC [22300] INFO localListenThread: got restart
notification
2017-06-21 14:35:28 UTC [26777] INFO localListenThread: got restart
notification
2017-06-26 18:21:27 UTC [28366] INFO localListenThread: got restart
notification
2017-06-26 18:33:04 UTC [29345] INFO localListenThread: got restart
notification
2017-06-26 18:33:27 UTC [1299] INFO localListenThread: got restart
notification
bos-mpt5c:odin-9353 ttignor$
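In both excerpts above, the last storePath entry for server 2 is logged before
the disableNode entry on 2017-06-19, and none appears afterwards. A rough
Python sketch of that check follows. It assumes every log entry starts with
the "timestamp UTC [pid] LEVEL" layout seen above, and path_events is a
hypothetical helper name, not a Slony tool.

```python
import re
from datetime import datetime

# Start of a log entry, e.g.
# "2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2"
ENTRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \[\d+\] (\w+) (.*)$"
)


def path_events(lines, server):
    """Return (last storePath time, last disableNode time) for `server`.

    Either value is None when no matching entry is found.
    """
    last_store = disabled = None
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m is None:
            continue  # wrapped continuation line or shell prompt
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        body = m.group(3)
        if body.startswith(f"storePath: pa_server={server} "):
            last_store = when
        elif body == f"disableNode: no_id={server}":
            disabled = when
    return last_store, disabled


if __name__ == "__main__":
    # A few entries copied from the node 4 excerpt, wraps included.
    excerpt = [
        "2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4",
        'pa_conninfo="dbname=ams',
        "2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2",
        "2017-06-19 15:11:45 UTC [2707] INFO localListenThread: got restart",
        "notification",
    ]
    store, disabled = path_events(excerpt, server=2)
    # True when the node was disabled and no path was stored afterwards.
    no_path_since = disabled is not None and (store is None or store < disabled)
    print(store, disabled, no_path_since)
```

The script only confirms the ordering of the log entries; it does not explain
why no storePath was logged after the disableNode.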
Tom ☺