Hi Bryan,

I briefly looked through the provided logs. Based on the information reported there, "node1" (which was in the MASTER role) sent the replicated data to both node2 and node3 and ended replication with a feeder VLSN of 7,024:
2020-01-24 10:59:59,554 DEBUG [ServiceDispatcherPool_dixon01(-1)] (c.s.j.r.i.n.MasterTransfer) - dixon01 Start Master Transfer for 60000 msec, targeting: [dixon02]
2020-01-24 10:59:59,554 DEBUG [ServiceDispatcherPool_dixon01(-1)] (c.s.j.r.i.n.MasterTransfer) - dixon01 Master Transfer progress: dixon02, 7,023, phase: 2, endVLSN: 7,023
...
2020-01-24 11:00:00,195 DEBUG [Feeder Output for dixon02] (c.s.j.r.i.n.Feeder) - dixon01 Feeder output for replica dixon02 shutdown. feeder VLSN: 7,024 currentTxnEndVLSN: 7,023
2020-01-24 11:00:00,195 DEBUG [Feeder Output for dixon03] (c.s.j.r.i.n.Feeder) - dixon01 Feeder output for replica dixon03 shutdown. feeder VLSN: 7,024 currentTxnEndVLSN: 7,023
...
2020-01-24 11:00:00,195 DEBUG [MASTER dixon01(1)] (c.s.j.r.i.n.FeederManager) - dixon01 Feeder manager exited. CurrentTxnEnd VLSN: 7,023

Based on the logs collected for "node2", it seems that node2 received the replicated data up to VLSN 7,023:

2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) - dixon02 ReplicaOutputThread soft shutdown initiated.
2020-01-24 11:00:00,196 DEBUG [ReplicaOutputThread] (c.s.j.r.i.RepImpl) - dixon02 Thread[ReplicaOutputThread,5,main]exited. Singleton acks sent:49 Grouped acks sent:4 Max pending acks:3
2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) - dixon02 Exiting inner Replica loop.
2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) - dixon02 Replica stats - Lag waits: 0 Lag wait time: 0ms. VLSN waits: 0 Lag wait time: 0ms.

After becoming master, node2 resumed from VLSN 7,024 as expected and started feeding replication from VLSN 7,024 to both node3 and node1:

2020-01-24 11:00:00,196 DEBUG [Feeder Input] (c.s.j.r.s.FeederReplicaHandshake) - dixon02 Feeder-replica dixon03 handshake completed. Replica dixon03 Versions JE: 7.4.5 Log: 14 Protocol: 6 Stream Log: 13
2020-01-24 11:00:00,196 DEBUG [Feeder Input for dixon03] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 syncup started. Feeder range: first=1 last=7,023 sync=7,023 txnEnd=7,023
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon03] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 start stream at VLSN: 7,024
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon03] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 syncup ended. Elapsed time: 15ms
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input] (c.s.j.r.s.FeederReplicaHandshake) - dixon02 Feeder-replica dixon01 handshake completed. Replica dixon01 Versions JE: 7.4.5 Log: 14 Protocol: 6 Stream Log: 13
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 syncup started. Feeder range: first=1 last=7,023 sync=7,023 txnEnd=7,023
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 start stream at VLSN: 7,024
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01] (c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 syncup ended. Elapsed time: 0ms

The logs above suggest that replication was successful: node2 should have received the 50 messages published into the "app_test" queue on "dixonbroker".

Just out of curiosity (I did not find this in the logs): which sync policies for the master/replicas and which acknowledgement policy are configured in your group? I would like to verify that the replicated data is properly synced to disk. Perhaps you might need to use a stronger syncing policy.
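To illustrate what I mean by "stronger", below is a minimal sketch using the underlying BDB JE API (com.sleepycat.je.Durability). It only shows the three knobs involved - the local (master) sync policy, the replica sync policy, and the replica acknowledgement policy - and is not meant to show how Broker-J itself wires these settings up:

import com.sleepycat.je.Durability;
import com.sleepycat.je.Durability.ReplicaAckPolicy;
import com.sleepycat.je.Durability.SyncPolicy;

public class DurabilityExample {

    public static void main(String[] args) {
        // Relaxed policy: commits are not fsynced on the master or the replicas,
        // so a node that is killed shortly after a commit can lose the tail of
        // its transaction log even though the commit was acknowledged.
        Durability relaxed = new Durability(SyncPolicy.NO_SYNC,
                                            SyncPolicy.NO_SYNC,
                                            ReplicaAckPolicy.SIMPLE_MAJORITY);

        // Stronger policy: the master fsyncs each commit, replicas at least
        // write it to the file system, and a simple majority of the group must
        // acknowledge the commit.
        Durability stricter = new Durability(SyncPolicy.SYNC,
                                             SyncPolicy.WRITE_NO_SYNC,
                                             ReplicaAckPolicy.SIMPLE_MAJORITY);

        System.out.println("relaxed : " + relaxed);
        System.out.println("stricter: " + stricter);
    }
}

In JE the same three values are commonly written as a single comma-separated string (for example NO_SYNC,NO_SYNC,SIMPLE_MAJORITY), which is roughly the shape of the setting I am asking about.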
Kind Regards,
Alex

On Fri, 24 Jan 2020 at 16:30, Bryan Dixon <[email protected]> wrote:
> We've been using Broker-J 7.0.2 BDB_HA successfully (on Windows servers)
> for a couple years now without issue. I was working on upgrading to
> Broker-J 7.1.6 and performing some tests in our Test Broker-J cluster
> prior to the upgrade and noticed that BDB_HA message replication wasn't
> working properly and we've never had an issue like this before. By not
> working properly, what is happening is when publishing messages to the
> master virtual host node and then forcing (via the admin console) another
> virtual host node to become the master, the messages I just added are not
> in the queue in the new master. And if I publish messages in the new
> master, those messages aren't in the queue in the next node I make the
> master.
>
> One difference in our Test environment from our Dev and Production
> environment (which all use 3 nodes/servers) is that a couple months ago
> we replaced 2 nodes (2 VMs - swapped out Windows 2008 Server with Windows
> 2016 Server for those 2 VMs). For that process I just changed the Virtual
> Host Node to add the 2 new VMs and then removed the old nodes (old
> Windows2008 VMs). Everything appeared to be fine after that change but I
> didn't run the tests I am running now so perhaps that is what is causing
> this issue.
>
> To see if this replication issue was for our existing BDB_HA virtual host
> (the VH that existed prior to the Windows VM replacement discussed
> previously) I created a brand new virtual host (dixonbroker) in our Test
> Qpid environment and added a virtual host node on each of our VMs in Test.
> I am still getting the same behavior where messages aren't replicating
> correctly to each node when that node becomes master. To help with
> troubleshooting I used this new dixonbroker virtual host for the tests
> I've documented below.
>
> I've attached log files from all 3 nodes with some debugging enabled (I
> can add more debugging and run the tests again if it helps).
>
> Here are the steps of my test:
> 1) Node1 was master. I created a new durable queue named app_test. I have
> a little Java app using qpid-jms-client .31 (I tried .44 and it still
> produced the same bad results). I use Spring's JmsTemplate to publish 50
> messages. This was done at 11:00 (to help see the flow in the log files).
> 2) I made Node2 the master and there were no messages in app_test queue
> (see log message 2020-01-24 11:00:00,243 INFO [Queue Recoverer : app_test
> (vh: dixonbroker)] (q.m.t.recovered) - [Broker]
> [vh(/dixonbroker)/ms(ProvidedBDBMessageStore)] TXN-1005 : Recovered 0
> messages for queue app_test in the node2 log file).
> 3) I made Node3 the master. The 50 messages DID exist in the app_test
> queue. I then added 50 more messages to app_test queue (so total of 100)
> while Node3 was the master. This was at 11:02.
> 4) I made Node1 the master. It only had 50 messages in app_test queue -
> should have had 100.
> 5) I made Node2 the master. It still didn't have any messages in app_test
> queue - should have had 100.
>
> qpid-node1.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node1.log>
> qpid-node2.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node2.log>
> qpid-node3.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node3.log>
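For reference, the publisher described in step 1 of the quoted test can be approximated with a few lines of code. This is only a minimal sketch, assuming Spring's JmsTemplate on top of the qpid-jms-client; the connection URL, credentials, queue name and message bodies are illustrative rather than taken from the actual test application:

import javax.jms.DeliveryMode;

import org.apache.qpid.jms.JmsConnectionFactory;
import org.springframework.jms.core.JmsTemplate;

public class AppTestPublisher {

    public static void main(String[] args) {
        // Illustrative AMQP 1.0 connection URL and credentials; point this at
        // whichever node currently holds the master role.
        JmsConnectionFactory factory =
                new JmsConnectionFactory("guest", "guest", "amqp://dixon01:5672");

        JmsTemplate template = new JmsTemplate(factory);
        // Explicitly send persistent messages so they are written to the
        // virtual host's message store; only persistent messages are
        // recovered after a mastership change.
        template.setExplicitQosEnabled(true);
        template.setDeliveryMode(DeliveryMode.PERSISTENT);

        for (int i = 0; i < 50; i++) {
            final int n = i;
            template.send("app_test", session -> session.createTextMessage("test message " + n));
        }
    }
}

Spring's JmsTemplate sends persistent messages by default, so the explicit QoS settings above mainly make that intent visible.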
