Looks related to me: https://issues.apache.org/jira/browse/ARTEMIS-256
On Mon, Oct 19, 2015 at 4:04 AM, Mihkel Nõges <mihkel.no...@transferwise.com> wrote: > Basic flow of getting unresponsive failback cluster: > Have machine with Ubuntu 14.04.3 > > 1. Install libaio1, Java 1.8.0_60, maven 3.3.3, download and extract > apache-artemis-1.1.0-bin > > <http://www.eu.apache.org/dist/activemq/activemq-artemis/1.1.0/apache-artemis-1.1.0-bin.tar.gz> > in > /opt > 2. run $ mvn -Prelease install and $ mnv verify in > /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback > SUCCESS > 3. Clean data folders and starts both servers manually: > $ > cd > /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target > $ rm -R server0/data/ > $ rm -R server1/data/ > $ server0/bin/artemis-service start > Starting artemis-service > artemis-service is now running (7154) > $ server1/bin/artemis-service start > Starting artemis-service > artemis-service is now running (7180) > 4. Kill master server and wait for slave to take over > $ kill -9 7154 > > $ tail -f server1/log/artemis.log > 08:52:54,798 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-stomp-protocol]. Adding protocol support > for: STOMP > 08:53:02,145 INFO [org.apache.activemq.artemis.core.server] AMQ221109: > Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting > live to fail before it gets active > 08:53:03,582 INFO [org.apache.activemq.artemis.core.server] AMQ221024: > Backup server > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is > synchronized with live-server. > 08:53:03,777 INFO [org.apache.activemq.artemis.core.server] AMQ221031: > backup announced > 08:55:59,292 INFO [org.apache.activemq.artemis.core.server] AMQ221037: > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to > become 'live' > 08:55:59,302 WARN [org.apache.activemq.artemis.core.client] AMQ212004: > Failed to connect to server. > 08:55:59,778 INFO [org.apache.activemq.artemis.core.server] AMQ221003: > trying to deploy queue jms.queue.exampleQueue > 08:55:59,829 WARN [org.apache.activemq.artemis.core.client] AMQ212034: > There are more than one servers on the network broadcasting the same node > id. You will see this message exactly once (per node) if a node is > restarted, in which case it can be safely ignored. But if it is logged > continuously it means you really do have more than one node on the same > network active concurrently with the same node id. This could occur if you > have a backup node active at the same time as its live node. > nodeID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 > 08:55:59,836 INFO [org.apache.activemq.artemis.core.server] AMQ221007: > Server is now live > 08:55:59,869 INFO [org.apache.activemq.artemis.core.server] AMQ221020: > Started Acceptor at broker3:61617 for protocols > [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE] > 5. > > Start master again and observer the logs: > $ server0/bin/artemis-service start > Starting artemis-service > artemis-service is now running (7388) > > $ tail -f server0/log/artemis.log > 08:57:24,625 INFO [org.apache.activemq.artemis.core.server] AMQ221012: > Using AIO Journal > 08:57:24,694 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-server]. Adding protocol support for: CORE > 08:57:24,702 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-amqp-protocol]. Adding protocol support > for: AMQP > 08:57:24,731 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-hornetq-protocol]. Adding protocol support > for: HORNETQ > 08:57:24,733 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-mqtt-protocol]. Adding protocol support > for: MQTT > 08:57:24,743 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-openwire-protocol]. Adding protocol support > for: OPENWIRE > 08:57:24,878 INFO [org.apache.activemq.artemis.core.server] AMQ221043: > Protocol module found: [artemis-stomp-protocol]. Adding protocol support > for: STOMP > 08:57:25,082 INFO [org.apache.activemq.artemis.core.server] AMQ221109: > Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting > live to fail before it gets active > 08:57:27,043 INFO [org.apache.activemq.artemis.core.server] AMQ221024: > Backup server > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is > synchronized with live-server. > 08:57:27,948 INFO [org.apache.activemq.artemis.core.server] AMQ221031: > backup announced > 08:57:31,227 WARN [org.apache.activemq.artemis.core.client] AMQ212037: > Connection failure has been detected: AMQ119015: The connection was > disconnected because of server shutdown [code=DISCONNECTED] > 08:57:31,252 WARN [org.apache.activemq.artemis.core.client] AMQ212037: > Connection failure has been detected: AMQ119015: The connection was > disconnected because of server shutdown [code=DISCONNECTED] > 08:57:31,307 WARN [org.apache.activemq.artemis.core.client] AMQ212037: > Connection failure has been detected: AMQ119015: The connection was > disconnected because of server shutdown [code=DISCONNECTED] > 08:57:31,339 INFO [org.apache.activemq.artemis.core.server] AMQ221037: > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to > become 'live' > 08:57:31,360 WARN [org.apache.activemq.artemis.core.client] AMQ212004: > Failed to connect to server. > 08:57:31,413 ERROR [org.apache.activemq.artemis.core.server] AMQ224008: > Failed to store id: java.lang.IllegalStateException: Cannot find add info 1 > at > org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799) > [artemis-journal-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183) > [artemis-journal-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79) > [artemis-journal-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:1784) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1625) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) > [artemis-server-1.1.0.jar:1.1.0] > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60] > 08:57:31,540 WARN [org.apache.activemq.artemis.core.server] AMQ222173: > Queue jms.queue.exampleQueue is duplicated during reload. This queue will > be renamed as jms.queue.exampleQueue-0 > 08:57:31,550 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: > Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had > already been created > at > org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1631) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) > [artemis-server-1.1.0.jar:1.1.0] > at > org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) > [artemis-server-1.1.0.jar:1.1.0] > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60] > > > On 19 October 2015 at 10:31, Mihkel Nõges <mihkel.no...@transferwise.com> > wrote: > >> Hi Clebert, >> >> I do not have other code to share with you but the example code in Artemis >> 1.1.0 binary deployment package. I'm running >> org.apache.activemq.artemis.jms.example.ReplicatedFailbackExample >> >> And only commented out the serverStart and killServer calls which I am >> doing manually. >> >> I do not think I do any of the steps too fast as I tail the server log >> files in parallel and see everything is finished when I start the fail >> back. I have waited many minutes in between. >> >> Only changes in configuration to the test is changing localhost addresses >> with broker3 to make the cluster accessible remotely. >> >> BR! >> MIhkel >> >> On 18 October 2015 at 17:49, Clebert <clebert.suco...@gmail.com> wrote: >> >>> Im not on my computer now but it sounds like you are doing a fail back >>> immediately after failed over. It takes some time (seconds) to the server >>> to activate on the backup. >>> >>> Later the server will need to copy the data back before it can be >>> activated in fail back mode. >>> >>> It sounds the live is not reaching backup for fail back. >>> >>> I will try looking it at it on Monday. Maybe you could post your example >>> at your GitHub fork. >>> >>> -- Clebert Suconic typing on the iPhone. >>> >>> > On Oct 18, 2015, at 08:15, Mihkel Nõges <mihkel.no...@transferwise.com> >>> wrote: >>> > >>> > Hello again! >>> > >>> > I would be very grateful If someone could answer my questions. We need >>> the high availability to work to use the broker in production. >>> > >>> > When I run the replicated-failback example in one machine (broker3) it >>> succeeds. >>> > >>> > It fails when I run the same test - exactly the same servers with >>> slightly modified client remotely. >>> > >>> > I run client in debug mode from my IDE with commented out serverStart >>> and killServer calls. >>> > Deleted data folders and started the servers: >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> rm -R server0/data/ >>> > >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> rm -R server1/data/ >>> > >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> server0/bin/artemis-service start >>> > >>> > Starting artemis-service >>> > >>> > artemis-service is now running (23357) >>> > >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> server1/bin/artemis-service start >>> > >>> > Starting artemis-service >>> > >>> > artemis-service is now running (23383) >>> > >>> > Starting client and stopping on breakpoint at line 103: >>> > //ServerUtil.killServer(server0); >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail >>> as failover to the >>> > // backup server has occurred >>> > try { >>> > message0.acknowledge(); //line 103 >>> > killing server0 >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> kill -9 23357 >>> > >>> > Proceeding to breakpoint at line 121: >>> > //server0 = ServerUtil.startServer(args[0], >>> ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000); >>> > >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail >>> as failover to the >>> > // backup server has occurred >>> > try { >>> > message0.acknowledge(); // line 121 >>> > Starting server0: >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ >>> server0/bin/artemis-service start >>> > >>> > Starting artemis-service >>> > >>> > artemis-service is now running (24240) >>> > >>> > Server0 writes ERROR to it's log (see attached server0_artemis.log). >>> > Now when trying to proceed with the client it writes the following in >>> the log and does not exit, but remains hanging forever: >>> > >>> > Oct 18, 2015 2:55:34 PM >>> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl >>> fail >>> > >>> > WARN: AMQ212037: Connection failure has been detected: AMQ119015: The >>> connection was disconnected because of server shutdown [code=DISCONNECTED] >>> > >>> > Got message: This is text message 20 (redelivered?: false) >>> > >>> > Got exception while acknowledging message: AMQ119014: Timed out after >>> waiting 30,000 ms for response when sending packet 43 >>> > >>> > Got message: This is text message 21 (redelivered?: false) >>> > >>> > Got message: This is text message 22 (redelivered?: false) >>> > >>> > Got message: This is text message 23 (redelivered?: false) >>> > >>> > Got message: This is text message 24 (redelivered?: false) >>> > >>> > Got message: This is text message 25 (redelivered?: false) >>> > >>> > Got message: This is text message 26 (redelivered?: false) >>> > >>> > Got message: This is text message 27 (redelivered?: false) >>> > >>> > Got message: This is text message 28 (redelivered?: false) >>> > >>> > Got message: This is text message 29 (redelivered?: false) >>> > >>> > As a result the slave (server1) remains stopped, not restarted as >>> expected and the master (server0) process appears to be running but does >>> not accept any connections. >>> > >>> > Exactly the same behavior is observable every time I try this. >>> > >>> > BR! >>> > Mihkel >>> > >>> >> On 13 October 2015 at 20:17, Mihkel Nõges < >>> mihkel.no...@transferwise.com> wrote: >>> >> Hi Clebert, >>> >> >>> >> No test, just doing it on command line with standalone servers. I'm >>> using 1.1.0 installed with wget, not the snapshot. >>> >> >>> >> I'm wondering what should be the suggested procedure for admins to do >>> changes to HA cluster of 2 or 3 nodes of Artemis. If one of the nodes is >>> master by configuration, do they need to change it's config before >>> restarting it to become slave to have seamless change process and make some >>> instance master by configuration only if all the instances need to be >>> restarted? >>> >> >>> >> I tried also a cluster with 2 masters and 2 slaves with 2 separate >>> group-name values, but for some reason the second master I started became >>> slave for the first immediately. I expected it to become a clustered >>> load-balancing parallel master. Our loads are not yet that high to require >>> more than one master, so it's just a theoretical question. >>> >> >>> >> BR! >>> >> Mihkel >>> >> >>> >>> On 13 October 2015 at 20:03, Clebert Suconic < >>> clebert.suco...@gmail.com> wrote: >>> >>> The master needs to copy its data from the backup back to live before >>> >>> it's activated. >>> >>> >>> >>> Do you have a test replicating this? >>> >>> >>> >>> Did you try the snapshot build? >>> >>> >>> >>> On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges >>> >>> <mihkel.no...@transferwise.com> wrote: >>> >>> > Hi, >>> >>> > >>> >>> > I configured replicating HA master-slave of Artemis 1.1.0 instances >>> on >>> >>> > Ubuntu 14.04.3. >>> >>> > >>> >>> > When I kill master the slave takes over as expected and starts >>> serving as >>> >>> > new master. When I then start the old master, it fails with the >>> following >>> >>> > errors in the log: >>> >>> > >>> >>> > 16:35:46,476 ERROR [org.apache.activemq.artemis.core.server] >>> AMQ224008: >>> >>> > Failed to store id: java.lang.IllegalStateException: Cannot find >>> add info 1 >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799) >>> >>> > [artemis-journal-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183) >>> >>> > [artemis-journal-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79) >>> >>> > [artemis-journal-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java: >>> 1784) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java: >>> 1625) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java: >>> 1535) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60] >>> >>> > >>> >>> > 16:35:46,572 WARN [org.apache.activemq.artemis.core.server] >>> AMQ222173: >>> >>> > Queue jms.queue.DLQ is duplicated during reload. This queue will be >>> renamed >>> >>> > as jms.queue.DLQ-0 >>> >>> > 16:35:46,572 ERROR [org.apache.activemq.artemis.core.server] >>> AMQ224000: >>> >>> > Failure in initialisation: java.lang.IllegalStateException: Cursor >>> 2 had >>> >>> > already been created >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java: >>> 1631) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java: >>> 1535) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at >>> >>> > >>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) >>> >>> > [artemis-server-1.1.0.jar:1.1.0] >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60] >>> >>> > >>> >>> > As a result both master and the slave remain unaccessible and no >>> further >>> >>> > restarts solve the situation. >>> >>> > >>> >>> > Attached also master and slave broker.xml files. >>> >>> > >>> >>> > BR! >>> >>> > >>> >>> > Mihkel Nõges >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Clebert Suconic >>> > >>> >> >> -- Clebert Suconic