Re: [Artemis] Master fails to start up after failback

Mihkel Nõges Tue, 20 Oct 2015 07:28:34 -0700

Sorry, the last mail went out too fast. Instead of shared-store, I had
replication.
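
A minimal sketch of how a replicated master and its backup are paired in
broker.xml, assuming the Artemis 1.x ha-policy schema. The group-name value is
taken from the snippet quoted below; check-for-live-server and allow-failback
do not appear in that snippet and are shown only as an assumption about a
typical failback setup, so they should be checked against the documentation
for the version in use.

```xml
<!-- broker.xml of the master (live) broker -->
<ha-policy>
   <replication>
      <master>
         <group-name>ha-cluster1</group-name>
         <!-- assumption: on restart, look for a live server already using this node's ID -->
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

<!-- broker.xml of the backup broker: it pairs only with a live server that has the same group-name -->
<ha-policy>
   <replication>
      <slave>
         <group-name>ha-cluster1</group-name>
         <!-- assumption: hand control back when the original master returns -->
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
```

A second pair would presumably use the same two fragments with ha-cluster2 in
both files.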


On 20 October 2015 at 17:24, Mihkel Nõges <mihkel.no...@transferwise.com>
wrote:

> Also, I had a question earlier about having more than one Artemis master in
> a single cluster. When I tried this, only one of them became a master; the
> other one became a slave of the first one started, even though I set
> different group-name values for them in broker.xml. Is this expected
> behavior?
>
> <ha-policy>
>    <replication>
>       <master>
>          <group-name>ha-cluster1</group-name>
>       </master>
>    </replication>
> </ha-policy>
>
> <ha-policy>
>    <replication>
>       <master>
>          <group-name>ha-cluster2</group-name>
>       </master>
>    </replication>
> </ha-policy>
>
> Mihkel
>
> On 20 October 2015 at 16:53, Mihkel Nõges <mihkel.no...@transferwise.com>
> wrote:
>
>> Hi Tim, Clebert!
>>
>> Yes, we also considered the alternatives (
>> http://activemq.apache.org/masterslave.html):
>> *Shared Storage:*
>>
>> We do not have a high-performance shared storage solution. We have a
>> solution for our current file storage needs, but its I/O is said to be
>> very slow, and it would need to be extended to support the extra load.
>>
>> *Replicated LevelDB:*
>>
>> It sounds cool, but I'm a little bit afraid of moving from one
>> experimental solution to the next. I noticed LevelDB does not support some
>> of the features we need, like scheduled message delivery:
>> http://activemq.apache.org/replicated-leveldb-store.html
>> The LevelDB store does not yet support storing data associated with Delay
>> and Schedule Message Delivery. Those are stored in separate
>> non-replicated KahaDB data files. Unexpected results will occur if you use
>> Delay and Schedule Message Delivery with the replicated leveldb store since
>> that data will not be there when the master fails over to a slave.
>>
>> Notes like this make me feel very uneasy about the solution.
>>
>> *JDBC:*
>>
>> So it seems to me that the most reliable highly available messaging
>> solution in ActiveMQ 5 is JDBC. We have MySQL running as our main DB, and
>> setting up a second DB for messaging would be fairly simple with our standard
>> procedures for maintenance, backups, disaster recovery, etc.
>>
>>
>> I consider this only a temporary solution until we can use a more
>> performant alternative configuration, and I'm not expecting Artemis to
>> ever implement support for JDBC storage.
>>
>> We are using messaging in the process of splitting our monolithic application
>> into micro-services. As this is a gradual process, the number of messages
>> will be very small in the beginning, so a low-performing but reliable
>> JDBC-backed broker configuration seems good for a start.
>>
>> I was trying to find a more orthodox approach, but could not find or
>> get good suggestions. I tried disabling fail-back and starting the master
>> like that, which resulted in both servers spamming the logs with warnings
>> that another server with the same ID is running. Do I understand correctly
>> that I should have backed up and removed the /data folder of the master,
>> reconfigured it as a slave, and started it then?
>>
>> Can you give me an overview of existing production deployments of highly
>> available Artemis installations that fail over (not necessarily failing
>> back)? I may change my mind about going with it from the start.
>>
>> Mihkel
>>
>>
>> On 20 October 2015 at 16:19, Clebert Suconic <clebert.suco...@gmail.com>
>> wrote:
>>
>>> As far as I know ActiveMQ5 doesn't do failback on the master-slave
>>> journal... and it doesn't have any protocol to sync the data between
>>> master and slave.
>>>
>>>
>>> There is a small regression in the failback that we are dealing with now...
>>> if you set the master as a backup it would work fine...
>>>
>>>
>>> I think your testcase is a bit unorthodox...
>>>
>>> TBH production guys usually don't use failback.. they keep the backup
>>> until they can get to a quiet period and then do the failback (or
>>> restart the system) under low load.
>>>
>>>
>>> I also second Tim Bain on  your choice for JDBC.
>>>
>>> I actually always say this.. if you can use JDBC as a storage for
>>> messaging.. don't use messaging at all.. just store and retrieve from
>>> the Database.
>>>
>>>
>>> There's a JIRA open for Artemis on JDBC.. but usually those things are
>>> written because users want it, not because they need it.
>>>
>>> On Tue, Oct 20, 2015 at 3:12 AM, Mihkel Nõges
>>> <mihkel.no...@transferwise.com> wrote:
>>> > Yes, I saw that issue too and set myself as a watcher when it was
>>> > created. I did not think it could be exactly the same, as it is
>>> > described as presenting itself only in narrow timing-related conditions.
>>> > My case seems to be much more broad and basic. It seems like nobody
>>> > has actually tried to set this up in a realistic situation.
>>> >
>>> > Do you know of any existing production deployments of Artemis (or
>>> > HornetQ) with failover? I thought Artemis, being based on HornetQ,
>>> > should have its features as stable as the last HornetQ version. We have
>>> > already used embedded HornetQ happily for some time. I think it would
>>> > make a lot of sense to publicly grade the Artemis features by maturity,
>>> > with usage statistics for each feature if known, so it would be easier
>>> > to compare the brokers, even among the 3 variants of the ActiveMQ family.
>>> >
>>> > I think it's safer for us to start building our first messaging
>>> > features on ActiveMQ 5.12.1 with JDBC-backed Master-Slave instead of
>>> > Artemis, and to switch to Artemis once it has become more stable and
>>> > our scalability needs have grown enough to make it reasonable. Right now
>>> > it seems there are still blockers in Artemis big enough to threaten the
>>> > stability of our system.
>>> >
>>> > I did not mean this letter to be negative by any means. On the contrary,
>>> > I really hope Artemis does well, as it comes with such a great technical
>>> > foundation and elegant ideas. I think the best thing for Artemis would be
>>> > to find users that can trust its features and improve it as they grow.
>>> > This means the nucleus of Artemis must be really solid and stable.
>>> >
>>> > BR!
>>> > Mihkel Nõges
>>> >
>>> >
>>> >
>>> > On 19 October 2015 at 22:15, Clebert Suconic <clebert.suco...@gmail.com>
>>> > wrote:
>>> >
>>> >> Looks related to me:
>>> >>
>>> >> https://issues.apache.org/jira/browse/ARTEMIS-256
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Oct 19, 2015 at 4:04 AM, Mihkel Nõges
>>> >> <mihkel.no...@transferwise.com> wrote:
>>> >> > Basic flow for getting an unresponsive failback cluster:
>>> >> > Have a machine with Ubuntu 14.04.3
>>> >> >
>>> >> >    1. Install libaio1, Java 1.8.0_60, maven 3.3.3, download and extract
>>> >> >    apache-artemis-1.1.0-bin
>>> >> >    <http://www.eu.apache.org/dist/activemq/activemq-artemis/1.1.0/apache-artemis-1.1.0-bin.tar.gz>
>>> >> >    in /opt
>>> >> >    2. run $ mvn -Prelease install and $ mvn verify in
>>> >> >    /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback
>>> >> >    SUCCESS
>>> >> >    3. Clean data folders and start both servers manually:
>>> >> >    $ cd /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target
>>> >> >    $ rm -R server0/data/
>>> >> >    $ rm -R server1/data/
>>> >> >    $ server0/bin/artemis-service start
>>> >> >    Starting artemis-service
>>> >> >    artemis-service is now running (7154)
>>> >> >    $ server1/bin/artemis-service start
>>> >> >    Starting artemis-service
>>> >> >    artemis-service is now running (7180)
>>> >> >    4. Kill master server and wait for slave to take over
>>> >> >    $ kill -9 7154
>>> >> >
>>> >> >    $ tail -f server1/log/artemis.log
>>> >> >    08:52:54,798 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
>>> >> >    08:53:02,145 INFO  [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting live to fail before it gets active
>>> >> >    08:53:03,582 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is synchronized with live-server.
>>> >> >    08:53:03,777 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
>>> >> >    08:55:59,292 INFO  [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to become 'live'
>>> >> >    08:55:59,302 WARN  [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
>>> >> >    08:55:59,778 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: trying to deploy queue jms.queue.exampleQueue
>>> >> >    08:55:59,829 WARN  [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
>>> >> >    08:55:59,836 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
>>> >> >    08:55:59,869 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at broker3:61617 for protocols [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
>>> >> >    5. Start master again and observe the logs:
>>> >> >    $ server0/bin/artemis-service start
>>> >> >    Starting artemis-service
>>> >> >    artemis-service is now running (7388)
>>> >> >
>>> >> > $ tail -f server0/log/artemis.log
>>> >> > 08:57:24,625 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221012:
>>> >> > Using AIO Journal
>>> >> > 08:57:24,694 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-server]. Adding protocol support
>>> for:
>>> >> CORE
>>> >> > 08:57:24,702 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-amqp-protocol]. Adding protocol
>>> support
>>> >> > for: AMQP
>>> >> > 08:57:24,731 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-hornetq-protocol]. Adding protocol
>>> >> support
>>> >> > for: HORNETQ
>>> >> > 08:57:24,733 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-mqtt-protocol]. Adding protocol
>>> support
>>> >> > for: MQTT
>>> >> > 08:57:24,743 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-openwire-protocol]. Adding protocol
>>> >> support
>>> >> > for: OPENWIRE
>>> >> > 08:57:24,878 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221043:
>>> >> > Protocol module found: [artemis-stomp-protocol]. Adding protocol
>>> support
>>> >> > for: STOMP
>>> >> > 08:57:25,082 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221109:
>>> >> > Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started,
>>> >> waiting
>>> >> > live to fail before it gets active
>>> >> > 08:57:27,043 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221024:
>>> >> > Backup server
>>> >> > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
>>> is
>>> >> > synchronized with live-server.
>>> >> > 08:57:27,948 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221031:
>>> >> > backup announced
>>> >> > 08:57:31,227 WARN  [org.apache.activemq.artemis.core.client]
>>> AMQ212037:
>>> >> > Connection failure has been detected: AMQ119015: The connection was
>>> >> > disconnected because of server shutdown [code=DISCONNECTED]
>>> >> > 08:57:31,252 WARN  [org.apache.activemq.artemis.core.client]
>>> AMQ212037:
>>> >> > Connection failure has been detected: AMQ119015: The connection was
>>> >> > disconnected because of server shutdown [code=DISCONNECTED]
>>> >> > 08:57:31,307 WARN  [org.apache.activemq.artemis.core.client]
>>> AMQ212037:
>>> >> > Connection failure has been detected: AMQ119015: The connection was
>>> >> > disconnected because of server shutdown [code=DISCONNECTED]
>>> >> > 08:57:31,339 INFO  [org.apache.activemq.artemis.core.server]
>>> AMQ221037:
>>> >> > ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
>>> to
>>> >> > become 'live'
>>> >> > 08:57:31,360 WARN  [org.apache.activemq.artemis.core.client]
>>> AMQ212004:
>>> >> > Failed to connect to server.
>>> >> > 08:57:31,413 ERROR [org.apache.activemq.artemis.core.server]
>>> AMQ224008:
>>> >> > Failed to store id: java.lang.IllegalStateException: Cannot find add
>>> >> info 1
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)
>>> >> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)
>>> >> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)
>>> >> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:
>>> 1784)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> 1625)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> 1535)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >> > 08:57:31,540 WARN  [org.apache.activemq.artemis.core.server]
>>> AMQ222173:
>>> >> > Queue jms.queue.exampleQueue is duplicated during reload. This
>>> queue will
>>> >> > be renamed as jms.queue.exampleQueue-0
>>> >> > 08:57:31,550 ERROR [org.apache.activemq.artemis.core.server]
>>> AMQ224000:
>>> >> > Failure in initialisation: java.lang.IllegalStateException: Cursor
>>> 2 had
>>> >> > already been created
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> 1631)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> 1535)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at
>>> >> >
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >> >
>>> >> >
>>> >> > On 19 October 2015 at 10:31, Mihkel Nõges <mihkel.no...@transferwise.com>
>>> >> > wrote:
>>> >> >
>>> >> >> Hi Clebert,
>>> >> >>
>>> >> >> I do not have any code to share with you other than the example code in
>>> >> >> the Artemis 1.1.0 binary deployment package. I'm running
>>> >> >> org.apache.activemq.artemis.jms.example.ReplicatedFailbackExample
>>> >> >>
>>> >> >> and only commented out the serverStart and killServer calls, which I am
>>> >> >> doing manually.
>>> >> >>
>>> >> >> I do not think I am doing any of the steps too fast, as I tail the server
>>> >> >> log files in parallel and see that everything has finished before I start
>>> >> >> the failback. I have waited many minutes in between.
>>> >> >>
>>> >> >> The only configuration change to the test is replacing the localhost
>>> >> >> addresses with broker3 to make the cluster accessible remotely.
>>> >> >>
>>> >> >> BR!
>>> >> >> Mihkel
>>> >> >>
>>> >> >> On 18 October 2015 at 17:49, Clebert <clebert.suco...@gmail.com> wrote:
>>> >> >>
>>> >> >>> I'm not at my computer now, but it sounds like you are doing a failback
>>> >> >>> immediately after the failover. It takes some time (seconds) for the
>>> >> >>> server to activate on the backup.
>>> >> >>>
>>> >> >>> Later the server will need to copy the data back before it can be
>>> >> >>> activated in failback mode.
>>> >> >>>
>>> >> >>> It sounds like the live is not reaching the backup for failback.
>>> >> >>>
>>> >> >>> I will try looking at it on Monday. Maybe you could post your example
>>> >> >>> at your GitHub fork.
>>> >> >>>
>>> >> >>> -- Clebert Suconic typing on the iPhone.
>>> >> >>>
>>> >> >>> > On Oct 18, 2015, at 08:15, Mihkel Nõges <mihkel.no...@transferwise.com> wrote:
>>> >> >>> >
>>> >> >>> > Hello again!
>>> >> >>> >
>>> >> >>> > I would be very grateful if someone could answer my questions. We need
>>> >> >>> > high availability to work in order to use the broker in production.
>>> >> >>> >
>>> >> >>> > When I run the replicated-failback example on one machine (broker3),
>>> >> >>> > it succeeds.
>>> >> >>> >
>>> >> >>> > It fails when I run the same test, with exactly the same servers, but
>>> >> >>> > with a slightly modified client running remotely.
>>> >> >>> >
>>> >> >>> > I run the client in debug mode from my IDE with the serverStart and
>>> >> >>> > killServer calls commented out.
>>> >> >>> > Deleted data folders and started the servers:
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ rm -R server0/data/
>>> >> >>> >
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ rm -R server1/data/
>>> >> >>> >
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server0/bin/artemis-service start
>>> >> >>> >
>>> >> >>> > Starting artemis-service
>>> >> >>> >
>>> >> >>> > artemis-service is now running (23357)
>>> >> >>> >
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server1/bin/artemis-service start
>>> >> >>> >
>>> >> >>> > Starting artemis-service
>>> >> >>> >
>>> >> >>> > artemis-service is now running (23383)
>>> >> >>> >
>>> >> >>> > Starting client and stopping on breakpoint at line 103:
>>> >> >>> > //ServerUtil.killServer(server0);
>>> >> >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail as failover to the
>>> >> >>> > // backup server has occurred
>>> >> >>> > try {
>>> >> >>> >    message0.acknowledge();  //line 103
>>> >> >>> > killing server0
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ kill -9 23357
>>> >> >>> >
>>> >> >>> > Proceeding to breakpoint at line 121:
>>> >> >>> > //server0 = ServerUtil.startServer(args[0], ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000);
>>> >> >>> >
>>> >> >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail as failover to the
>>> >> >>> > // backup server has occurred
>>> >> >>> > try {
>>> >> >>> >    message0.acknowledge(); // line 121
>>> >> >>> > Starting server0:
>>> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server0/bin/artemis-service start
>>> >> >>> >
>>> >> >>> > Starting artemis-service
>>> >> >>> >
>>> >> >>> > artemis-service is now running (24240)
>>> >> >>> >
>>> >> >>> > Server0 writes an ERROR to its log (see attached server0_artemis.log).
>>> >> >>> > Now when trying to proceed with the client, it writes the following in
>>> >> >>> > the log and does not exit, but remains hanging forever:
>>> >> >>> >
>>> >> >>> > Oct 18, 2015 2:55:34 PM org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl fail
>>> >> >>> >
>>> >> >>> > WARN: AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
>>> >> >>> >
>>> >> >>> > Got message: This is text message 20 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got exception while acknowledging message: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
>>> >> >>> >
>>> >> >>> > Got message: This is text message 21 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 22 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 23 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 24 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 25 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 26 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 27 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 28 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > Got message: This is text message 29 (redelivered?: false)
>>> >> >>> >
>>> >> >>> > As a result the slave (server1) remains stopped, not restarted as
>>> >> >>> > expected, and the master (server0) process appears to be running but
>>> >> >>> > does not accept any connections.
>>> >> >>> >
>>> >> >>> > Exactly the same behavior is observable every time I try this.
>>> >> >>> >
>>> >> >>> > BR!
>>> >> >>> > Mihkel
>>> >> >>> >
>>> >> >>> >> On 13 October 2015 at 20:17, Mihkel Nõges <mihkel.no...@transferwise.com> wrote:
>>> >> >>> >> Hi Clebert,
>>> >> >>> >>
>>> >> >>> >> No test, just doing it on the command line with standalone servers. I'm
>>> >> >>> >> using 1.1.0 installed with wget, not the snapshot.
>>> >> >>> >>
>>> >> >>> >> I'm wondering what the suggested procedure is for admins making changes
>>> >> >>> >> to an HA cluster of 2 or 3 Artemis nodes. If one of the nodes is master
>>> >> >>> >> by configuration, do they need to change its config before restarting it
>>> >> >>> >> so that it becomes a slave, to have a seamless change process, and make
>>> >> >>> >> some instance master by configuration only if all the instances need to
>>> >> >>> >> be restarted?
>>> >> >>> >>
>>> >> >>> >> I also tried a cluster with 2 masters and 2 slaves with 2 separate
>>> >> >>> >> group-name values, but for some reason the second master I started
>>> >> >>> >> immediately became a slave of the first. I expected it to become a
>>> >> >>> >> clustered, load-balancing parallel master. Our loads are not yet high
>>> >> >>> >> enough to require more than one master, so it's just a theoretical
>>> >> >>> >> question.
>>> >> >>> >>
>>> >> >>> >> BR!
>>> >> >>> >> Mihkel
>>> >> >>> >>
>>> >> >>> >>> On 13 October 2015 at 20:03, Clebert Suconic <clebert.suco...@gmail.com> wrote:
>>> >> >>> >>> The master needs to copy its data from the backup back to live before
>>> >> >>> >>> it's activated.
>>> >> >>> >>>
>>> >> >>> >>> Do you have a test replicating this?
>>> >> >>> >>>
>>> >> >>> >>> Did you try the snapshot build?
>>> >> >>> >>>
>>> >> >>> >>> On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges
>>> >> >>> >>> <mihkel.no...@transferwise.com> wrote:
>>> >> >>> >>> > Hi,
>>> >> >>> >>> >
>>> >> >>> >>> > I configured a replicating HA master-slave pair of Artemis 1.1.0
>>> >> >>> >>> > instances on Ubuntu 14.04.3.
>>> >> >>> >>> >
>>> >> >>> >>> > When I kill the master, the slave takes over as expected and starts
>>> >> >>> >>> > serving as the new master. When I then start the old master, it fails
>>> >> >>> >>> > with the following errors in the log:
>>> >> >>> >>> >
>>> >> >>> >>> > 16:35:46,476 ERROR [org.apache.activemq.artemis.core.server]
>>> >> >>> AMQ224008:
>>> >> >>> >>> > Failed to store id: java.lang.IllegalStateException: Cannot
>>> find
>>> >> >>> add info 1
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)
>>> >> >>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)
>>> >> >>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)
>>> >> >>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:
>>> >> >>> 1784)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> >> >>> 1625)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> >> >>> 1535)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >> >>> >>> >
>>> >> >>> >>> > 16:35:46,572 WARN  [org.apache.activemq.artemis.core.server]
>>> >> >>> AMQ222173:
>>> >> >>> >>> > Queue jms.queue.DLQ is duplicated during reload. This queue
>>> will
>>> >> be
>>> >> >>> renamed
>>> >> >>> >>> > as jms.queue.DLQ-0
>>> >> >>> >>> > 16:35:46,572 ERROR [org.apache.activemq.artemis.core.server]
>>> >> >>> AMQ224000:
>>> >> >>> >>> > Failure in initialisation: java.lang.IllegalStateException:
>>> >> Cursor
>>> >> >>> 2 had
>>> >> >>> >>> > already been created
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> >> >>> 1631)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> >> >>> 1535)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at
>>> >> >>> >>> >
>>> >> >>>
>>> >>
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >> >>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >> >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >> >>> >>> >
>>> >> >>> >>> > As a result both the master and the slave remain inaccessible, and no
>>> >> >>> >>> > further restarts resolve the situation.
>>> >> >>> >>> >
>>> >> >>> >>> > Also attached are the master and slave broker.xml files.
>>> >> >>> >>> >
>>> >> >>> >>> > BR!
>>> >> >>> >>> >
>>> >> >>> >>> > Mihkel Nõges
>>> >> >>> >>>
>>> >> >>> >>>
>>> >> >>> >>>
>>> >> >>> >>> --
>>> >> >>> >>> Clebert Suconic
>>> >> >>> >
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Clebert Suconic
>>> >>
>>>
>>>
>>>
>>> --
>>> Clebert Suconic
>>>
>>
>>
>
