i'm encountering a problem in our production environment that i can reproduce in an integration-testing setup. Durable topic subscriptions do not fully reconnect after an interruption in network connectivity, even though the ActiveMQ brokers re-establish their connection and messages flow across queues.

The high-level architecture is a store & forward setup similar to the backoffice + retail store example in "ActiveMQ in Action":

- the Central application (on Geronimo 2.1.7) and Central ActiveMQ (5.4.2) broker run on the same machine.

- multiple remote machines host a similar pairing of Remote ActiveMQ and Remote application.

- The apps are connecting to the standalone AMQ brokers via activemq-ra, ignoring the AMQ instance embedded in Geronimo.

- the Central app publishes to topics on the Central broker. The topics are dynamically included in the Remote brokers' networkConnector configuration, which looks like this:

<networkConnectors>
  <networkConnector name="${ApplianceID}"
                    userName="${networkConnectorUserName}"
                    password="${networkConnectorPassword}"
                    uri="static://(ssl://${Central.ServerHostname}:${central_sslPortNumber})?initialReconnectDelay=5000&amp;maxReconnectDelay=10000&amp;useExponentialBackOff=false"
                    duplex="true"
                    dynamicOnly="true">
    <dynamicallyIncludedDestinations>
      <queue physicalName="org.nwea.queues.central.>"/>
      <topic physicalName="org.nwea.topics.>"/>
    </dynamicallyIncludedDestinations>
  </networkConnector>
</networkConnectors>

- MDBs in the Remote application use durable subscriptions to connect to the topics on the Remote broker. We see the durable subs show up on the Central broker (via the web console).
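
For anyone unfamiliar with how an MDB declares a durable topic subscription through activemq-ra, the general shape is sketched here. This is an illustration only, not our actual deployment descriptor: the bean name, class, topic name, clientId and subscriptionName are all hypothetical, and it assumes a standard ejb-jar.xml using the activation-config properties accepted by the ActiveMQ resource adapter.

<message-driven>
  <!-- hypothetical bean name and class, for illustration only -->
  <ejb-name>ExampleTopicListener</ejb-name>
  <ejb-class>com.example.ExampleTopicListener</ejb-class>
  <messaging-type>javax.jms.MessageListener</messaging-type>
  <activation-config>
    <activation-config-property>
      <activation-config-property-name>destinationType</activation-config-property-name>
      <activation-config-property-value>javax.jms.Topic</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- hypothetical topic name, chosen to match the org.nwea.topics.> wildcard -->
      <activation-config-property-name>destination</activation-config-property-name>
      <activation-config-property-value>org.nwea.topics.example</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- "Durable" is what makes the broker retain the subscription while the consumer is offline -->
      <activation-config-property-name>subscriptionDurability</activation-config-property-name>
      <activation-config-property-value>Durable</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- a durable subscription is identified by clientId + subscriptionName; both values are hypothetical -->
      <activation-config-property-name>clientId</activation-config-property-name>
      <activation-config-property-value>example-remote-client</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <activation-config-property-name>subscriptionName</activation-config-property-name>
      <activation-config-property-value>example-remote-subscription</activation-config-property-value>
    </activation-config-property>
  </activation-config>
</message-driven>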

Whenever there's a temporary loss of network connectivity (this happens from time to time with the provider hosting our Remotes), the Remote brokers can re-connect to the Central broker, but the durable subscriptions from Remote do not re-connect. They show up in the Remote broker's web console, but not in Central's. Messages on the Central broker's topics are not forwarded to the Remote broker's topics.

i've duplicated this behavior in our VMware environment, the only place i can enable debug-level logging:

- i start a batch-publishing job on Central, watch the messages picked up and processed by Remote, then disable the network interface on Remote (i've done this for up to a minute so far). Central keeps publishing, and Remote finishes processing messages that were forwarded to its topics.

- i re-enable Remote's network interface, and see in the ActiveMQ logs that Remote authenticates to Central and that the DemandForwardingBridge is re-established. i see messages flowing on Advisory topics. i can send a message (via the Remote's AMQ console) to a dynamically included queue, and it's forwarded to Central. In Remote's AMQ console, i see the durable subscriptions from the Remote application's MDBs - but in Central's AMQ console, the durable subs appear as "offline".

The only way we've discovered to bring the durable subscriptions back on-line all the way to Central is to restart the Remote Geronimo instance. Once restarted, Remote picks up where it left off, and all the topic messages are retrieved and processed.

In the debug logs, we've noticed that when Remote AMQ re-connects after the outage, queue and topic connections seem to use different ports than before the outage, and wonder if this is part of the failure of durable subscriptions to reconnect.

i've already tried a few minor variations in the networkConnector configuration, the most recent being "useExponentialBackOff=false". In addition, i've enabled TCP keepalive in the transportConnectors:

<transportConnectors>
  <transportConnector name="openwire" uri="tcp://0.0.0.0:${remote_openwirePortNumber}?keepAlive=true"/>
  <transportConnector name="ssl" uri="ssl://0.0.0.0:${remote_sslPortNumber}?keepAlive=true"/>
</transportConnectors>

We've already looked at various operating-system issues with the network stacks on our servers, and nothing seems to be amiss - no resource-starvation of any kind. And the point really is that we need the durable subs to survive a brief disconnect. AMQ itself seems to reconnect just fine. At the moment, getting rid of activemq-ra and the Geronimo resource adapters and moving to Spring's JMS support (as one consultant suggested) isn't an option for our production issues, regardless of how attractive it is in the bigger scheme of things.

This is a real problem for us and our customers. Any guidance is appreciated.
--

*Joe Niski*
Senior Developer - Information Services  |  NWEA™

PHONE 503.548.5207 | FAX 503.639.7873

NWEA.ORG <http://www.nwea.org/> | Partnering to help all kids learn™
