i'm encountering a problem in our production environment that i can
reproduce in an integration-testing setup. Durable topic subscriptions
do not fully reconnect after an interruption in network connectivity,
even though the ActiveMQ brokers re-establish their connection and
messages flow across queues.
The high-level architecture is a store & forward setup similar to the
backoffice + retail store example in "ActiveMQ in Action":
- the Central application (on Geronimo 2.1.7) and Central ActiveMQ
(5.4.2) broker run on the same machine.
- multiple remote machines host a similar pairing of Remote ActiveMQ and
Remote application.
- The apps are connecting to the standalone AMQ brokers via activemq-ra,
ignoring the AMQ instance embedded in Geronimo.
- the Central app publishes to topics on the Central broker. The topics
are dynamically included in the Remote brokers' networkConnector
configuration, which looks like this:
<networkConnectors>
  <networkConnector name="${ApplianceID}"
      userName="${networkConnectorUserName}"
      password="${networkConnectorPassword}"
      uri="static://(ssl://${Central.ServerHostname}:${central_sslPortNumber})?initialReconnectDelay=5000&amp;maxReconnectDelay=10000&amp;useExponentialBackOff=false"
      duplex="true"
      dynamicOnly="true">
    <dynamicallyIncludedDestinations>
      <queue physicalName="org.nwea.queues.central.>"/>
      <topic physicalName="org.nwea.topics.>"/>
    </dynamicallyIncludedDestinations>
  </networkConnector>
</networkConnectors>
- MDBs in the Remote application use durable subscriptions to connect to
the topics on the Remote broker. We see the durable subs show up on the
Central broker (via the web console).
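For reference, a durable topic subscription for an MDB under activemq-ra is
typically declared through activation-config properties along the lines
below. This is only a minimal sketch, not our actual descriptor: the EJB
name, class, topic name, clientId and subscriptionName are hypothetical
placeholders.

```xml
<!-- ejb-jar.xml fragment (sketch); all names and values are hypothetical -->
<message-driven>
  <ejb-name>ExampleTopicListener</ejb-name>
  <ejb-class>org.example.ExampleTopicListener</ejb-class>
  <messaging-type>javax.jms.MessageListener</messaging-type>
  <activation-config>
    <activation-config-property>
      <activation-config-property-name>destinationType</activation-config-property-name>
      <activation-config-property-value>javax.jms.Topic</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- assumed topic name; it only needs to match org.nwea.topics.> -->
      <activation-config-property-name>destination</activation-config-property-name>
      <activation-config-property-value>org.nwea.topics.example</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- "Durable" is what makes this a durable subscription -->
      <activation-config-property-name>subscriptionDurability</activation-config-property-name>
      <activation-config-property-value>Durable</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <!-- clientId + subscriptionName together identify the durable sub -->
      <activation-config-property-name>clientId</activation-config-property-name>
      <activation-config-property-value>example-remote-01</activation-config-property-value>
    </activation-config-property>
    <activation-config-property>
      <activation-config-property-name>subscriptionName</activation-config-property-name>
      <activation-config-property-value>example-subscription</activation-config-property-value>
    </activation-config-property>
  </activation-config>
</message-driven>
```

The clientId/subscriptionName pair is what identifies each durable
subscription in the brokers' web consoles.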
Whenever there's a temporary loss of network connectivity (this happens
from time to time with the provider hosting our Remotes), the Remote
brokers can re-connect to the Central broker, but the durable
subscriptions from Remote do not re-connect. They show up in the Remote
broker's web console, but not in Central's. Messages on the Central
broker's topics are not forwarded to the Remote broker's topics.
i've duplicated this behavior in our VMware environment, the only place
i can enable debug-level logging:
- i start a batch-publishing job on Central, watch the messages picked
up and processed by Remote, then disable the network interface on Remote
(i've done this for up to a minute so far). Central keeps publishing,
and Remote finishes processing messages that were forwarded to its topics.
- i re-enable Remote's network interface, and see in the ActiveMQ logs
that Remote authenticates to Central and that the DemandForwardingBridge
is re-established. i see messages flowing on Advisory topics. i can send
a message (via the Remote's AMQ console) to a dynamically included
queue, and it's forwarded to Central. In Remote's AMQ console, i see the
durable subscriptions from the Remote application's MDBs - but in
Central's AMQ console, the durable subs appear as "offline".
The only way we've discovered to bring the durable subscriptions back
on-line all the way to Central is to restart the Remote Geronimo
instance. Once restarted, Remote picks up where it left off, and all the
topic messages are retrieved and processed.
In the debug logs, we've noticed that when Remote AMQ re-connects after
the outage, queue and topic connections seem to use different ports than
before the outage, and we wonder whether this is part of the reason the
durable subscriptions fail to reconnect.
i've already tried a few minor variations in the networkConnector
configuration, the most recent being "useExponentialBackOff=false". In
addition, i've enabled TCP keepalive in the transportConnectors:
<transportConnectors>
  <transportConnector name="openwire"
      uri="tcp://0.0.0.0:${remote_openwirePortNumber}?keepAlive=true"/>
  <transportConnector name="ssl"
      uri="ssl://0.0.0.0:${remote_sslPortNumber}?keepAlive=true"/>
</transportConnectors>
We've already looked at various operating-system issues with the network
stacks on our servers, and nothing seems to be amiss - no
resource-starvation of any kind. And the point really is that we need
the durable subs to survive a brief disconnect. AMQ itself seems to
reconnect just fine. At the moment, getting rid of activemq-ra and the
Geronimo resource adapters and moving to Spring's JMS support (as one
consultant suggested) isn't an option for fixing our production issues,
regardless of how attractive it is in the bigger scheme of things.
This is a real problem for us and our customers. Any guidance is
appreciated.
--
*Joe Niski*
Senior Developer - Information Services | NWEA™