Greetings!
We are having a devil of a time trying to reduce delay during a failover event.
We’ve set our URL to
(tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true&reconnectAttempts=200&initialConnectAttempts=200&clientFailureCheckPeriod=10000&connectionTTL=10000&callTimeout=10000
But nothing seems to reduce the time it takes for a sender to detect the failure
and fail over. With or without these parameters, it takes a full minute to
recover after failover. Are there other parameters to adjust? Something in the
broker? Is there some internal retry loop or timer with a longer wait that we
can influence?
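For example, is something like this closer to what we need? (A sketch only; we pulled `retryInterval`, `retryIntervalMultiplier`, `maxRetryInterval`, and `callFailoverTimeout` from the client connectivity docs, but we have not confirmed that they affect this code path.)

```
(tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true
    &reconnectAttempts=200&initialConnectAttempts=200
    &clientFailureCheckPeriod=10000&connectionTTL=10000&callTimeout=10000
    &callFailoverTimeout=10000&retryInterval=100
    &retryIntervalMultiplier=1.5&maxRetryInterval=5000
```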
We are testing failover by killing one of the AMQ pods. Failover almost always
succeeds without issue, but occasionally it doesn't.
The backup broker log shows that it assumes control very quickly:
2024-02-16 22:42:56,523 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
2024-02-16 22:42:56,533 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
2024-02-16 22:42:58,074 INFO [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
2024-02-16 22:42:58,254 INFO [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
Our app logs show that we were attempting to send a message to the AMQ broker at
the time of the failover:
2024-02-16T22:42:55.981 [http-nio-9910-exec-9] JmsRpcClientChannel.prepareCall:84 [9v2zwvclclrc] INFO - REQUEST OUT: {
"header": {"version":1,"type":"get_task_status_request","id":"9v2zwvclclrc","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"***REDACTED***"},
"body": {"id":"d1123865-ac47-4238-ab02-5f2324a43264","progress_start_index":0,"message_type":"get_task_status_request"}
}
But then it takes 40 seconds for the "10000 ms" timeout to fire:
2024-02-16T22:43:35.988 [http-nio-9910-exec-9] JmsProducerPool.send_:376 [9v2zwvclclrc] WARN - Error sending message, will retry javax.jms.JMSException: AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71
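To sanity-check the numbers, here is the gap between the REQUEST OUT log line and the timeout warning, computed from the timestamps above. It comes to almost exactly four 10-second intervals, which is part of why we suspect an internal retry loop:

```python
from datetime import datetime

# Timestamps taken verbatim from the two log lines above.
fmt = "%Y-%m-%dT%H:%M:%S.%f"
sent = datetime.strptime("2024-02-16T22:42:55.981", fmt)    # REQUEST OUT
failed = datetime.strptime("2024-02-16T22:43:35.988", fmt)  # AMQ219014 timeout

gap = (failed - sent).total_seconds()
print(f"{gap:.3f}")  # 40.007 seconds, roughly 4 x 10000 ms
```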
Our real problem is that this 40-second delay pushes everything back so that
recovery takes over a minute. Once we pass the one-minute mark, we start
hitting several other timeouts, such as our own RPC timeout and the nginx
ingress controller timeout in K8S. How can we determine why this takes so
long? We've set every timeout we can find on the AMQ client to 10 seconds. Is
there some other setting we need to adjust?
Meanwhile, the primary broker pod has returned, and the backup decides it is no
longer needed:
2024-02-16 22:43:16,115 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.31.2 [10952195-b6ec-11ee-9c87-aa03cb64206a] stopped, uptime 16 minutes
2024-02-16 22:43:16,115 INFO [org.apache.activemq.artemis.core.server] AMQ221039: Restarting as Replicating backup server after live restart
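In case it is relevant to the "something in the broker?" question: we assume the relevant broker.xml section is the ha-policy. A sketch of the replication-with-failback shape we believe matches the log lines above (element names taken from the broker configuration reference; illustrative only, not our exact config):

```xml
<!-- broker.xml sketch: replication HA with automatic failback (illustrative) -->
<ha-policy>
  <replication>
    <master>
      <!-- check the cluster for a running live server before starting -->
      <check-for-live-server>true</check-for-live-server>
    </master>
    <slave>
      <!-- hand control back to the primary when it returns,
           matching the AMQ221039 restart message above -->
      <allow-failback>true</allow-failback>
    </slave>
  </replication>
</ha-policy>
```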
But our app is waiting for its response on the reply-to queue, and gets this
error nearly a minute later:
2024-02-16T22:43:58.033 [Thread-6] JmsStaticConnectionPool.onException:78 [] ERROR - Receive error occurred. javax.jms.JMSException: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
This is the stack trace at the time of the first timeout:
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550)
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendFullMessage(ActiveMQSessionContext.java:588)
at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:305)
at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:277)
at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:147)
at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:129)
at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:483)
at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
at net.redpoint.ipc.jms.JmsProducerPool.sendRequest(JmsProducerPool.java:301)
at net.redpoint.ipc.jms.JmsRpcClientChannel.sendRequest(JmsRpcClientChannel.java:228)
at net.redpoint.ipc.jms.JmsRpcClientChannel.invokeRaw(JmsRpcClientChannel.java:202)
at net.redpoint.ipc.jms.JmsRpcClientChannel.call(JmsRpcClientChannel.java:101)
at net.redpoint.ipc.clients.RpcClientBase._sync(RpcClientBase.java:169)
at net.redpoint.ipc.clients.RpcClientBase._rpc(RpcClientBase.java:237)
at net.redpoint.rpdm.ipc.clients.TestHarnessClient.getTaskStatus(TestHarnessClient.java:229)
at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$25(TestHarnessHttpServer.java:390)
at net.redpoint.ipc.SecurityControl.doAsNoThrow(SecurityControl.java:272)
at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$26(TestHarnessHttpServer.java:390)
at net.redpoint.ipc.InstanceControl.doAsNoThrow(InstanceControl.java:84)
at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.getTaskStatus(TestHarnessHttpServer.java:390)
at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697)
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:357)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71]
... 65 more
This is the stack trace at the second timeout:
at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:714)
at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:735)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:868)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:856)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:802)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:566)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1407)
at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:98)
at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:212)
at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
... 7 more
John Lilley
Data Management Chief Architect, Redpoint Global Inc.
34 Washington Street, Suite 205 Wellesley Hills, MA 02481
M: +1 720-938-5761 | [email protected]