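I'm using ActiveMQ 5.5.1 with a centralized broker feeding approximately twenty other brokers via NetworkBridges. All of the brokers except one are working perfectly and have been for years. We recently moved our Data Warehouse (DW) to the cloud, and that broker hangs and stops communicating four or five times a day.

I've used JMX to remotely monitor the centralized broker (HUB) and the DW broker. The HUB continues to move files to and from all of the brokers except the DW. The HUB, via JMX, reports that the DW NetworkBridge is down, but the DW broker says the NetworkBridge is up. I turned on transport tracing for both the HUB and the DW brokers, and I can clearly see the KeepAlive messages going to the DW broker and the responses coming back, right up until the HUB reports that the NetworkBridge to the DW is down.

My JMX connection to the HUB continues to work, and its heap and non-heap usage seem well within design limits, but the JMX connection to the DW returns a timeout. I then logged into the DW (a Linux box) and tried to run top. It took almost a minute for the letters t, o, p to echo back, which suggested to me that the box was under heavy CPU load. Just prior to the timeout, the DW JMX connection showed that heap and non-heap usage were within design limits.

My supervisor asked two very valid questions: "How do I know if the DW broker did or did not use up heap if I cannot see heap usage via JMX?" and "Could GC be stuck?" We also noticed that all ActiveMQ logging ceases while the broker is hung.

The DW broker is supposed to run continuously. The DW itself instantiates several very large one-shot processes every ten minutes, and I suspect that this is what is causing the DW broker and JMX to hang.

Does anyone have experience troubleshooting a problem like this? What should I do to prove whether the problem is the ActiveMQ broker or the processes that the DW is instantiating? If someone has seen this problem and fixed it, how did you fix it? The only way I have found to recover the hung broker is to execute an activemq restart, which times out after thirty seconds and then does a kill on the pid.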
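
One way to approach the heap question when the remote JMX connection times out would be to have the broker JVM write its own memory and GC counters to a local file, so the history is still on disk after a hang. The following is only a minimal sketch, not something taken from ActiveMQ: the class name `HeapSampler`, the method names, and the file path and interval passed to `start` are all hypothetical, and it assumes the class is loaded and started inside the DW broker's JVM (how that is wired in is not shown). It uses only the standard `java.lang.management` MXBeans.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.Date;

// Hypothetical helper: records heap and GC counters of the JVM it runs in.
public class HeapSampler {

    // Builds one log line from the platform MXBeans of the current JVM.
    static String sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        long gcCount = 0;
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report a value.
            if (gc.getCollectionCount() > 0) gcCount += gc.getCollectionCount();
            if (gc.getCollectionTime() > 0) gcTimeMs += gc.getCollectionTime();
        }
        return "heapUsed=" + heap.getUsed()
                + " heapMax=" + heap.getMax()
                + " nonHeapUsed=" + nonHeap.getUsed()
                + " gcCount=" + gcCount
                + " gcTimeMs=" + gcTimeMs;
    }

    // Starts a daemon thread that appends a timestamped sample every intervalMs.
    // path and intervalMs are assumptions chosen by the caller.
    static Thread start(final String path, final long intervalMs) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        PrintWriter out = new PrintWriter(new FileWriter(path, true));
                        try {
                            out.println(new Date() + " " + sample());
                        } finally {
                            out.close();
                        }
                        Thread.sleep(intervalMs);
                    } catch (IOException e) {
                        System.err.println("HeapSampler: cannot write " + path + ": " + e);
                        return;
                    } catch (InterruptedException e) {
                        return; // stop sampling when interrupted
                    }
                }
            }
        }, "heap-sampler");
        t.setDaemon(true); // must not keep the JVM alive at shutdown
        t.start();
        return t;
    }

    // Demo only: prints a single sample for the JVM running this main method.
    public static void main(String[] args) {
        System.out.println(new Date() + " " + sample());
    }
}
```

If something like this were running inside the broker, the file could be read after a restart. A gap in the timestamps would show when the JVM stopped getting scheduled; a large jump in `gcTimeMs` across that gap would point toward garbage collection, while flat GC counters and modest `heapUsed` would be more consistent with the box itself being starved by the one-shot processes. This is a reading aid under those assumptions, not proof on its own.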
I've used JMX to remotely monitor the centralized broker (HUB) and the DW broker. The HUB continues to move files to/from all of the brokers except for the DW. The HUB, via JMX, reports that the DW NetworkBridge is down, but the DW broker says the NetworkBridge is up. I turned on transport tracing for both the HUB and the DW brokers and I can clearly see the KeepAlive messages going to the DW broker and the responses coming back until the HUB reports the NetworkBridge to the DW is down. My JMX connection to the HUB continues to work and Heap and Nonheap usage seem well within design limits, but the JMX connection to the DW returns a timeout. I then tried logging into the DW (Linux box) and tried to run TOP. If took almost a minute for the letters T, O, P, to echo back which suggested to me that the box was under heavy cpu load. Just prior to the timeout, the DW JMX connection showed that Heap and Nonheap were within design limits. May supervisor asked two very valid questions: "How do I know if the DW Broker did or did not use up heap if I cannot see heap usage via JMX?" and "Could GC be stuck?". We also noticed that all ActiveMQ logging ceases while the broker is hung. The DW broker is supposed to run continuously. The DW itself instantiates several very large one shot processes every ten minutes and I suspect that this is what is causing the DW broker and JMX to hang. Does anyone have experience troubleshooting a problem like this? What should I do to prove that the problem is either the ActiveMQ broker or the processed that the DW is instantiating? If someone has seen this problem and fixed it, how did you fix it? The only way I found to fix the hung broker is to execute an activemq restart that times out after thirty seconds and then does a kill on the pid. -- Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html