Sorry, I forgot to update the thread on this. Yes, we tested the scenarios on bare metal systems, and the durability tests passed there. So my guess is that the issue we saw came from using the VirtualBox VM. Everything worked as expected once we eliminated VirtualBox from the equation.
Thanks,
Anindya Haldar
Oracle Marketing Cloud

> On Jun 13, 2018, at 7:31 PM, Justin Bertram <jbert...@apache.org> wrote:
>
> Did you have a chance to test this scenario on a bare metal system? If so,
> what were the results? If not, did you find the root cause of the missing
> messages in something related to the VM?
>
> Based on your other recent email to the list about HA I assume you've moved
> past this issue, but I wanted to confirm for sure.
>
> Thanks!
>
>
> Justin
>
> On Wed, Feb 14, 2018 at 11:32 AM, Anindya Haldar <anindya.hal...@oracle.com>
> wrote:
>
>> We powered off the VM while the producers were alive and kicking, and no
>> one was consuming. Then we tallied the number of messages committed by
>> the producers. After that we restarted the VM, then restarted the broker,
>> and took the queue stats. Then we used the JMS QueueBrowser API to count the
>> number of actual messages in the queues. Finally we consumed all messages
>> from the queues and tallied them against the number of messages committed by
>> the producers at the time the failure was triggered.
>>
>> We are looking forward to running the tests on a bare metal system in order
>> to eliminate the VirtualBox VM from the picture.
>>
>> Thanks,
>> Anindya Haldar
>>
>> Oracle Marketing Cloud
>>
>> -----Original Message-----
>> From: Justin Bertram [mailto:jbert...@apache.org]
>> Sent: Wednesday, February 14, 2018 6:58 AM
>> To: users@activemq.apache.org
>> Subject: Re: Artemis 2.4.0 message loss in durability tests upon system power-off
>>
>> The "messages added" metric for a queue is volatile so when the broker is
>> stopped it will be reset. When the broker is started again the "messages
>> added" will be 0. In your test you say the broker is "powered off" and
>> then you "resume" the broker. What exactly does this mean? It seems clear
>> that you aren't actually shutting down the broker, otherwise the "messages
>> added" would be 0 when you started your consumers. Please clarify.
>>
>> Also, how do the broker's metrics compare with the producer's and
>> consumer's metrics? I assume here that the producer and consumer are both
>> tracking the number of messages they produce/consume.
>>
>> Also, do you have a way to reproduce this without a VM?
>>
>>
>> Justin
>>
>> On Mon, Feb 5, 2018 at 7:11 PM, Anindya Haldar <anindya.hal...@oracle.com>
>> wrote:
>>
>>> We are in the process of qualifying Artemis 2.4.0 for our stack. We
>>> ran some message durability related tests in the face of a power
>>> failure. The broker is running in a VirtualBox VM, and is set up in a
>>> system where disk caching is disabled. The VM runs OEL Linux 7, and
>>> the VirtualBox Manager itself is running under Windows 7 Enterprise.
>>>
>>> We use the JMS API and persistent messaging. The transaction batch size in
>>> the producers is 1, and the message size for the tests is 1024 bytes.
>>> No consumers are running at this time, and we let the queues build up.
>>> Then the VirtualBox VM running the broker is 'powered off' (using
>>> VirtualBox facilities) 5 minutes into the run. The producers detect the
>>> broker's absence and stop.
>>>
>>> Then we resume the VM and the broker. The broker starts up and we get
>>> the queue stats from it before anything else:
>>>
>>> |NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>>> |testQueue1 |testQueue1 |0              |106988        |106988         |0                |0              |
>>> |testQueue2 |testQueue2 |0              |107077        |107077         |0                |0              |
>>> |testQueue3 |testQueue3 |0              |106996        |106996         |0                |0              |
>>> |testQueue4 |testQueue4 |0              |107076        |107076         |0                |0              |
>>>
>>> The total message count across the queues is 428137.
>>>
>>> Now we start the consumers (no producers this time). Finally, when the
>>> consumers finish, we get the stats again. The consumers are claiming
>>> that they received and acknowledged 428126 messages, which is
>>> corroborated by the broker in the MESSAGES_ACKED column.
>>>
>>> |NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>>> |testQueue1 |testQueue1 |0              |0             |106988         |0                |106984         |
>>> |testQueue2 |testQueue2 |0              |0             |107077         |0                |107074         |
>>> |testQueue3 |testQueue3 |0              |0             |106996         |0                |106992         |
>>> |testQueue4 |testQueue4 |0              |0             |107076         |0                |107076         |
>>>
>>> You can clearly see some apparent anomalies:
>>>
>>> 1) Post failure, and upon resumption, the broker said it had 428,137
>>> messages in the test queues, all combined (column MESSAGES_ADDED).
>>>
>>> 2) When the consumers ran, they received 428,126 messages and
>>> acknowledged all of them. That is 11 short of 428,137.
>>>
>>> 3) The broker, upon the consumers' completion, reported 0 queue depth,
>>> but also said it got acknowledgements on 428,126 messages (column
>>> MESSAGES_ACKED).
>>>
>>> Questions:
>>>
>>> 1) If we assume the 'MESSAGES_ADDED' column is accurate, then what
>>> happened to the additional 11 messages that the consumers never received
>>> and, as a result, never acknowledged?
>>>
>>> 2) If, according to the broker, the number of acknowledged messages
>>> is 11 less than the number of messages added to the queue, why did it
>>> declare the queues to be empty when 11 of the messages were not
>>> acknowledged?
>>>
>>> 3) If we trust the 'MESSAGES_ADDED' stats as a baseline number then
>>> the system lost messages. And if we do not trust that statistic then
>>> what do we trust, and how do we know if it lost messages?
>>>
>>> The system ran into this issue 3 out of 4 times I ran the VM power
>>> failure test (with slightly different statistics, of course). We are
>>> very concerned that it is a symptom of message loss in the system, and
>>> are also concerned about how to explain the anomalies. We would greatly
>>> appreciate any pointers that can help us understand and address the
>>> underlying issue here.
>>>
>>> Thanks,
>>>
>>> Anindya Haldar
>>>
>>> Oracle Marketing Cloud
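
For anyone re-checking the numbers quoted in the original report, here is a minimal sketch of the tally. The figures are copied from the two stats tables in the thread; the helper name `tally` and the dictionary layout are only illustrative and are not part of Artemis or of the original test harness.

```python
# Queue stats copied from the two tables in the original report.
# Each entry: queue name -> (MESSAGES_ADDED after restart,
#                            MESSAGES_ACKED after the consumers finished)
QUEUE_STATS = {
    "testQueue1": (106988, 106984),
    "testQueue2": (107077, 107074),
    "testQueue3": (106996, 106992),
    "testQueue4": (107076, 107076),
}


def tally(stats):
    """Return (total_added, total_acked, per-queue shortfall).

    Hypothetical helper: the shortfall for a queue is the number of
    messages the broker counted as added but never saw acknowledged.
    """
    total_added = sum(added for added, _ in stats.values())
    total_acked = sum(acked for _, acked in stats.values())
    shortfall = {name: added - acked for name, (added, acked) in stats.items()}
    return total_added, total_acked, shortfall


total_added, total_acked, shortfall = tally(QUEUE_STATS)
print(f"added={total_added} acked={total_acked} missing={total_added - total_acked}")
for name, missing in shortfall.items():
    print(f"{name}: {missing} unacknowledged")
```

With the figures above, the totals match the 428,137 and 428,126 quoted in the report, and the 11-message gap is spread across testQueue1, testQueue2 and testQueue3, with testQueue4 fully accounted for. The thread does not include the producers' committed counts, so this sketch only compares the broker's own two columns.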