Artemis 2.4.0 message loss in durability tests upon system power-off

Anindya Haldar Mon, 05 Feb 2018 17:14:06 -0800

We are in the process of qualifying Artemis 2.4.0 for our stack. We ran some 
message durability tests that simulate a power failure. The broker runs in a 
VirtualBox VM on a system where disk caching is disabled. The VM runs OEL 
Linux 7, and the VirtualBox Manager itself runs under Windows 7 Enterprise.


 

We use the JMS API and persistent messaging. The transaction batch size in the 
producers is 1, and the message size for the tests is 1024 bytes. No consumers 
are running at this time, and we let the queues build up. Then the VirtualBox 
VM running the broker is 'powered off' (using VirtualBox facilities) 5 minutes 
into the run. The producers detect the broker's absence and stop.

 

Then we resume the VM and the broker. The broker starts up and we get the queue 
stats from it before anything else:

 

|NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
|testQueue1 |testQueue1 |0              |106988        |106988         |0                |0              |
|testQueue2 |testQueue2 |0              |107077        |107077         |0                |0              |
|testQueue3 |testQueue3 |0              |106996        |106996         |0                |0              |
|testQueue4 |testQueue4 |0              |107076        |107076         |0                |0              |

 

The total message count across the queues is 428137.

Now we start the consumers (no producers this time). When the consumers 
finish, we get the stats again. The consumers report that they received and 
acknowledged 428126 messages, which is corroborated by the broker in the 
MESSAGES_ACKED column.

 

|NAME       |ADDRESS    |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
|testQueue1 |testQueue1 |0              |0             |106988         |0                |106984         |
|testQueue2 |testQueue2 |0              |0             |107077         |0                |107074         |
|testQueue3 |testQueue3 |0              |0             |106996         |0                |106992         |
|testQueue4 |testQueue4 |0              |0             |107076         |0                |107076         |

 

You can clearly see some apparent anomalies:

1)      Post failure, and upon resumption, the broker said it had 428,137 
messages in the test queues, all combined (column MESSAGES_ADDED).

2)      The consumers received 428,126 messages and acknowledged all of 
them. That is 11 short of 428,137.

3)      The broker, upon the consumers' completion reported 0 queue depth, but 
also said it got acknowledgements on 428,126 messages (column MESSAGES_ACKED).
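
For anyone who wants to re-check the arithmetic, here is a small sketch that 
recomputes the totals and the per-queue shortfall from the two tables above. 
The figures are copied from the broker output; the variable names are only 
illustrative and are not part of any Artemis tooling.

```python
# Per-queue figures copied from the two broker stat tables in this message.
# MESSAGES_ADDED is taken from the first table (after restart), and
# MESSAGES_ACKED from the second table (after the consumers finished).
messages_added = {
    "testQueue1": 106988,
    "testQueue2": 107077,
    "testQueue3": 106996,
    "testQueue4": 107076,
}
messages_acked = {
    "testQueue1": 106984,
    "testQueue2": 107074,
    "testQueue3": 106992,
    "testQueue4": 107076,
}

total_added = sum(messages_added.values())
total_acked = sum(messages_acked.values())

# Messages the broker counted as added but that were never acknowledged.
missing_per_queue = {
    queue: messages_added[queue] - messages_acked[queue]
    for queue in messages_added
}

print(total_added, total_acked, total_added - total_acked)
print(missing_per_queue)
```

This shows the 11 unaccounted messages are spread over testQueue1 (4), 
testQueue2 (3) and testQueue3 (4), while testQueue4 shows no shortfall.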

 

Questions:

1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what 
happened to the additional 11 messages that the consumers never received and, 
as a result, never acknowledged?

2)      If, according to the broker, the number of acknowledged messages is 11 
less than the number of messages added to the queue, why did it declare the 
queues to be empty when 11 of the messages were not acknowledged?

3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then the 
system lost messages. And if we do not trust that statistic then what do we 
trust, and how do we know if it lost messages?

 

The system ran into this issue 3 out of the 4 times I ran the VM power failure 
test (with slightly different statistics, of course). We are very concerned 
that it is a symptom of message loss in the system, and are also concerned 
about how to explain the anomalies. We would greatly appreciate any pointers 
that can help us understand and address the underlying issue here.

 

Thanks,

Anindya Haldar

Oracle Marketing Cloud
