You can't simply leave an element in the queue until a consumer finishes processing it; otherwise, multiple consumers may end up processing it. What about the following:

- Use a failure detector to detect which consumers are up;
- Before removing an element from the queue, a consumer creates a znode (it can't be ephemeral) flagging that the consumer is processing that element, which I'll call "/processing/pr";
- Once the consumer is done with the element, it removes "/processing/pr";
- There is a watchdog client that periodically verifies whether the list of elements being processed contains any element held by a crashed consumer. In that case, this client places the element back in the queue. The watchdog needs to be careful not to generate duplicates, as it is possible that the consumer crashes after it has created "/processing/pr" but before it has had time to delete the element from the queue. (Both pieces are sketched below.)
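In code, the consumer side might look roughly like the sketch below, written against the plain Java ZooKeeper client. Everything concrete in it is an assumption for illustration: the paths /queue, /processing, and /consumers (all assumed to exist already), the pr- prefix, string payloads, and the newline framing inside the flag node. The part that matters is the ordering: register an ephemeral node for the failure detector, create the persistent flag (carrying the payload, so the watchdog can recover it later), and only then delete the queue node.

import static java.nio.charset.StandardCharsets.UTF_8;

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class QueueConsumer {
    private final ZooKeeper zk;
    private final String consumerId;
    private String currentItem; // item claimed by the most recent take()

    public QueueConsumer(ZooKeeper zk, String consumerId)
            throws KeeperException, InterruptedException {
        this.zk = zk;
        this.consumerId = consumerId;
        // Ephemeral registration doubles as the failure detector: if this
        // consumer's session dies, the node below disappears automatically.
        zk.create("/consumers/" + consumerId, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    /** Claims one element and returns its payload, or null if the queue is empty. */
    public String take() throws KeeperException, InterruptedException {
        List<String> items = zk.getChildren("/queue", false);
        Collections.sort(items); // sequential names, so oldest first
        for (String item : items) {
            byte[] payload;
            try {
                payload = zk.getData("/queue/" + item, false, null);
            } catch (KeeperException.NoNodeException e) {
                continue; // another consumer got this one first
            }
            String flag = "/processing/pr-" + item;
            try {
                // Persistent on purpose: the flag has to survive our crash so
                // the watchdog can recover the payload stored inside it.
                zk.create(flag,
                        (consumerId + "\n" + new String(payload, UTF_8)).getBytes(UTF_8),
                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                continue; // another consumer already claimed this item
            }
            try {
                zk.delete("/queue/" + item, -1);
            } catch (KeeperException.NoNodeException e) {
                // The item was fully consumed between our read and our claim;
                // release the stale flag and keep looking.
                zk.delete(flag, -1);
                continue;
            }
            currentItem = item;
            return new String(payload, UTF_8);
        }
        return null;
    }

    /** Once processing succeeds, remove the in-progress flag. */
    public void done() throws KeeperException, InterruptedException {
        zk.delete("/processing/pr-" + currentItem, -1);
    }
}

Storing the payload inside the flag node is what lets the watchdog requeue the element even though the queue znode is already gone.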

I'm not sure there is a more efficient solution, because now you need to implement a failure-detection mechanism through ZooKeeper (or externally, if you prefer) and an extra watchdog process.
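For what it's worth, that watchdog could be little more than a periodic sweep comparing the flags under /processing against the live ephemeral nodes under /consumers. The sketch below uses the same made-up layout as above, and it additionally assumes consumer ids are unique per session, so that once a registration node is gone, that id never comes back:

import static java.nio.charset.StandardCharsets.UTF_8;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class Watchdog {
    private final ZooKeeper zk;

    public Watchdog(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Run this periodically, e.g. from a timer thread. */
    public void sweep() throws KeeperException, InterruptedException {
        for (String flag : zk.getChildren("/processing", false)) {
            String path = "/processing/" + flag;
            byte[] raw;
            try {
                raw = zk.getData(path, false, null);
            } catch (KeeperException.NoNodeException e) {
                continue; // the consumer finished between our two reads
            }
            // Flag data is "<consumerId>\n<payload>", as written by take().
            String[] parts = new String(raw, UTF_8).split("\n", 2);
            String owner = parts[0];
            String payload = parts[1];
            if (zk.exists("/consumers/" + owner, false) != null) {
                continue; // owner's ephemeral node is still there: it's alive
            }
            // The duplicate pitfall mentioned above: the owner may have died
            // after creating the flag but before deleting the queue node, in
            // which case the element is still queued and must not be re-added.
            String item = flag.substring("pr-".length());
            if (zk.exists("/queue/" + item, false) == null) {
                zk.create("/queue/item-", payload.getBytes(UTF_8),
                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            }
            try {
                zk.delete(path, -1);
            } catch (KeeperException.NoNodeException ignored) {
                // another watchdog instance cleaned it up concurrently
            }
        }
    }
}

One remaining hole: the requeued element gets a fresh sequential name, so if the watchdog itself crashes between the create and the delete, the next sweep would requeue the same payload again. A production version would want to make that step idempotent.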

-Flavio

On Jan 8, 2009, at 4:15 PM, Stuart White wrote:

I'm interested in using ZooKeeper to provide a distributed
producer/consumer queue for my distributed application.

Of course, I've been studying the recipes provided for queues, barriers, etc.

My question is: how can I prevent packets of work from being lost if a
process crashes?

For example, following the distributed queue recipe, when a consumer
takes an item from the queue, it removes the first "item" znode under
the "queue" znode.  But, if the consumer immediately crashes after
removing the item from the queue, that item is lost.

Is there a recipe or recommended approach to ensure that no queue
items are lost in the event of process failure?

Thanks!
