It's hard to say; there are a number of variables. Some things to think about: Are the tasks idempotent? Do they have leases (like SQS)? Is one process responsible for processing the tasks, or will many be vying for the jobs? Are the tasks ordered by creation date, or weighted by some factor? If processing of a task fails, should another processor pick it up, or should the task be dropped, or moved to a failed list? (To guard against totally blocking processing if, say, two tasks keep failing due to a bug in the processing code.) Etc...

A simple approach might be to have a single queue of tasks:
http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Queues

where:
1) your task processors look for the first available task
2) if found, they create an ephemeral node as a child of the task node
  (if the processor dies, the ephemeral node is removed automatically; see the sketch after this list)
3) the processor processes the task, then deletes the task node when "done"
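
The claim step might look something like this with the Java client. Just a rough sketch: the /tasks parent, the "claimed" child name, and the assumption that tasks are sequential znodes are all my inventions here, not part of the recipe:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TaskClaimer {
    // Try to claim the first unclaimed task under /tasks by creating
    // an ephemeral "claimed" child. Returns the claimed task's path,
    // or null if every task currently has a claim on it.
    static String claimTask(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        List<String> tasks = zk.getChildren("/tasks", false);
        // lexicographic sort == creation order, assuming the tasks
        // were created as sequential znodes
        Collections.sort(tasks);
        for (String task : tasks) {
            String taskPath = "/tasks/" + task;
            try {
                // Ephemeral: if this processor dies, the claim goes
                // away and another processor can pick the task up.
                zk.create(taskPath + "/claimed", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return taskPath;
            } catch (KeeperException.NodeExistsException e) {
                // someone else holds the claim; try the next task
            }
        }
        return null; // nothing available right now
    }
}

Note that when a task is "done" the processor has to delete its own ephemeral child before deleting the task node itself, since a znode with children can't be deleted.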

the ephemeral node created in 2) is what indicates whether a task is available (no claim child) or not (claim child present)

processors set watches on unavailable tasks (the watch goes on the ephemeral child), and re-run 1) when the watch eventually triggers
(hint: you have to use exists("task/child", true) for the availability check, since exists() sets the watch whether or not the child currently exists)
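
Roughly, the watch side might look like this. Again just a sketch: the rescan Runnable stands in for whatever re-runs step 1), it's not a ZooKeeper API:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TaskWatch {
    // Watch the ephemeral "claimed" child of a task we couldn't claim.
    // exists() registers the watch whether or not the node is there:
    // we get NodeDeleted when the claim is released, or NodeCreated if
    // someone claims it after we looked -- which is why exists(), not
    // getData(), is the right call for the availability check.
    static void watchForRelease(ZooKeeper zk, String taskPath,
            final Runnable rescan)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists(taskPath + "/claimed", new Watcher() {
            public void process(WatchedEvent event) {
                rescan.run(); // claim state changed; re-run step 1)
            }
        });
        if (stat == null) {
            // the claim vanished between our failed create and this
            // exists() call; rescan right away rather than waiting for
            // a watch that will only fire on the next claim
            rescan.run();
        }
    }
}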

Obviously, if 3) is partially successful (i.e. you process the task and apply its effects, but fail before deleting the task node) then non-idempotent tasks are going to be an issue: the task will be picked up and processed a second time. There are probably other considerations as well, beyond the short list I gave above.


This sounds like a useful recipe to include in src/recipes if you are in a position to contribute back.

Regards,

Patrick

Alexander Sibiryakov wrote:
Hello everybody!
Please give me some advice about designing an application on ZooKeeper. I'd like to implement a queue with a limit on the number of simultaneous tasks. For example, I have 10 tasks and can process only 2 tasks simultaneously. When one task finishes processing, the system should start another, keeping the number of tasks in the processing state at 2. Thanks.
