Re: queue with limit to number of simultaneous tasks

2009-07-27 Thread Alexander Sibiryakov
Thank you for these thoughts. They helped me look at my problem from a
different angle. It seems that I can use a slightly modified lock recipe.
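
A minimal sketch of what such a modified lock recipe might look like in
Java: the standard lock recipe turned into a counting semaphore. The
/semaphore parent node (assumed to already exist) and the limit of 2 are
illustrative assumptions, not details from this thread.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class TaskSlots {
        static final String ROOT = "/semaphore"; // assumed to exist already
        static final int LIMIT = 2;              // max simultaneous tasks

        // Blocks until this process holds one of the LIMIT slots;
        // returns the slot path, which release() deletes when done.
        static String acquire(ZooKeeper zk) throws Exception {
            String path = zk.create(ROOT + "/slot-", new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String name = path.substring(path.lastIndexOf('/') + 1);
            while (true) {
                CountDownLatch changed = new CountDownLatch(1);
                // read the group, leaving a watch so we wake on any change
                List<String> kids =
                        zk.getChildren(ROOT, e -> changed.countDown());
                Collections.sort(kids);
                if (kids.indexOf(name) < LIMIT) {
                    return path;  // among the first LIMIT: the slot is ours
                }
                changed.await();  // group changed; loop around and re-check
            }
        }

        static void release(ZooKeeper zk, String path) throws Exception {
            zk.delete(path, -1);  // the ephemeral also vanishes on a crash
        }
    }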


--
AS


Re: queue with limit to number of simultaneous tasks

2009-07-14 Thread Ted Dunning
As you look at this, I would be grateful if you could evaluate alternative
implementations in which

a) each task is a separate file

or

b) all tasks are listed and described in a single file that is updated
atomically using the standard ZK read-modify-write-repeat-on-failure style
(see the sketch after this list)

or

c) all tasks are listed in a single file, but their descriptions are kept in
separate files whose names are in the single file.  Atomic updates occur to
the single file, task files are cleaned up as well as possible.  And task
files that are not deleted in good order (should be exceedingly rare) can be
recognized by lack of a reference from the single control file.
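
A minimal sketch of option (b)'s atomic update loop, assuming the whole
task list lives in one znode; the path parameter and the byte[]-to-byte[]
edit function are illustrative:

    // Read-modify-write-repeat-on-failure against a single znode.
    import java.util.function.UnaryOperator;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class AtomicTaskList {
        // Read the znode, apply the edit, and write back conditioned on
        // the version we read; on BADVERSION another writer got in first,
        // so re-read and repeat.
        static void update(ZooKeeper zk, String path,
                           UnaryOperator<byte[]> edit)
                throws KeeperException, InterruptedException {
            while (true) {
                Stat stat = new Stat();
                byte[] data = zk.getData(path, false, stat);
                try {
                    zk.setData(path, edit.apply(data), stat.getVersion());
                    return;  // write applied against the version we read
                } catch (KeeperException.BadVersionException raced) {
                    // lost the race to a concurrent writer; try again
                }
            }
        }
    }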


The trade-offs here occur with large numbers of running tasks, large
numbers of pending tasks, or very high task churn rates.  Option (a) becomes
very bad with many pending tasks because selecting a task may require server
round trips proportional to the number of pending tasks.  Option (b) might
exceed the maximum file size for a moderate number of tasks.  Option (c) seems
safe except for the occasional need for garbage cleanup if programs fail
between updating the control file and deleting the task files.  Mostly
people talk about (a), but (c) seems very competitive to me.

All of these alternatives simply implement the "look for" verb in Patrick's
excellent outline.  What he suggests as the convention for working on tasks
is quite reasonable.

On Tue, Jul 14, 2009 at 9:45 AM, Patrick Hunt wrote:

> 1) your task processors look for the first available task
> 2) if found, they create an ephemeral node as a child of the task node
>  (if the processor dies the ephemeral node will be removed)
> 3) the processor processes the task, then deletes the task when "done"
>


Re: queue with limit to number of simultaneous tasks

2009-07-14 Thread Patrick Hunt
It's hard to say; there are a number of variables. Some things to think
about: Are the tasks idempotent? Do they have leases (like SQS)? Is one
process responsible for processing the tasks, or will you have many vying
for the jobs? Are the tasks ordered by creation date, or weighted by
some factor? If processing for a task fails, should another processor
start processing it, drop the task, or move the task to a failed list
(to guard against totally blocking processing if two tasks are continually
failing due to, say, an error in the processing code)? Etc.


A simple approach might be to have a single queue of tasks:
http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_recipes_Queues

where:
1) your task processors look for the first available task
2) if found, they create an ephemeral node as a child of the task node
  (if the processor dies the ephemeral node will be removed)
3) the processor processes the task, then deletes the task when "done"

The ephemeral node created in 2) indicates whether a task is available or not.

Processors set watches on unavailable tasks (the watch is on the
ephemeral node) and re-run 1) when the watch eventually triggers.

(Hint: you have to use exists("task/child", true) for the availability check.)
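
A rough Java sketch of steps 1)-3) above, assuming the tasks are children
of a /tasks queue node and the claim child is named "lock"; both names are
illustrative assumptions, not part of the recipe:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class TaskClaimer {
        // Step 1: scan for the first task nobody has claimed. Returns the
        // task path we now own, or null if everything is taken (in which
        // case watches are left behind so `retry` fires on any release).
        static String claimFirst(ZooKeeper zk, Watcher retry)
                throws KeeperException, InterruptedException {
            List<String> tasks = zk.getChildren("/tasks", false);
            Collections.sort(tasks); // oldest first if names are sequential
            for (String task : tasks) {
                String lock = "/tasks/" + task + "/lock";
                try {
                    // step 2: the ephemeral child marks the task as taken
                    zk.create(lock, new byte[0], Ids.OPEN_ACL_UNSAFE,
                              CreateMode.EPHEMERAL);
                    return "/tasks/" + task;
                } catch (KeeperException.NodeExistsException taken) {
                    // the hint above: exists() with a watch both checks
                    // availability and wakes us when the claim goes away
                    zk.exists(lock, retry);
                }
            }
            return null;
        }

        // Step 3: after processing, delete the claim, then the task.
        static void done(ZooKeeper zk, String taskPath) throws Exception {
            zk.delete(taskPath + "/lock", -1);
            zk.delete(taskPath, -1);
        }
    }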

Obviously, if 3) is partially successful (i.e. you process the task and
update, but fail before deleting the task node), then non-idempotence is
going to be an issue. There are probably other considerations beyond the
short list I gave above.



This sounds like a useful recipe to include in src/recipes if you are in 
a position to contribute back.


Regards,

Patrick

Alexander Sibiryakov wrote:

Hello everybody!
Please give me advice about designing an application on ZooKeeper. I'd like
to implement a queue with a limit on the number of simultaneous tasks. For
example, I have 10 tasks, and I can process only 2 tasks simultaneously.
When one task finishes processing, the system should start another, keeping
the number of tasks in the processing state within the limit of 2. Thanks.