Hi Kishore/Kanak,

Thanks very much for the guidance. I have tested the feature with a minimal number of nodes and it works as expected, but I have not yet done exhaustive testing.
I have a question: is there a way to get the resulting data back to the client for consolidation or aggregation, as an optional data object alongside the TaskResult object (which currently carries only status and info)? For example, returning 1 or 2 KB of results from each of 5 task participants. This is similar to the map-reduce concept, but on a real-time basis, giving the client the opportunity to consolidate results.

Regards,
Maha

On Aug 22, 2014, at 8:05 AM, kishore g <[email protected]> wrote:

Not sure if you are subscribed to the mailing list

---------- Forwarded message ----------
From: "Kanak Biscuitwala" <[email protected]>
Date: Aug 21, 2014 10:02 AM
Subject: RE: Helix parallelism
To: "[email protected]" <[email protected]>
Cc:

Yes, you can use the task framework, which hasn't been released yet but will be soon. For more on the task framework, you can read this blog post: http://engineering.linkedin.com/distributed-systems/ad-hoc-task-management-apache-helix

You can submit a job with 1000 tasks using either Java or YAML.
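To make the question concrete, here is a minimal sketch of the client-side consolidation being asked about. The `Status`/`TaskResultLike` types are local stand-ins that mirror the status-plus-info shape of Helix's TaskResult, not Helix classes, and packing a small serialized payload into the info string is an assumed workaround, not a confirmed Helix feature:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch: each participant packs a small (1-2 KB) result payload into the
 * "info" string of its task result, and the client merges the payloads.
 * The Status/TaskResultLike types below are stand-ins, not Helix classes.
 */
public class ResultAggregation {

    enum Status { COMPLETED, ERROR }

    // Stand-in for Helix's TaskResult (which carries status + info only).
    static final class TaskResultLike {
        final Status status;
        final String info; // small serialized payload, e.g. "query1=42"

        TaskResultLike(Status status, String info) {
            this.status = status;
            this.info = info;
        }
    }

    // Client-side consolidation: merge the per-task payloads into one map.
    static Map<String, String> aggregate(List<TaskResultLike> results) {
        Map<String, String> merged = new TreeMap<>();
        for (TaskResultLike r : results) {
            if (r.status != Status.COMPLETED) {
                continue; // skip failed tasks; Helix would retry them
            }
            String[] kv = r.info.split("=", 2);
            merged.put(kv[0], kv[1]);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<TaskResultLike> fromParticipants = new ArrayList<>();
        fromParticipants.add(new TaskResultLike(Status.COMPLETED, "query1=42"));
        fromParticipants.add(new TaskResultLike(Status.COMPLETED, "query2=7"));
        fromParticipants.add(new TaskResultLike(Status.ERROR, "query3=boom"));
        System.out.println(aggregate(fromParticipants)); // {query1=42, query2=7}
    }
}
```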
The YAML specification of this job would look something like:

name: MyWorkflow
jobs:
  - name: RunQueries
    command: RunQuery  # The command corresponding to Task callbacks
    jobConfigMap: {    # Arbitrary key-value pairs to pass to all tasks in this job
      k1: "v1",
      k2: "v2"
    }
    numConcurrentTasksPerInstance: 200  # Max parallelism per instance
    tasks:  # Schedule 1000 tasks, each responsible for aggregating requests for a chunk of partitions
      - taskConfigMap: {  # Arbitrary key-value pairs to pass to this task
          query: "query1"
        }
      - taskConfigMap: {
          query: "query2"
        }
      - taskConfigMap: {
          query: "query3"
        }
      # Repeat for the remaining 997 tasks

You can also see this class for an example of how to build jobs in Java: https://github.com/apache/helix/blob/master/helix-core/src/test/java/org/apache/helix/integration/task/TestIndependentTaskRebalancer.java

Then you just need to implement a Task callback and register it on each of the instances, and Helix will take care of assignment and retries.

Date: Thu, 21 Aug 2014 09:07:11 -0700
Subject: Helix parallelism
From: [email protected]
To: [email protected]

Hi,

I just started looking at Helix's ability to execute tasks in parallel, spread evenly across the cluster's instances and resources. I have a requirement to execute many different queries in parallel. Can Helix help in this case? For example:

1. I have some 1000 different queries to be executed.
2. I have 5 nodes configured in the Helix cluster, each capable of executing a set of queries.
3. I need Helix to distribute these 1000 different queries evenly across the 5 nodes (200 per node), take care of re-executing failed queries, and notify the controller when the job is done.

Can someone help me understand how Helix can solve this kind of problem?

Regards,
Maha
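The Task callback mentioned above can be sketched as follows. This is a simplified, self-contained illustration: the `Task`/`TaskResult` shapes mirror the run/cancel interface of the Helix task framework but are local stand-ins, and `RunQueryTask` with its fake query execution is hypothetical. In a real deployment you would implement Helix's own Task interface and register a factory for the "RunQuery" command on each instance:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch of a "RunQuery" task callback. The Task/TaskResult types below
 * are local stand-ins that mirror the Helix task framework's run/cancel
 * interface; a real implementation would use the Helix classes instead.
 */
public class RunQueryExample {

    enum Status { COMPLETED, CANCELED, ERROR }

    static final class TaskResult {
        final Status status;
        final String info;

        TaskResult(Status status, String info) {
            this.status = status;
            this.info = info;
        }
    }

    interface Task {
        TaskResult run();   // invoked once per assigned task
        void cancel();      // invoked if the job is stopped
    }

    // One task per query; the query string comes from the task's config
    // (the taskConfigMap entries in the YAML above).
    static final class RunQueryTask implements Task {
        private final String query;
        private final AtomicBoolean canceled = new AtomicBoolean(false);

        RunQueryTask(Map<String, String> taskConfig) {
            this.query = taskConfig.get("query");
        }

        @Override
        public TaskResult run() {
            if (canceled.get()) {
                return new TaskResult(Status.CANCELED, query);
            }
            // Hypothetical query execution; a real task would hit a datastore.
            String rows = "rows-for-" + query;
            return new TaskResult(Status.COMPLETED, rows);
        }

        @Override
        public void cancel() {
            canceled.set(true);
        }
    }

    public static void main(String[] args) {
        Task t = new RunQueryTask(Map.of("query", "query1"));
        TaskResult r = t.run();
        System.out.println(r.status + " " + r.info); // COMPLETED rows-for-query1
    }
}
```

With 1000 such tasks and numConcurrentTasksPerInstance set to 200, Helix handles the 200-per-node assignment and the retry of failed tasks that the original question asks about.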
