Hi Kishore/Kanak,

Thanks very much for the guidance. I have tested the feature with a minimal 
number of nodes and it works as expected, but I have not yet done exhaustive 
testing.

I have a question: is there a way to get the resulting data back to the client 
for consolidation or aggregation, as an option alongside the TaskResult 
object's status and info? For example, returning 1 or 2 KB of results from 
each of 5 task participants as an optional data object. It would be similar to 
the map-reduce concept, but on a real-time basis, giving the client the 
opportunity to consolidate the results.
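
To make it concrete, the closest workaround I can see today is packing a 
serialized result into the existing info string (AggregatingQueryTask and 
executeQuery below are just placeholder names of mine):

import java.util.Map;

import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskResult;

// Sketch: return a small (1-2 KB) result to the client by packing it into
// TaskResult's info string, since there is no dedicated data field today.
public class AggregatingQueryTask implements Task {
  private final Map<String, String> config;

  public AggregatingQueryTask(TaskCallbackContext context) {
    this.config = context.getTaskConfig().getConfigMap();
  }

  @Override
  public TaskResult run() {
    String resultJson = executeQuery(config.get("query")); // placeholder
    // The client reads back the info string from each of the 5 participants'
    // tasks and consolidates them, map-reduce style.
    return new TaskResult(TaskResult.Status.COMPLETED, resultJson);
  }

  @Override
  public void cancel() {
    // nothing to clean up in this sketch
  }

  private String executeQuery(String query) {
    return "{\"query\":\"" + query + "\",\"rowCount\":0}"; // stub result
  }
}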

Regards,
Maha

On Aug 22, 2014, at 8:05 AM, kishore g <[email protected]> wrote:

Not sure if you are subscribed to the mailing list

---------- Forwarded message ----------
From: "Kanak Biscuitwala" <[email protected]>
Date: Aug 21, 2014 10:02 AM
Subject: RE: Helix parallelism
To: "[email protected]" <[email protected]>
Cc: 

Yes, you can use the task framework, which hasn't been released yet, but will 
be soon. For more on the task framework, you can read this blog post: 
http://engineering.linkedin.com/distributed-systems/ad-hoc-task-management-apache-helix

You can submit a job with 1000 tasks using either Java or YAML.

The YAML specification of this job would look something like:

name: MyWorkflow
jobs:
    - name: RunQueries
      command: RunQuery # The command corresponding to Task callbacks
      jobConfigMap: { # Arbitrary key-value pairs to pass to all tasks in this job
        k1: "v1",
        k2: "v2"
      }
      numConcurrentTasksPerInstance: 200 # Max parallelism per instance
      tasks: # Schedule 1000 tasks, each responsible for aggregating requests for a chunk of partitions
        - taskConfigMap: { # Arbitrary key-value pairs to pass to this task
            query: "query1"
          }
        - taskConfigMap: {
            query: "query2"
          }
        - taskConfigMap: {
            query: "query3"
          }
        # ... repeat for the remaining 997 tasks


You can also see this class for an example of how to build jobs in Java: 
https://github.com/apache/helix/blob/master/helix-core/src/test/java/org/apache/helix/integration/task/TestIndependentTaskRebalancer.java
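
For instance, a rough sketch of building the same workflow programmatically 
(this assumes the JobConfig.Builder/Workflow.Builder/TaskDriver APIs of recent 
Helix; exact builder method names may differ between versions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

public class SubmitQueries {
  // Builds and submits the same workflow as the YAML spec above
  public static void submit(HelixManager manager) {
    // One TaskConfig per query, mirroring the per-task taskConfigMap entries;
    // a null per-task command falls back to the job-level command
    List<TaskConfig> taskConfigs = new ArrayList<TaskConfig>();
    for (int i = 1; i <= 1000; i++) {
      taskConfigs.add(
          new TaskConfig(null, Collections.singletonMap("query", "query" + i)));
    }

    JobConfig.Builder job = new JobConfig.Builder()
        .setCommand("RunQuery")                // matches the registered callback
        .setNumConcurrentTasksPerInstance(200) // max parallelism per instance
        .addTaskConfigs(taskConfigs);

    Workflow workflow = new Workflow.Builder("MyWorkflow")
        .addJob("RunQueries", job)
        .build();

    // Hand the workflow to the controller for assignment
    new TaskDriver(manager).start(workflow);
  }
}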

Then you just need to implement a Task callback and register it on each of the 
instances, and Helix will take care of assignment and retries.
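
A minimal sketch of that registration on a participant (the cluster, instance, 
and ZooKeeper values are placeholders, and the query execution is stubbed out):

import java.util.HashMap;
import java.util.Map;

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.participant.StateMachineEngine;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.TaskStateModelFactory;

public class QueryParticipant {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "QueryCluster", "node_1", InstanceType.PARTICIPANT, "localhost:2181");

    // Map the "RunQuery" command from the job spec to a Task factory
    Map<String, TaskFactory> factories = new HashMap<String, TaskFactory>();
    factories.put("RunQuery", new TaskFactory() {
      @Override
      public Task createNewTask(final TaskCallbackContext context) {
        return new Task() {
          @Override
          public TaskResult run() {
            String query = context.getTaskConfig().getConfigMap().get("query");
            // ... execute the query here ...
            return new TaskResult(TaskResult.Status.COMPLETED, "done: " + query);
          }

          @Override
          public void cancel() {
            // best-effort cancellation hook
          }
        };
      }
    });

    // Register under the "Task" state model so Helix can assign tasks here
    StateMachineEngine engine = manager.getStateMachineEngine();
    engine.registerStateModelFactory("Task",
        new TaskStateModelFactory(manager, factories));
    manager.connect();
  }
}
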
Date: Thu, 21 Aug 2014 09:07:11 -0700
Subject: Helix parallelism
From: [email protected]
To: [email protected]

Hi,

I just started looking at Helix's capability to execute tasks in parallel, 
spread evenly across the cluster's instances and resources.

I have a requirement to execute a set of different queries in parallel. Can 
Helix help in this case?

For example:
1. I have some 1000 different queries to be executed.
2. I have 5 nodes configured in the Helix cluster, each capable of executing a 
set of queries.
3. I need Helix to distribute these 1000 different queries evenly across the 5 
nodes (200 per node), take care of re-executing any failed queries, and notify 
the controller when the job is done.

Can someone help me understand how Helix can solve this kind of problem?

Regards,
Maha
