Hi Maha,

The info field is meant for lightweight progress metadata. If you need to store something more sophisticated, you can use HelixManager#getHelixPropertyStore for small (kilobyte-scale) state, or you can store pointers in the property store to a different store that can handle more data.
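For example, here is a rough sketch of stashing a small per-task result in the property store and reading it back on the client for consolidation. The "/RESULTS" path and the "result" field name are just illustrative choices for this sketch, not anything Helix prescribes:

import org.apache.helix.AccessOption;
import org.apache.helix.HelixManager;
import org.apache.helix.ZNRecord;
import org.apache.helix.store.zk.ZkHelixPropertyStore;

public class ResultStore {
  // Writes a small (kilobyte-scale) result under an application-chosen path.
  public static void writeResult(HelixManager manager, String taskId, String resultJson) {
    ZkHelixPropertyStore<ZNRecord> store = manager.getHelixPropertyStore();
    ZNRecord record = new ZNRecord(taskId);
    // Keep this small; each record ultimately lives in a ZooKeeper znode.
    record.setSimpleField("result", resultJson);
    store.set("/RESULTS/" + taskId, record, AccessOption.PERSISTENT);
  }

  // Reads a task's result back on the client side for aggregation.
  public static String readResult(HelixManager manager, String taskId) {
    ZkHelixPropertyStore<ZNRecord> store = manager.getHelixPropertyStore();
    ZNRecord record = store.get("/RESULTS/" + taskId, null, AccessOption.PERSISTENT);
    return record == null ? null : record.getSimpleField("result");
  }
}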
Kanak
________________________________
> From: [email protected]
> Subject: Fwd: Helix parallelism
> Date: Wed, 3 Sep 2014 10:49:41 -0700
> To: [email protected]
>
> Hi Kishore/Kanak,
>
> Thanks very much for the guidance. I have tested the feature with a
> minimal number of nodes and it works as expected, though I have not yet
> tested it exhaustively.
>
> I have a question: is there a way to get the resulting data back to the
> client for consolidation or aggregation, as an optional data object
> alongside the TaskResult object's status and info? For example,
> returning 1 or 2 KB of results from each of 5 task participants. This is
> similar to the map-reduce concept, but on a real-time basis, giving the
> client the opportunity to consolidate results.
>
> Regards,
> Maha
>
> On Aug 22, 2014, at 8:05 AM, kishore g <[email protected]> wrote:
>
> Not sure if you are subscribed to the mailing list
>
> ---------- Forwarded message ----------
> From: "Kanak Biscuitwala" <[email protected]>
> Date: Aug 21, 2014 10:02 AM
> Subject: RE: Helix parallelism
> To: "[email protected]" <[email protected]>
> Cc:
>
> Yes, you can use the task framework, which hasn't been released yet but
> will be soon. For more on the task framework, you can read this blog
> post:
> http://engineering.linkedin.com/distributed-systems/ad-hoc-task-management-apache-helix
>
> You can submit a job with 1000 tasks using either Java or YAML.
>
> The YAML specification of this job would look something like:
>
> name: MyWorkflow
> jobs:
>   - name: RunQueries
>     command: RunQuery  # The command corresponding to Task callbacks
>     jobConfigMap: {  # Arbitrary key-value pairs to pass to all tasks in this job
>       k1: "v1",
>       k2: "v2"
>     }
>     numConcurrentTasksPerInstance: 200  # Max parallelism per instance
>     tasks:  # Schedule 1000 tasks, each responsible for aggregating requests for a chunk of partitions
>       - taskConfigMap: {  # Arbitrary key-value pairs to pass to this task
>           query: "query1"
>         }
>       - taskConfigMap: {
>           query: "query2"
>         }
>       - taskConfigMap: {
>           query: "query3"
>         }
>       # Repeat for the remaining 997 tasks
>
> You can also see this class for an example of how to build jobs in Java:
> https://github.com/apache/helix/blob/master/helix-core/src/test/java/org/apache/helix/integration/task/TestIndependentTaskRebalancer.java
>
> Then you just need to implement a Task callback and register it on each
> of the instances, and Helix will take care of assignment and retries.
> ________________________________
> Date: Thu, 21 Aug 2014 09:07:11 -0700
> Subject: Helix parallelism
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> I just started looking at Helix's ability to execute tasks in parallel,
> spread evenly across the cluster's instances and resources.
>
> I have a requirement to execute many different queries in parallel.
> Can Helix help in this case?
>
> For example:
> 1. I have some 1000 different queries to be executed.
> 2. I have 5 nodes configured in the Helix cluster, each capable of
>    executing a set of queries.
> 3. I need Helix to distribute these 1000 different queries evenly
>    across the 5 nodes (200 per node), take care of re-executing any
>    failed queries, and notify the controller when the job is done.
>
> Can someone help me understand how Helix can solve this kind of problem?
>
> Regards,
> Maha
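To make the Java option above concrete, here is a rough, untested sketch of building the same 1000-task job programmatically. The builder methods have moved around between Helix releases, so treat TestIndependentTaskRebalancer (linked above) as the authoritative reference:

import java.util.ArrayList;
import java.util.List;

import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

public class SubmitQueries {
  public static void submit(HelixManager manager) {
    // One TaskConfig per query; mirrors the taskConfigMap entries in the YAML.
    List<TaskConfig> taskConfigs = new ArrayList<TaskConfig>();
    for (int i = 1; i <= 1000; i++) {
      taskConfigs.add(new TaskConfig.Builder()
          .addConfig("query", "query" + i)
          .build());
    }

    JobConfig.Builder job = new JobConfig.Builder()
        .setCommand("RunQuery") // must match the registered TaskFactory name
        .setNumConcurrentTasksPerInstance(200)
        .addTaskConfigs(taskConfigs);

    Workflow workflow = new Workflow.Builder("MyWorkflow")
        .addJob("RunQueries", job)
        .build();

    // Hand the workflow to the controller for scheduling.
    new TaskDriver(manager).start(workflow);
  }
}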
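And the Task callback itself, plus its registration on each participant, might look roughly like this. QueryTask and executeQuery are placeholders of my own; only the "RunQuery" name has to match the job's command:

import java.util.HashMap;
import java.util.Map;

import org.apache.helix.HelixManager;
import org.apache.helix.participant.StateMachineEngine;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.TaskStateModelFactory;

public class QueryTask implements Task {
  private final String query;

  public QueryTask(TaskCallbackContext context) {
    // Each task receives its own taskConfigMap (e.g. query: "query1" from the YAML).
    this.query = context.getTaskConfig().getConfigMap().get("query");
  }

  @Override
  public TaskResult run() {
    try {
      String summary = executeQuery(query);
      // info is lightweight progress metadata; keep larger results elsewhere.
      return new TaskResult(TaskResult.Status.COMPLETED, summary);
    } catch (Exception e) {
      // Returning ERROR lets Helix retry the task per the job's retry settings.
      return new TaskResult(TaskResult.Status.ERROR, e.getMessage());
    }
  }

  @Override
  public void cancel() {
    // Stop the in-flight query here if possible.
  }

  private String executeQuery(String query) {
    return "ok:" + query; // stand-in for your actual query logic
  }

  // Register the factory on each participant, before manager.connect(),
  // so Helix can instantiate tasks when it assigns them.
  public static void register(HelixManager participant) {
    Map<String, TaskFactory> factories = new HashMap<String, TaskFactory>();
    factories.put("RunQuery", new TaskFactory() {
      @Override
      public Task createNewTask(TaskCallbackContext context) {
        return new QueryTask(context);
      }
    });
    StateMachineEngine engine = participant.getStateMachineEngine();
    engine.registerStateModelFactory("Task", new TaskStateModelFactory(participant, factories));
  }
}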
