Re: Preferred locations (or data locality) for batch pipelines.

2016-10-03 Thread Amit Sela
You're right on the money Dan! I'll just add a couple more things: 1. The HDFS API can also help out with HBase locality (if the RegionServer is running on the same node, etc.) 2. Other filesystems such as S3 and GS have "connectors" that allow users to use the Hadoop API with the "s3/g
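
For illustration, a minimal Java sketch (plain Hadoop FileSystem API, nothing Beam-specific; the class name and args are made up) of asking the filesystem which hosts store a file's blocks - the same call works through the S3/GS connectors, though object stores usually report no meaningful hosts:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Each BlockLocation lists the datanodes holding that block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
      }
    }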

Re: Preferred locations (or data locality) for batch pipelines.

2016-10-03 Thread Dan Halperin
See if this is a right interpretation: * Hadoop's InputSplit has a getLocations method that in some cases exposes useful information about the underlying data locality. * Beam jobs may run on the sa
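
As a rough sketch of the first point (Hadoop mapreduce API; TextInputFormat is an arbitrary choice and the class name is made up), listing the splits of an input path and the hosts each split reports through getLocations():

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitLocations {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Roughly one split per block; each split knows which hosts store its block.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
          System.out.println(split + " -> " + String.join(",", split.getLocations()));
        }
      }
    }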

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-26 Thread Amit Sela
Thanks for the thorough response, Dan, what you mentioned is very interesting and would clearly benefit runners. I was actually talking about something more "old-school", and specific to batch. If running a job on YARN - via MapReduce, Spark, etc. - you'd prefer that YARN would assign tasks working
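
To make that YARN preference concrete, a hypothetical helper (the method name and resource sizes are invented here) that passes a split's hosts to YARN as a placement hint via AMRMClient.ContainerRequest; by default this is a preference, not a hard constraint:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class LocalityHint {
      // Hypothetical helper: request a container on the hosts that store a given split.
      static void requestNear(AMRMClient<ContainerRequest> amClient, String[] splitHosts) {
        Resource capability = Resource.newInstance(2048 /* MB */, 1 /* vcores */);
        // nodes = hosts holding the data; racks left null so YARN infers them.
        ContainerRequest request =
            new ContainerRequest(capability, splitHosts, null, Priority.newInstance(0));
        amClient.addContainerRequest(request);
      }
    }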

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-26 Thread Dan Halperin
Hi Amit, Sorry to be late to the thread, but I've been traveling. I'm not sure I fully grokked the question, but here's one attempt at an answer: In general, any options on where a pipeline is executed should be runner-specific. One example: for Dataflow, we have the zone
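
For example, a sketch of setting such a runner-specific option (assuming the Dataflow runner module is on the classpath; around the time of this thread the placement knob was the zone option):

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ZoneOption {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // Runner-specific placement: pin the Dataflow workers to a GCE zone.
        options.setZone("us-central1-a");
      }
    }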

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Generally this makes sense, though I thought that this was what IOChannelFactory was (also) about, and eventually the runner needs to facilitate the splitting/partitioning of the source, so I was wondering if the source could have a generic mechanism for locality as well. On Thu, Sep 22, 2016 at 6:
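
Something like the following hypothetical interface (not part of Beam; the names are invented here) is what I mean by a generic mechanism a source could expose and a runner could consult when placing bundles:

    import java.io.Serializable;
    import java.util.List;

    // Hypothetical, for illustration only: a BoundedSource (or a split of one) could
    // implement this, and a runner could use the hint when scheduling the read.
    public interface HasPreferredLocations extends Serializable {
      /** Hostnames that store the data this source will read; best-effort, may be empty. */
      List<String> getPreferredLocations();
    }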

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jesse Anderson
I think the runners should. Each framework has put far more effort into data locality than Beam should. Beam should just take advantage of it. On Thu, Sep 22, 2016, 7:57 AM Amit Sela wrote: > Not where in the file, where in the cluster. > > Like you said - mapper - in MapReduce the mapper instan

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Not where in the file, where in the cluster. Like you said - mapper - in MapReduce the mapper instance will *prefer* to start on the same machine as the node hosting its input data (unless that's changed, I've been out of touch with MR for a while...). And for Spark - https://databricks.gitbooks.io/databrick
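
A small Java sketch (Spark's Java API over the old mapred TextInputFormat; the class name and paths are made up) showing how Spark surfaces an InputSplit's hosts as a partition's preferred locations, which the scheduler then uses for NODE_LOCAL placement:

    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.spark.Partition;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.collection.JavaConverters;

    public class SparkLocality {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("locality").setMaster("local[2]"));
        JavaPairRDD<LongWritable, Text> rdd =
            sc.hadoopFile(args[0], TextInputFormat.class, LongWritable.class, Text.class);
        // Each HadoopRDD partition wraps an InputSplit; its hosts become preferred locations.
        for (Partition p : rdd.rdd().partitions()) {
          List<String> hosts =
              JavaConverters.seqAsJavaListConverter(rdd.rdd().preferredLocations(p)).asJava();
          System.out.println("partition " + p.index() + " -> " + hosts);
        }
        sc.stop();
      }
    }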

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jesse Anderson
I've only ever seen that being used to figure out which file the runner/mapper/operation is working on. Otherwise, I haven't seen those operations care where in the file they're working. On Thu, Sep 22, 2016 at 5:57 AM Amit Sela wrote: > Wouldn't it force all runners to implement this for all di

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Wouldn't it force all runners to implement this for all distributed filesystems? It's true that each runner has its own "partitioning" mechanism, but I assume (maybe I'm wrong) that open-source runners use the Hadoop InputFormat/InputSplit for that... and the proper connectors for that to run on t

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jean-Baptiste Onofré
Hi Amit, as the purpose is to remove IOChannelFactory, I would suggest it's a runner concern. The Read.Bounded should "locate" the bundles on an executor close to the read data (even if it's not always possible, depending on the source). My $0.01 Regards JB On 09/22/2016 02:26 PM, Amit

Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
It's not new that batch pipelines can optimize for data locality; my question is about where this responsibility lives in Beam. If runners should implement generic Read.Bounded support, should they also implement locating the input blocks? Or should it be a part of the IOChannelFactory implementations? Or an