Hi Ben,

With (Beam over) Spark over YARN, data locality is taken into account when 
containers are created to process data.

So, in principle, if you have your CAD file in HDFS on hostA, then the 
processing will likely happen on hostA or “near enough”.
Supported locality levels range from same process, to same host, to same rack 
(if configured).
I’m using (virtual/fake) racks to specify virtual machines/containers on the 
same physical host, and it works like a charm.
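
For illustration, here is a minimal sketch of one way to implement such a 
fake-rack mapping, using Hadoop’s DNSToSwitchMapping extension point (wired in 
via net.topology.node.switch.mapping.impl in core-site.xml). The VM naming 
scheme below is a made-up assumption; adapt it to yours:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Maps each VM hostname to a "rack" named after its physical host,
    // so YARN treats VMs co-located on one machine as rack-local.
    public class PhysicalHostRackMapping implements DNSToSwitchMapping {
      @Override
      public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>();
        for (String name : names) {
          // Assumption: names look like "vm3.host-a.example.com", i.e.
          // everything after the first label identifies the physical host.
          int dot = name.indexOf('.');
          String physicalHost = (dot >= 0) ? name.substring(dot + 1) : name;
          racks.add("/" + physicalHost);
        }
        return racks;
      }

      @Override
      public void reloadCachedMappings() { /* nothing cached */ }

      @Override
      public void reloadCachedMappings(List<String> names) { /* nothing cached */ }
    }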

Be aware, though, that the whole Spark + YARN model is built around reliable 
computation: the scheduler tries to allocate resources where it thinks is 
best, but it may allocate them elsewhere. In my experiments, for instance, a 
small percentage of tasks is allocated “far away” for no apparent reason. That 
is fine for my use case; I’m not sure it is for yours.
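
If that stray scheduling matters for your use case, Spark’s locality wait 
settings are the usual knob. A small sketch (these are standard Spark 
configuration keys; the 10s values are arbitrary examples):

    import org.apache.spark.SparkConf;

    // Make the scheduler wait longer before falling back to a less-local
    // placement. 3s is the default; the 10s values are just examples.
    SparkConf conf = new SparkConf()
        .set("spark.locality.wait", "10s")
        .set("spark.locality.wait.node", "10s")
        .set("spark.locality.wait.rack", "10s");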

In addition, you need to store your CAD files “somewhere” that YARN 
understands; HDFS, for instance, would be the natural choice.
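
Getting files into HDFS from Java is straightforward; a minimal sketch (the 
paths and the namenode address are made-up examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadCadFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // example address
        FileSystem fs = FileSystem.get(conf);
        // Copy a local CAD file into HDFS; YARN can then schedule work
        // near the blocks of this file.
        fs.copyFromLocalFile(new Path("/tmp/model.cad"),
                             new Path("/data/cad/model.cad"));
        fs.close();
      }
    }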

Finally, note that my experience is with Spark over YARN (actually without 
Beam), but Flink also supports running on YARN. I’m not sure if/how it manages 
data locality, but it probably does too.

Best,


> On May 22, 2016, at 2:01 PM, Stadin, Benjamin 
> <[email protected]> wrote:
> 
> Hi JB,
> 
> None so far. I’m still thinking about how to achieve what I want to do,
> and whether Beam makes sense for my usage scenario.
> 
> I’m mostly interested in just orchestrating tasks to individual machines
> and service endpoints, depending on their workload. My application is not
> so much about Big Data and parallelism, but about local data processing
> and local parallelization. 
> 
> An example scenario:
> - A user uploads a set of CAD files.
> - Data from the CAD files is extracted in parallel.
> - A whole bunch of native tools operate on this extracted data set in
> their own pipe. Due to the amount of data generated and consumed, it
> doesn’t make sense at all to distribute these tasks to other machines.
> It’s very I/O-bound. 
> - For the same reason, it doesn’t make sense to distribute data using
> RDDs. It’s rather favorable to do only some tasks (such as CAD data
> extraction) in parallel, and otherwise run other data tasks as a group
> on a single node, in order to avoid I/O bottlenecks.
> 
> So I don’t have typical Big Data processing in mind. What I’m looking
> for is rather an integrated environment that provides some kind of
> parallel task execution, task management and administration, as well
> as a message bus and event system.
> 
> Is Beam a choice for such a rather non-Big-Data scenario?
> 
> Regards,
> Ben
> 
> 
> On 21.05.16, at 18:59, "Jean-Baptiste Onofré" <[email protected]> wrote:
> 
>> Hi Ben,
>> 
>> it's not SDK-related; it depends more on the runner.
>> 
>> What runner are you using?
>> 
>> Regards
>> JB
>> 
>> On 05/21/2016 04:22 PM, Stadin, Benjamin wrote:
>>> Hi,
>>> 
>>> I need to control Beam pipes/filters so that pipe executions that match
>>> certain criteria are executed on the same node.
>>> 
>>> In Spring XD this can be controlled by defining groups
>>> (http://docs.spring.io/spring-xd/docs/1.2.0.RELEASE/reference/html/#deployment)
>>> and then specifying deployment criteria to match the group.
>>> 
>>> Is this possible with Beam?
>>> 
>>> Best
>>> Ben
>> 
>> -- 
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
> 
