Hi Swapnil,

1. All task scheduling and retries happen on the driver, so you are right
that a lot of communication happens between the driver and the cluster. It
all depends on how you want to set up your Spark application: whether the
application has direct access to the Spark cluster or is routed through a
gateway machine. You can make your decision accordingly.
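
For example, if the driver runs on a gateway machine, it must be reachable
from every worker. A minimal sketch (spark://master-host:7077 and
gateway-host are placeholders, not real addresses):

  val conf = new org.apache.spark.SparkConf()
    .setAppName("MyApp")
    .setMaster("spark://master-host:7077")
    .set("spark.driver.host", "gateway-host") // an address the workers can reach
  val sc = new org.apache.spark.SparkContext(conf)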

2. I am not familiar with the NFS layer's concurrency, but parallel reads
should be OK, I think. Someone with knowledge of NFS internals should
correct me if I am wrong.


On Fri, Aug 28, 2015 at 1:12 AM, Swapnil Shinde <swapnilushi...@gmail.com>
wrote:

> Thanks Rishitesh !!
> 1. I get that the driver doesn't need to be on the master, but there is a
> lot of communication between the driver and the cluster. That's why a
> co-located gateway was recommended. How big is the impact of the driver
> not being co-located with the cluster?
>
> 4. How does an HDFS split get assigned to a worker node when reading data
> from a remote Hadoop cluster? I am more interested in knowing how the MapR
> NFS layer is accessed in parallel.
>
> -
> Swapnil
>
>
> On Thu, Aug 27, 2015 at 2:53 PM, Rishitesh Mishra <
> rishi80.mis...@gmail.com> wrote:
>
>> Hi Swapnil,
>> Let me try to answer some of the questions. Answers inline. Hope it helps.
>>
>> On Thursday, August 27, 2015, Swapnil Shinde <swapnilushi...@gmail.com>
>> wrote:
>>
>>> Hello
>>> I am new to the Spark world and recently started exploring standalone
>>> mode. It would be great if I could get clarification on the doubts below:
>>>
>>> 1. Driver locality - The documentation mentions that "client" deploy
>>> mode is not good if the machine running "spark-submit" is not co-located
>>> with the worker machines, and that cluster mode is not available with
>>> standalone clusters. Therefore, do we have to submit all applications on
>>> the master machine? (Assuming we don't have a separate co-located
>>> gateway machine.)
>>>
>>
>> No. In standalone mode too, your master and driver machines can be
>> different. The driver should have access to the master as well as the
>> worker machines.
>>
>>
>>> 2. How does the above driver locality work with the Spark shell running
>>> on a local machine?
>>>
>>
>> The Spark shell itself acts as the driver. This means your local machine
>> must have access to all the cluster machines.
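>>
>> You can see this from the shell itself. A small sketch (assuming the
>> shell's built-in SparkContext, sc) prints the address the driver
>> registered under:
>>
>>   sc.getConf.get("spark.driver.host") // resolves to your local machine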
>>
>>>
>>> 3. I am a little confused about the role of the driver program. Does the
>>> driver do any computation in the Spark app life cycle? For instance, in
>>> a simple row-count app, worker nodes calculate local row counts. Does
>>> the driver sum up the local row counts? In short, where does the reduce
>>> phase run in this case?
>>>
>>
>> The role of the driver is to coordinate with the cluster manager for
>> initial resource allocation. After that, it schedules tasks on the
>> executors assigned to it. It does not do any computation (unless the
>> application itself does something on its own). The reduce phase is also
>> just a bunch of tasks, which get assigned to one or more executors.
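>>
>> To make that concrete, here is a minimal word-count sketch whose reduce
>> phase runs as shuffle tasks on the executors (assuming the shell's
>> SparkContext, sc; the HDFS path is a placeholder):
>>
>>   val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
>>   val counts = lines
>>     .flatMap(_.split("\\s+"))  // map-side tasks run on executors
>>     .map(word => (word, 1L))
>>     .reduceByKey(_ + _)        // reduce-side tasks run on executors too
>>   counts.take(10).foreach(println) // only this small result reaches the driver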
>>
>>>
>>> 4. In the case of accessing HDFS data over the network, do worker nodes
>>> read the data in parallel? How does HDFS data get accessed over the
>>> network in a Spark application?
>>>
>>
>>
>> Yes. Each worker gets a split to read, and the workers read their own
>> splits in parallel. This means all worker nodes should have access to the
>> Hadoop file system.
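>>
>> As a quick sketch (the remote path is a placeholder), you can see the
>> split-to-partition mapping directly:
>>
>>   val rdd = sc.textFile("hdfs://remote-namenode:8020/data/big.csv")
>>   rdd.partitions.size // roughly one partition per HDFS block; each
>>                       // partition is read by a task on some executor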
>>
>>
>>> Sorry if these questions have already been discussed.
>>>
>>> Thanks
>>> Swapnil
>>>
>>
>
