A little caution is needed, as one executor per node may not always be
ideal, especially when your nodes have lots of RAM (very large JVM heaps
tend to suffer from long GC pauses). But yes, using fewer executors has
benefits such as more efficient broadcasts.
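For instance, on hypothetical 128 GB / 32-core nodes, you might run four
moderate executors per node instead of one huge one. A minimal sketch in
Scala; all the numbers are illustrative, not a recommendation:

  // Split each big node into 4 moderate executors instead of one giant
  // heap. Assumes hypothetical 128 GB / 32-core nodes, 4-node cluster.
  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.instances", "16")  // 4 executors x 4 nodes
    .set("spark.executor.cores", "8")       // 4 x 8 = 32 cores per node
    .set("spark.executor.memory", "24g")    // keeps each heap well below 32 GB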

Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" <atan...@adobe.com> wrote:

> RE: # because I already have a bunch of InputSplits, do I still need to
> specify the number of executors to get processing parallelized?
>
> I would say it’s best practice to have as many executors as data nodes
> and as many cores as you can get from the cluster. If YARN has enough
> resources, it will deploy the executors distributed across the cluster,
> and each of them will then try to process the data locally (check the
> Spark UI for NODE_LOCAL), with as many splits in parallel as you defined
> in spark.executor.cores.
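>
> As a rough sketch of that setup in Scala (assuming a hypothetical
> 8-data-node cluster; the numbers are illustrative):
>
>   // One executor per data node, each running spark.executor.cores
>   // tasks (i.e. InputSplits) in parallel.
>   val conf = new org.apache.spark.SparkConf()
>     .set("spark.executor.instances", "8")  // assumed: one per data node
>     .set("spark.executor.cores", "8")      // parallel splits per executor
>   // 8 executors x 8 cores = up to 64 splits processed at once.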
>
> -adrian
>
> From: Sandy Ryza
> Date: Thursday, September 24, 2015 at 2:43 AM
> To: Anfernee Xu
> Cc: "user@spark.apache.org"
> Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark
> executors/task and Yarn containers
>
> Hi Anfernee,
>
> That's correct: each InputSplit will map to exactly one Spark partition.
>
> On YARN, each Spark executor maps to a single YARN container.  Each
> executor can run multiple tasks over its lifetime, both in parallel and
> sequentially.
>
> If you enable dynamic allocation, then after the stage that includes the
> InputSplits is submitted, Spark will try to request an appropriate number
> of executors.
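>
> A minimal sketch of enabling it (the min/max bounds are illustrative;
> the property names are the standard dynamic-allocation settings):
>
>   val conf = new org.apache.spark.SparkConf()
>     .set("spark.dynamicAllocation.enabled", "true")
>     .set("spark.shuffle.service.enabled", "true")      // required on YARN
>     .set("spark.dynamicAllocation.minExecutors", "2")  // illustrative
>     .set("spark.dynamicAllocation.maxExecutors", "50") // illustrative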
>
> The memory in the YARN resource requests is --executor-memory + what's set
> for spark.yarn.executor.memoryOverhead, which defaults to 10% of
> --executor-memory (with a floor of 384 MB).
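>
> A quick worked example, assuming --executor-memory 4g, the 10% default,
> and the 384 MB floor:
>
>   overhead  = max(384 MB, 0.10 * 4096 MB) ~= 410 MB
>   container = 4096 MB + 410 MB = 4506 MB requested from YARN
>
> (YARN then rounds the request up to a multiple of
> yarn.scheduler.minimum-allocation-mb.)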
>
> -Sandy
>
> On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com>
> wrote:
>
>> Hi Spark experts,
>>
>> I'm coming across these terms and finding some of them confusing; could
>> you please help me understand them better?
>>
>> For instance, I have implemented a Hadoop InputFormat to load my external
>> data into Spark; in turn, my custom InputFormat will create a bunch of
>> InputSplits (see the sketch after the questions). My questions are:
>>
>> # Each InputSplit will map exactly to one Spark partition, is that
>> correct?
>>
>> # If I run on Yarn, how does Spark executor/task map to Yarn container?
>>
>> # because I already have a bunch of InputSplits, do I still need to
>> specify the number of executors to get processing parallelized?
>>
>> # How does --executor-memory map to the memory requirement in Yarn's
>> resource request?
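>>
>> (For context, a minimal sketch of how such a format is wired in;
>> MyInputFormat, MyKey and MyValue are placeholder names:)
>>
>>   import org.apache.hadoop.conf.Configuration
>>
>>   // Each InputSplit produced by the format becomes one RDD partition.
>>   val rdd = sc.newAPIHadoopRDD(
>>     new Configuration(),
>>     classOf[MyInputFormat],  // placeholder: custom mapreduce InputFormat
>>     classOf[MyKey],          // placeholder key class
>>     classOf[MyValue])        // placeholder value class
>>   println(rdd.partitions.length)  // equals the number of InputSplits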
>>
>> --
>> --Anfernee
>>
>
>
