Re: More instances = slower Spark job

Steve Loughran Sun, 01 Oct 2017 15:54:55 -0700

> On 28 Sep 2017, at 14:45, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text 
> file? Does it use InputFormat do create multiple splits and creates 1 
> partition per split?


Yes, Input formats give you their splits, this is usually used to decide how to 
break things up, As to how that gets used: you'll have to look at the source as 
I'll only get it wrong. Key point: it's part of the information which can be 
used to partition the work, but the number of available workers is the other 
big factor.



> Also, in case of S3 or NFS, how does the input split work? I understand for 
> HDFS files are already pre-split so Spark can use dfs.blocksize to determine 
> partitions. But how does it work other than HDFS?

there's invariably a config option to allow you tell spark what blocksize to 
work with, e.g fs.s3a.block.size ., which you set in spark defaults to 
something like

spark.hadoop.fs.s3a.block.size 67108864

to set it to 64MB. 

HDFS also provides locality information: where the data is. Other filesytems 
don't do that, they usually just say "localhost", which Spark recognises as 
"anywhere"...it schedules work on different parts of a file wherever there is 
free capacity.


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: More instances = slower Spark job

Reply via email to