> On 28 Sep 2017, at 14:45, ayan guha <guha.a...@gmail.com> wrote: > > Hi > > Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text > file? Does it use InputFormat do create multiple splits and creates 1 > partition per split?
Yes, Input formats give you their splits, this is usually used to decide how to break things up, As to how that gets used: you'll have to look at the source as I'll only get it wrong. Key point: it's part of the information which can be used to partition the work, but the number of available workers is the other big factor. > Also, in case of S3 or NFS, how does the input split work? I understand for > HDFS files are already pre-split so Spark can use dfs.blocksize to determine > partitions. But how does it work other than HDFS? there's invariably a config option to allow you tell spark what blocksize to work with, e.g fs.s3a.block.size ., which you set in spark defaults to something like spark.hadoop.fs.s3a.block.size 67108864 to set it to 64MB. HDFS also provides locality information: where the data is. Other filesytems don't do that, they usually just say "localhost", which Spark recognises as "anywhere"...it schedules work on different parts of a file wherever there is free capacity. --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org