darkjh wrote
> But in my experience, when reading directly from
> s3n, Spark creates only one input partition per file, regardless of the file
> size. This may lead to performance problems if you have big files.

This is actually not true. Spark uses the underlying Hadoop input formats to
read the files, so if the input format you are using supports splittable
files (text, Avro, etc.) then it can use multiple splits per file (leading to
multiple map tasks per file). You do have to set the max input split size,
for example:

FileInputFormat.setMaxInputSplitSize(job, 256000000L)

In this case, any file larger than 256,000,000 bytes (~256 MB) is split. If you
don't set it explicitly, the limit is effectively infinite, which leads to the
behavior you are seeing: one split per file.
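
For reference, here is a minimal sketch of how this can be wired into a Spark
job using the new (mapreduce) Hadoop API. The bucket/path, the choice of
TextInputFormat, and sc (an existing SparkContext) are just assumptions for
illustration, not something from the original question:

// Minimal sketch: read S3 text files with a bounded max split size so that
// large files are split into multiple partitions / map tasks.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = Job.getInstance(sc.hadoopConfiguration)
// Any file larger than ~256 MB is read as multiple splits.
FileInputFormat.setMaxInputSplitSize(job, 256000000L)

val lines = sc.newAPIHadoopFile(
  "s3n://my-bucket/big-dataset/*",   // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  job.getConfiguration
).map { case (_, text) => text.toString }

println(lines.partitions.length)     // should exceed the number of input files

The same effect can be had with any splittable input format; the point is only
that the max split size has to be set on the configuration you pass to Spark.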

Regards,
Paul Hamilton



