On 16 Nov 2016, at 22:34, Edden Burrow <eddenbur...@gmail.com> wrote:

Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1 
because we're seeing intermittent errors in S3 where jobs fail and 
saveAsTextFile fails. Using pyspark.

How many files? Thousands? Millions?

If you do have some big/complex file structure, I'd really like to know; it not 
only helps us make sure that Spark, the Hive metastore, and s3a can handle the 
layout, it may also help us improve our advice on what not to do.


Is there any issue with working in a S3 folder that has too many files?  How 
about having versioning enabled? Are these things going to be a problem?

Many, many files shouldn't be a problem, other than slowing down some 
operations and creating larger in-memory structures to get passed round. 
Partitioning can get slow.


We're pre-building the S3 file list, storing it in a file, and passing it to 
textFile as a long comma-separated list of files, so we are not running a 
listing ourselves.

But we get errors with saveAsTextFile, related to ListBucket, even though 
we're not using the wildcard '*'.

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: 
Failed to parse XML document with handler class 
org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler


At a guess, it'll be the checks before the write that the parent directory 
exists and that the destination path isn't a directory.


Running spark 2.0.1 with the s3a protocol.

Not with a stack trace containing org.jets3t you aren't. That's what you'd 
expect for s3 and s3n; a key feature of s3a is moving onto the Amazon SDK, 
where stack traces move to com.amazonaws classes.

Make sure you *are* using s3a, ideally on Hadoop 2.7.x (or, even better, HDP 
2.5, where you get all the Hadoop 2.8 read-pipeline optimisations). On Hadoop 
2.6.x there were still some stabilisation issues that only surfaced in the wild.
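One quick way to check which connector is actually on the path is to submit with s3a explicitly configured. A hedged sketch only: the credentials, job script, and jar version are placeholders, and the hadoop-aws version must match your Hadoop build:

```shell
# Hypothetical spark-submit invocation; keys and versions are
# placeholders. With s3a correctly wired up, any S3 stack trace
# should show org.apache.hadoop.fs.s3a / com.amazonaws classes,
# not org.jets3t.
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY \
  my_job.py   # the job should read/write s3a:// URLs only
```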

Some related slides 
http://www.slideshare.net/steve_l/apache-spark-and-object-stores

-Steve
