From: Aaron Davidson [mailto:ilike...@gmail.com]
Sent: Tuesday, March 17, 2015 3:06 PM
To: Imran Rashid
Cc: Shuai Zheng; user@spark.apache.org
Subject: Re: Spark will process _temporary folder on S3 is very slow and always cause failure
Actually, this is the more relevant JIRA (which is resolved):
https://issues.apache.org/jira/browse/SPARK-3595
SPARK-6352 is about saveAsParquetFile, which is not in use here.
Here is a DirectOutputCommitter implementation:
https://gist.github.com/aarondav/c513916e72101bbe14ec
and it can be configured by setting the Hadoop property mapred.output.committer.class.
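For reference, this is roughly how such a committer could be wired in (a sketch; the unqualified DirectOutputCommitter class name is an assumption -- substitute the fully qualified name of the class you build from the gist):

```shell
# Sketch: point the old mapred API at a direct committer.
# The spark.hadoop.* prefix forwards the property into the Hadoop conf
# that Spark's save* actions use.
spark-submit \
  --conf spark.hadoop.mapred.output.committer.class=DirectOutputCommitter \
  your-app.jar
```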
I'm not super familiar w/ S3, but I think the issue is that you want to use
a different output committer with "object" stores, which don't have a
simple move operation. There have been a few other threads on S3 &
output committers. I think the most relevant for you is probably this
open JIRA:
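To see why the default committer hurts on an object store: tasks write their output under _temporary/ and the commit step then relies on rename. A minimal Python sketch of that commit step (on a plain filesystem the move is a cheap metadata rename; on S3 there is no rename, so each move becomes a full copy + delete of the object's bytes):

```python
import os
import shutil
import tempfile

def commit_task(output_dir, task_id):
    # Move every file written under _temporary/<task_id> into the final
    # output directory, then remove the now-empty task directory.
    tmp = os.path.join(output_dir, "_temporary", str(task_id))
    for name in os.listdir(tmp):
        shutil.move(os.path.join(tmp, name), os.path.join(output_dir, name))
    os.rmdir(tmp)

# Simulate one task's output, then commit it.
out = tempfile.mkdtemp()
os.makedirs(os.path.join(out, "_temporary", "0"))
with open(os.path.join(out, "_temporary", "0", "part-00000"), "w") as f:
    f.write("data")

commit_task(out, "0")
print(sorted(p for p in os.listdir(out) if not p.startswith("_")))  # -> ['part-00000']
```

A "direct" committer sidesteps this by writing straight to the final location, at the cost of leaving partial output behind if the job fails.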
If you use fileStream, there's an option to filter out files. In your case
you can easily create a filter to remove _temporary files. In that case,
you will have to move your code inside foreachRDD of the DStream, since the
application will become a streaming app.
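The actual Scala fileStream API takes a Path => Boolean predicate; the logic of such a filter is simple enough to sketch in Python (the "skip any component starting with an underscore" rule is an assumption that also happens to cover _SUCCESS markers):

```python
def accept_path(path: str) -> bool:
    # Reject any path with a component starting with "_", e.g. files
    # still sitting under a _temporary task directory.
    return not any(part.startswith("_")
                   for part in path.rstrip("/").split("/"))

print(accept_path("s3://bucket/out/part-00000"))           # True
print(accept_path("s3://bucket/out/_temporary/0/part-0"))  # False
```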
Thanks
Best Regards
On Sat, Mar
And one thing I forgot to mention: even though I get this exception, the result
is not well formatted in my target folder (part of the files are there; the rest
are under a different folder structure inside the _temporary folder). In the
web UI of spark-shell, it is still marked as a successful step. I think this is
a bug?