I am working on a streaming job that writes Parquet to S3 with DirectParquetOutputCommitter. I need to use partitionBy, and hence SaveMode.Append.
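For context, the write is along these lines (a simplified, non-streaming sketch; the partition column, sample data, and app name are made up, the S3 path is the one from the error below):

    // Simplified sketch of the write path (Spark 1.x API)
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    val sc = new SparkContext(new SparkConf().setAppName("parquet-append-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder data; in the real job this comes from the stream
    val df = sc.parallelize(Seq(("2016-03-04", "a"), ("2016-03-04", "b"))).toDF("dt", "value")

    df.write
      .partitionBy("dt")          // partitioned output is why SaveMode.Append is needed
      .mode(SaveMode.Append)
      .parquet("s3n://jelez/parquet-data/")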
Apparently, when SaveMode.Append is used, Spark falls back to the default Parquet output committer and ignores DirectParquetOutputCommitter.
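For reference, the direct committer is enabled along these lines (a sketch; this is the Spark 1.x config key, and the package of DirectParquetOutputCommitter differs between Spark versions):

    // Sketch: enable the direct Parquet committer (class package varies by Spark version)
    sparkConfig.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")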
My problems are:

1. The copy from _temporary takes a lot of time.
2. I get job failures with:

   java.io.FileNotFoundException: File s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does not exist.
I have already disabled speculative execution:

    sparkConfig.set("spark.speculation", "false")
    sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
    sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
Any ideas? Opinions? Best practices?