I am working on a streaming job that writes Parquet to S3 with DirectParquetOutputCommitter. I need to use partitionBy, and hence SaveMode.Append.
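For context, the write is along these lines (a simplified, non-streaming sketch; the partition column, sample data, and app name are made up, the S3 path is the one from the error below):

    // Simplified sketch of the write path (Spark 1.x API)
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    val sc = new SparkContext(new SparkConf().setAppName("parquet-append-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder data; in the real job this comes from the stream
    val df = sc.parallelize(Seq(("2016-03-04", "a"), ("2016-03-04", "b"))).toDF("dt", "value")

    df.write
      .partitionBy("dt")          // partitioned output is why SaveMode.Append is needed
      .mode(SaveMode.Append)
      .parquet("s3n://jelez/parquet-data/")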
Apparently, when SaveMode.Append is used, Spark falls back to the default Parquet output committer and ignores DirectParquetOutputCommitter.
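For reference, the direct committer is enabled along these lines (a sketch; this is the Spark 1.x config key, and the package of DirectParquetOutputCommitter differs between Spark versions):

    // Sketch: enable the direct Parquet committer (class package varies by Spark version)
    sparkConfig.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")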
My problems are:

1. The copy from _temporary takes a lot of time.
2. I get job failures with:

   java.io.FileNotFoundException: File s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does not exist.
I have already disabled speculative execution:

    sparkConfig.set("spark.speculation", "false")
    sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
    sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
Any ideas? Opinions? Best practices?