As the subject suggests, I want to write Parquet output to S3. I know this was rather troublesome in the past because S3 has no rename, so the committer had to do a copy+delete. This issue has been discussed before, see: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-files-to-s3-with-out-temporary-directory-tc28088.html
Now HADOOP-13786 <https://issues.apache.org/jira/browse/HADOOP-13786> fixes this problem in Hadoop 3.1.0 and later. How can I use that with Spark 2.3.3? I usually orchestrate my cluster on EC2 with flintrock <https://github.com/nchammas/flintrock>. Do I just set HDFS to 3.1.1 in the flintrock config and everything "just works"? Or do I also have to set a committer algorithm when I create my Spark context in pyspark, like this: .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', 'some_kind_of_Version')? Thanks for the help!
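For reference, here is roughly the kind of configuration I have in mind, based on the S3A committer settings described in the Hadoop documentation for HADOop-13786's committers. I am not sure whether the two spark.sql.* bindings below are available in a stock Spark 2.3.3 build (they come from the spark-hadoop-cloud module), so please treat this as a sketch, not a working setup:

```
# spark-defaults.conf (sketch, untested)

# Select the S3A "directory" staging committer instead of the rename-based one
spark.hadoop.fs.s3a.committer.name                         directory
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a  org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory

# Bind Spark's Parquet writer to the new committer
# (these classes are in the spark-hadoop-cloud module; availability for 2.3.3 is my open question)
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

As I understand it, the fileoutputcommitter.algorithm.version setting only switches between the classic v1/v2 rename-based algorithms, which is different from the new S3A committers above, so I'd like to know which of the two I actually need here.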