I have a system where I'm saving Parquet files to S3 via Spark.  The data is
partitioned two ways, first by date and then by a partition key, and multiple
Parquet files accumulate per combination over a long period of time.  So the
structure looks like this:

s3://bucketname/date=2016-02-29/partitionkey=2342/filename.parquet.gz
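
For context, the write that produces this layout looks roughly like the
sketch below.  This assumes a Spark 1.x spark-shell where `sqlContext` is
predefined; the `events` table and bucket name are placeholders for our
real inputs:

  // Sketch only: `events` stands in for our upstream source; the DataFrame
  // carries `date` and `partitionkey` columns matching the path layout above.
  val df = sqlContext.table("events")

  df.write
    .partitionBy("date", "partitionkey")   // yields date=.../partitionkey=.../ dirs
    .parquet("s3://bucketname/")           // default SaveMode is ErrorIfExists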

There’s been disagreement about how the SaveMode should be set when saving out
the data.  If we keep the SaveMode as ErrorIfExists, does that mean additional
partitions or Parquet files written out later under the same parts of the
subpath will fail to be written?  Also, does the SaveMode apply to tasks too?
Say we are using the Direct Output Committer, and a failure in a task causes
some of its files to be written and others not.  Would each individual file
then automatically inherit the SaveMode semantics, or does the SaveMode apply
only to the output as a whole?
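
To make the question concrete, here is a sketch of the two pieces involved,
reusing `df` from the sketch above.  The config key and committer class are
my best recollection of the Spark 1.x settings and may differ by release, so
treat them as assumptions rather than verified values:

  import org.apache.spark.sql.SaveMode

  // Assumption: Spark 1.x configuration for the direct Parquet committer;
  // the exact key and class path vary across versions, so verify first.
  sqlContext.setConf(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

  df.write
    .mode(SaveMode.ErrorIfExists)          // does this check only the root path,
    .partitionBy("date", "partitionkey")   // or each partition directory/file
    .parquet("s3://bucketname/")           // that a task writes directly to S3?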

Peter Halliday