Adding this simple setting helped me overcome the issue:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
My issue:

In an S3 folder, I previously had data partitioned by *ingestiontime*.
Now I wanted to reprocess this data and partition it by
*businessname* and *ingestiontime*.

Whenever I wrote my DataFrame in overwrite mode, all the data
that was present prior to the operation was TRUNCATED/DELETED.

After setting the above Spark configuration, only the partitions present
in the DataFrame being written are truncated and overwritten, and all
others stay intact.
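
For reference, the reprocessing write looked roughly like this. This is
only a sketch: the paths and the source DataFrame are placeholders, not
my actual job.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Replace only the partitions present in the DataFrame being written,
    // instead of truncating the whole target directory on overwrite.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Hypothetical paths - substitute your own bucket/prefixes.
    val df = spark.read.parquet("s3://my-bucket/source/")

    df.write
      .mode("overwrite")
      .partitionBy("businessname", "ingestiontime")
      .parquet("s3://my-bucket/target/")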

In addition, if you have the Hadoop trash feature enabled, you might be
able to recover the lost data. For more:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#File_Deletes_and_Undeletes


