We can just add back a flag to make it backwards compatible - it was just missed during the original PR.
Adding a *third* set of "clobber" semantics, I'm slightly -1 on that for the following reasons:

1. It's scary to have Spark recursively deleting user files; it could easily lead to users deleting data by mistake if they don't understand the exact semantics.
2. It would introduce a third set of semantics for the saveAsXX... methods.
3. It's trivial for users to implement this themselves with two lines of code (if the output directory exists, delete it) before calling saveAsHadoopFile.

- Patrick

On Mon, Jun 2, 2014 at 2:49 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Ah yes, this was indeed intended to have been taken care of:
>
>> add some new APIs with a flag for users to define whether he/she wants to
>> overwrite the directory: if the flag is set to true, then the output
>> directory is deleted first and the new data is then written into it, to
>> prevent the output directory from containing results from multiple rounds
>> of running;
>
> On Mon, Jun 2, 2014 at 5:47 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>> I made the PR; the problem is that after many rounds of review, that
>> configuration part was missed... sorry about that.
>>
>> I will fix it.
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>> On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:
>>
>> I'm a bit confused, because the PR mentioned by Patrick seems to address
>> all these issues:
>>
>> https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>>
>> Was it not accepted? Or is the description of this PR not completely
>> implemented?
>>
>> Message sent from a mobile device - excuse typos and abbreviations
>>
>> On 2 June 2014 at 23:08, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>> OK, thanks for confirming. Is there something we can do about that
>> leftover part- files problem in Spark, or is that for the Hadoop team?
>>
>> On Monday, June 2, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> Yes.
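The "two lines of code" workaround Patrick mentions (delete the output directory if it exists, then save) can be sketched as follows. This is an illustration only, not Spark API: plain local files stand in for Spark part- files, and the helper name is hypothetical. In real Spark code the same pattern would use Hadoop's FileSystem.delete before calling saveAsHadoopFile.

```python
import os
import shutil


def save_with_overwrite(records, output_dir):
    """Delete the output directory if it exists, then write fresh output.

    Hypothetical helper mimicking the delete-before-save workaround;
    ordinary text files stand in for Spark's part- files.
    """
    if os.path.exists(output_dir):   # the "two lines of code":
        shutil.rmtree(output_dir)    # remove any stale output first
    os.makedirs(output_dir)
    for i, record in enumerate(records):
        with open(os.path.join(output_dir, "part-%05d" % i), "w") as f:
            f.write(str(record) + "\n")
```

Because the whole directory is removed up front, a second save can never leave stale part- files behind, which is exactly the hazard discussed further down in this thread.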
>>
>> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>> So in summary:
>>
>> - As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
>> - There is an open JIRA issue to add an option to allow clobbering.
>> - Even when clobbering, part- files may be left over from previous saves,
>>   which is dangerous.
>>
>> Is this correct?
>>
>> On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> +1, please re-add this feature.
>>
>> On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>
>> Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
>> I accidentally assigned myself way back when I created it). This
>> should be an easy fix.
>>
>> On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>> > Hi Patrick,
>> >
>> > I think https://issues.apache.org/jira/browse/SPARK-1677 is talking
>> > about the same thing?
>> >
>> > How about assigning it to me?
>> >
>> > I think I missed the configuration part in my previous commit, though I
>> > declared it in the PR description...
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> >
>> > On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
>> >
>> > Hey there,
>> >
>> > The issue was that the old behavior could cause users to silently
>> > overwrite data, which is pretty bad, so to be conservative we decided
>> > to enforce the same checks that Hadoop does.
>> >
>> > This was documented by this JIRA:
>> > https://issues.apache.org/jira/browse/SPARK-1100
>> > https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>> >
>> > However, it would be very easy to add an option that allows preserving
>> > the old behavior. Is anyone here interested in contributing that? I
>> > created a JIRA for it:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-1993
>> >
>> > - Patrick
>> >
>> > On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
>> > <pierre.borckm...@realimpactanalytics.com> wrote:
>> >
>> > Indeed, the behavior has changed, for good or for bad. I agree with
>> > the danger you mention, but I'm not sure it's happening like that.
>> > Isn't there a mechanism for overwrite in Hadoop that automatically
>> > removes the part- files, writes to a _temporary folder, and only then
>> > moves the part- files into place along with the _SUCCESS marker?
>> >
>> > In any case, this change of behavior should be documented, IMO.
>> >
>> > Cheers,
>> > Pierre
>> >
>> > Message sent from a mobile device - excuse typos and abbreviations
>> >
>> > On 2 June 2014 at 17:42, Nicholas Chammas <nicholas.cham...@gmail.com>
>> > wrote:
>> >
>> > What I've found using saveAsTextFile() against S3 (prior to Spark
>> > 1.0.0) is that files get overwritten automatically. There is one
>> > danger to this, though: if I save to a directory that already has 20
>> > part- files, but this time around I'm only saving 15 part- files, then
>> > there will be 5 leftover part- files from the previous set mixed in
>> > with the 15 newer files. This is potentially dangerous.
>> >
>> > I haven't checked to see if this behavior has changed in 1.0.0. Are you
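The leftover-files hazard Nicholas describes (20 old part- files, 15 new ones, 5 stale survivors) is easy to reproduce with a few lines of plain Python. The helper below is purely illustrative, not Spark API: it overwrites files one by one without clearing the directory first, which is the per-file clobber behavior being discussed.

```python
import os


def overwrite_files_only(records, output_dir):
    """Write part- files into output_dir WITHOUT clearing it first.

    Hypothetical stand-in for per-file overwriting: existing files with
    matching names are replaced, but extra files from an earlier, larger
    save are silently left in place.
    """
    os.makedirs(output_dir, exist_ok=True)
    for i, record in enumerate(records):
        with open(os.path.join(output_dir, "part-%05d" % i), "w") as f:
            f.write(str(record) + "\n")
```

Saving 20 records and then 15 into the same directory leaves 20 files behind: the 15 fresh ones plus 5 stale part- files from the first run, mixed together with no way to tell them apart.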