We can just add back a flag to make it backwards compatible - it was just missed during the original PR.
Adding a *third* set of "clobber" semantics, I'm slightly -1 on that for the following reasons:

1. It's scary to have Spark recursively deleting user files; it could easily lead to users deleting data by mistake if they don't understand the exact semantics.
2. It would introduce a third set of semantics for the saveAsXX... methods.
3. It's trivial for users to implement this themselves with two lines of code (if the output directory exists, delete it) before calling saveAsHadoopFile.

- Patrick

On Mon, Jun 2, 2014 at 2:49 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Ah yes, this was indeed intended to have been taken care of:
>
>> add some new APIs with a flag for users to define whether he/she wants to
>> overwrite the directory: if the flag is set to true, then the output
>> directory is deleted first and the new data is then written into it, to
>> prevent the output directory from containing results from multiple rounds
>> of running;
>
> On Mon, Jun 2, 2014 at 5:47 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>>
>> I made the PR; the problem is that after many rounds of review, that
>> configuration part was missed... sorry about that.
>>
>> I will fix it.
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>> On Monday, June 2, 2014 at 5:13 PM, Pierre Borckmans wrote:
>>
>> I'm a bit confused, because the PR mentioned by Patrick seems to address
>> all these issues:
>>
>> https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>>
>> Was it not accepted? Or is the description of this PR not completely
>> implemented?
>>
>> Message sent from a mobile device - excuse typos and abbreviations
>>
>> On 2 June 2014 at 23:08, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>> OK, thanks for confirming. Is there something we can do about that
>> leftover part- files problem in Spark, or is that for the Hadoop team?
>>
>> On Monday, June 2, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> Yes.
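The "two lines of code" workaround Patrick mentions (delete the output directory if it exists, then save) can be sketched as follows. This is an illustration only, not Spark API: plain local files stand in for Spark part- files, and the helper name is hypothetical. In real Spark code the same pattern would use Hadoop's FileSystem.delete before calling saveAsHadoopFile.

```python
import os
import shutil


def save_with_overwrite(records, output_dir):
    """Delete the output directory if it exists, then write fresh output.

    Hypothetical helper mimicking the delete-before-save workaround;
    ordinary text files stand in for Spark's part- files.
    """
    if os.path.exists(output_dir):   # the "two lines of code":
        shutil.rmtree(output_dir)    # remove any stale output first
    os.makedirs(output_dir)
    for i, record in enumerate(records):
        with open(os.path.join(output_dir, "part-%05d" % i), "w") as f:
            f.write(str(record) + "\n")
```

Because the whole directory is removed up front, a second save can never leave stale part- files behind, which is exactly the hazard discussed further down in this thread.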
>>
>> On Mon, Jun 2, 2014 at 1:23 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>> So in summary:
>>
>> - As of Spark 1.0.0, saveAsTextFile() will no longer clobber by default.
>> - There is an open JIRA issue to add an option to allow clobbering.
>> - Even when clobbering, part- files may be left over from previous saves,
>>   which is dangerous.
>>
>> Is this correct?
>>
>> On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> +1, please re-add this feature.
>>
>> On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>
>> Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
>> I accidentally assigned myself way back when I created it). This
>> should be an easy fix.
>>
>> On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
>> > Hi Patrick,
>> >
>> > I think https://issues.apache.org/jira/browse/SPARK-1677 is talking
>> > about the same thing?
>> >
>> > How about assigning it to me?
>> >
>> > I think I missed the configuration part in my previous commit, though I
>> > declared it in the PR description...
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> >
>> > On Monday, June 2, 2014 at 3:03 PM, Patrick Wendell wrote:
>> >
>> > Hey there,
>> >
>> > The issue was that the old behavior could cause users to silently
>> > overwrite data, which is pretty bad, so to be conservative we decided
>> > to enforce the same checks that Hadoop does.
>> >
>> > This was documented by this JIRA:
>> > https://issues.apache.org/jira/browse/SPARK-1100
>> > https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
>> >
>> > However, it would be very easy to add an option that allows preserving
>> > the old behavior. Is anyone here interested in contributing that? I
>> > created a JIRA for it:
>> >
>> > https://issues.apache.org/jira/browse/SPARK-1993
>> >
>> > - Patrick
>> >
>> > On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
>> > <pierre.borckm...@realimpactanalytics.com> wrote:
>> >
>> > Indeed, the behavior has changed, for good or for bad. I agree with
>> > the danger you mention, but I'm not sure it's happening like that.
>> > Isn't there a mechanism for overwrite in Hadoop that automatically
>> > removes the part- files, writes to a _temporary folder, and only then
>> > moves the part- files into place along with the _SUCCESS marker?
>> >
>> > In any case, this change of behavior should be documented, IMO.
>> >
>> > Cheers,
>> > Pierre
>> >
>> > Message sent from a mobile device - excuse typos and abbreviations
>> >
>> > On 2 June 2014 at 17:42, Nicholas Chammas <nicholas.cham...@gmail.com>
>> > wrote:
>> >
>> > What I've found using saveAsTextFile() against S3 (prior to Spark
>> > 1.0.0) is that files get overwritten automatically. There is one
>> > danger to this, though: if I save to a directory that already has 20
>> > part- files, but this time around I'm only saving 15 part- files, then
>> > there will be 5 leftover part- files from the previous set mixed in
>> > with the 15 newer files. This is potentially dangerous.
>> >
>> > I haven't checked to see if this behavior has changed in 1.0.0. Are you
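The leftover-files hazard Nicholas describes (20 old part- files, 15 new ones, 5 stale survivors) is easy to reproduce with a few lines of plain Python. The helper below is purely illustrative, not Spark API: it overwrites files one by one without clearing the directory first, which is the per-file clobber behavior being discussed.

```python
import os


def overwrite_files_only(records, output_dir):
    """Write part- files into output_dir WITHOUT clearing it first.

    Hypothetical stand-in for per-file overwriting: existing files with
    matching names are replaced, but extra files from an earlier, larger
    save are silently left in place.
    """
    os.makedirs(output_dir, exist_ok=True)
    for i, record in enumerate(records):
        with open(os.path.join(output_dir, "part-%05d" % i), "w") as f:
            f.write(str(record) + "\n")
```

Saving 20 records and then 15 into the same directory leaves 20 files behind: the 15 fresh ones plus 5 stale part- files from the first run, mixed together with no way to tell them apart.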