No problem!

Cheers,
Kostas
On Fri, Mar 29, 2019 at 4:38 PM Bruno Aranda <[email protected]> wrote:

> Hi Kostas,
>
> Put that way, sounds fair enough. Many thanks for the clarification,
>
> Cheers,
>
> Bruno
>
> On Fri, 29 Mar 2019 at 15:32, Kostas Kloudas <[email protected]> wrote:
>
>> Hi Bruno,
>>
>> This is the expected behaviour, as the job starts "fresh" given that you did not specify any savepoint/checkpoint to start from.
>>
>> As for the note that "One would expect it to find the last part and use the next free number?", I am not sure how this can be achieved safely and efficiently in an eventually consistent object store like S3. This is actually the reason why, contrary to the BucketingSink, the StreamingFileSink relies on Flink's own state to determine the "next" part counter.
>>
>> Cheers,
>> Kostas
>>
>> On Fri, Mar 29, 2019 at 4:22 PM Bruno Aranda <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> One of the main reasons we moved to version 1.7 (and 1.7.2 in particular) was the possibility of using a StreamingFileSink with S3.
>>>
>>> We've configured a StreamingFileSink to use a DateTimeBucketAssigner to bucket by day. It has a parallelism of 1 and is writing to S3 from an EMR cluster in AWS.
>>>
>>> We ran the job and, after a few hours of activity, manually cancelled it through the jobmanager API. After confirming that a number of "part-0-x" files existed in S3 at the expected path, we then started the job again using the same invocation of the CLI "flink run..." command that was originally used to start it.
>>>
>>> It started writing data to S3 again, starting afresh from "part-0-0", which gradually overwrote the existing data.
>>>
>>> I can understand that not having used a checkpoint gives no indication of where to resume, but the fact that it overwrites the existing files (as it starts to write to "part-0-0" again) is surprising. One would expect it to find the last part and use the next free number?
>>>
>>> We're definitely using the flink-s3-fs-hadoop-1.7.2.jar and don't have the Presto version on the classpath.
>>>
>>> Is this the expected behaviour? We have not seen this in the non-streaming versions of the sink.
>>>
>>> Best regards,
>>>
>>> Bruno
>>>
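[Editor's note] For reference, below is a minimal sketch of the kind of setup described in the thread: a StreamingFileSink writing row-encoded strings to S3, bucketed by day, with parallelism 1, against the Flink 1.7 DataStream API in Java. The bucket path, date format, checkpoint interval, and placeholder source are illustrative assumptions, not the exact job from the thread.

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class DailyS3SinkJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing is what lets the StreamingFileSink keep its part counters in Flink state;
        // without restoring that state, a restarted job begins again at part-0-0.
        env.enableCheckpointing(60_000);

        // Placeholder source for illustration only.
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        StreamingFileSink<String> sink = StreamingFileSink
            .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
            .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd")) // one bucket per day
            .build();

        events.addSink(sink).setParallelism(1);

        env.execute("daily-s3-sink");
    }
}
```

As Kostas notes, the part counter lives in Flink state rather than being derived from the files already in S3, so the counter only continues where it left off if the job is resumed from a savepoint or retained checkpoint (e.g. `flink run -s <savepointPath> ...`) instead of being started with a plain `flink run`.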
