Re: Re: Rewriting a new file instead of writing a ".valid-length" file in BucketingSink when restoring

2018-05-15 Thread Till Rohrmann
Hi Xinyu,

would it help to have a small tool which can truncate the finished files
that have a valid-length file associated? That way, one could run this
tool before others consume the data further downstream.

Cheers,
Till
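Such a tool could be quite small. Below is a hedged sketch of the core logic, shown on a local filesystem for illustration only; a real tool would go through the Hadoop FileSystem API, and the assumption that the marker file contains the valid byte count as plain text is based on my reading of BucketingSink, not on its documented contract:

```python
import os

def truncate_to_valid_length(data_path: str, marker_path: str) -> int:
    """Truncate data_path to the length recorded in its valid-length marker.

    Assumes the marker file holds the valid byte count as plain text.
    """
    with open(marker_path) as f:
        valid_length = int(f.read().strip())
    # Drop everything past the last checkpointed offset, then remove the
    # marker so downstream jobs see a plain, fully valid file.
    os.truncate(data_path, valid_length)
    os.remove(marker_path)
    return valid_length
```

Run once over a bucket directory before handing the data to downstream map/reduce jobs; after that, no consumer needs to know about ".valid-length" files at all.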

On Tue, May 15, 2018 at 3:05 PM, Xinyu Zhang <342689...@qq.com> wrote:

> Yes, I'd be glad to do it, but I'm not sure writing a new file is a good
> solution. So I want to discuss it here.
> Do you have any ideas? @Kostas
>
>
>
>
> -- Original Message --
> From: "twalthr";
> Sent: Tuesday, May 15, 2018, 8:21 PM
> To: "Xinyu Zhang"<342689...@qq.com>;
> Cc: "dev"; "kkloudas";
> Subject: Re: Re: Rewriting a new file instead of writing a ".valid-length" file
> in BucketingSink when restoring
>
>
>
> As far as I know, the bucketing sink is currently also limited by
> relying on Hadoop's file system abstraction. It is planned to switch to
> Flink's file system abstraction, which might also improve this situation.
> Kostas (in CC) might know more about it.
>
> But I think we can discuss whether another behavior should be configurable
> as well. Would you be willing to contribute?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 14:01 schrieb Xinyu Zhang:
> > Thanks for your reply.
> > Indeed, if a file is very large, it will take a long time. However,
> > the ".valid-length" file is not convenient for others who use the
> > data in HDFS.
> > Maybe we should provide a configuration for users to choose which
> > strategy they prefer.
> > Do you have any ideas?
> >
> >
> > -- Original Message --
> > *From:* "Timo Walther";
> > *Sent:* Tuesday, May 15, 2018, 7:30 PM
> > *To:* "dev";
> > *Subject:* Re: Rewriting a new file instead of writing a ".valid-length"
> > file in BucketingSink when restoring
> >
> > I guess writing a new file would take much longer than just using the
> > .valid-length file, especially if the files are very large. The
> > restoring time should be as minimal as possible to ensure little
> > downtime on restarts.
> >
> > Regards,
> > Timo
> >
> >
> > Am 15.05.18 um 09:31 schrieb Gary Yao:
> > > Hi,
> > >
> > > The BucketingSink truncates the file if the Hadoop FileSystem
> > > supports this operation (Hadoop 2.7 and above) [1]. What version of
> > > Hadoop are you using?
> > >
> > > Best,
> > > Gary
> > >
> > > [1]
> > >
> > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301
> > >
> > > On Mon, May 14, 2018 at 1:37 PM, 张馨予 <342689...@qq.com> wrote:
> > >
> > >> Hi
> > >>
> > >>
> > >> I'm trying to copy data from Kafka to HDFS. The data in HDFS is
> > >> then used by others for further computations in map/reduce.
> > >> If some tasks fail, the ".valid-length" file is created for older
> > >> Hadoop versions. The problem is that other people must know how to
> > >> deal with the ".valid-length" file; otherwise, the data may not be
> > >> exactly-once.
> > >> Hence, why not rewrite a new file when restoring instead of writing a
> > >> ".valid-length" file? That way, others who use the data in HDFS
> > >> don't need to know how to deal with the ".valid-length" file.
> > >>
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> Zhang Xinyu
> >
>
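Until such a rewrite or truncation step exists, the contract the thread describes amounts to: a downstream reader must consume only the first N bytes of a part file whenever a valid-length marker is present. A minimal reader-side sketch follows; note that the marker naming used here is a hypothetical default, since the sink's actual valid-length prefix and suffix are configurable:

```python
import os

def read_valid_bytes(data_path: str) -> bytes:
    # If a companion valid-length marker exists, only that many leading
    # bytes belong to a completed checkpoint; anything beyond it may be
    # uncommitted data from before a failure.
    marker = data_path + ".valid-length"  # hypothetical naming convention
    limit = None
    if os.path.exists(marker):
        with open(marker) as f:
            limit = int(f.read().strip())
    with open(data_path, "rb") as f:
        return f.read() if limit is None else f.read(limit)
```

Every map/reduce consumer would have to apply this logic, which is exactly the burden the thread argues for removing.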

