Re: Plans for the future iceberg 0.11.0 release

2020-11-03 Thread Steven Wu
I agree with OpenInx that the FLIP-27 Flink source is unlikely to make the
November release schedule, so we should postpone it to 0.12.0.


Re: Plans for the future iceberg 0.11.0 release

2020-11-02 Thread OpenInx
Hi Ryan,

Got your plan! If we tentatively plan to release 0.11.0 in November, then for
Flink I think we should finish the rewrite actions and the Flink streaming
reader first.

The Flink CDC integration and FLIP-27 work would need more time, so it's
reasonable not to block the 0.11.0 release on them. We can still make good
progress in separate PRs.

Thanks.



Re: Plans for the future iceberg 0.11.0 release

2020-11-02 Thread Ryan Blue
Thanks for starting the 0.11.0 milestone! In the last sync, we talked about
having a release in November to make the new catalogs and possibly S3FileIO
available, so those should tentatively go on the 0.11.0 list as well. I say
tentatively because I'm in favor of releasing when features are ready and
trying not to block at this stage in the project.

In addition, I think we can make some progress on Hive integration. There
is a PR to create tables using Hive DDL without needing to pass a
JSON-serialized schema that would be good to get in, and I think it would
be good to get the basic write path committed as well.


Re: Plans for the future iceberg 0.11.0 release

2020-11-01 Thread OpenInx
Thanks for the context about FLIP-27, Steven!

I will take a look at the patches under issue 1626.



Re: Plans for the future iceberg 0.11.0 release

2020-10-30 Thread Steven Wu
OpenInx, thanks a lot for kicking off the discussion. Looks like my
previous reply didn't reach the mailing list.

> flink source based on the new FLIP-27 interface

Yes, we should target the 0.11.0 release for the FLIP-27 Flink source. I have
updated the issue [1] with the following scope:

   - Support both static/batch and continuous/streaming enumeration modes.
   - Support only a simple assigner with no ordering/locality guarantee when
   handing out split assignments, but keep the interface flexible enough to
   plug in different assigners (like an event-time alignment assigner or a
   locality-aware assigner).
   - It will have @Experimental status, since nobody has run FLIP-27 sources in
   production yet. The Flink 1.12.0 release (ETA end of November) will ship the
   first set of sources (Kafka and file) implemented with the FLIP-27 source
   framework; we still need to gain more production experience.


[1] https://github.com/apache/iceberg/issues/1626
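As a rough sketch of the pluggable-assigner idea above, the contract could look
like the following. All names here (IcebergSplit, SplitAssigner) are
hypothetical stand-ins, not the actual Iceberg source API:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical sketch of a pluggable split assigner for a FLIP-27 style
// source; the real interface in the Iceberg FLIP-27 work may differ.
public class SimpleAssignerSketch {

    // Stand-in for an Iceberg source split (roughly, a group of file scan tasks).
    record IcebergSplit(String splitId) {}

    // The pluggable contract: the enumerator adds discovered splits and asks
    // for the next assignment per subtask. Implementations could add
    // event-time alignment or locality awareness later.
    interface SplitAssigner {
        void addSplit(IcebergSplit split);
        Optional<IcebergSplit> getNext(int subtaskId);
    }

    // Simple FIFO assigner: no ordering or locality guarantee.
    static class SimpleAssigner implements SplitAssigner {
        private final Queue<IcebergSplit> pending = new ArrayDeque<>();

        @Override public void addSplit(IcebergSplit split) {
            pending.add(split);
        }

        @Override public Optional<IcebergSplit> getNext(int subtaskId) {
            // Ignores the subtask id entirely: any subtask gets the next split.
            return Optional.ofNullable(pending.poll());
        }
    }

    public static void main(String[] args) {
        SplitAssigner assigner = new SimpleAssigner();
        assigner.addSplit(new IcebergSplit("s1"));
        assigner.addSplit(new IcebergSplit("s2"));
        System.out.println(assigner.getNext(0).orElseThrow().splitId()); // s1
        System.out.println(assigner.getNext(1).orElseThrow().splitId()); // s2
        System.out.println(assigner.getNext(0).isPresent());             // false
    }
}
```

Keeping the assigner behind a small interface like this is what makes the
"simple now, smarter later" plan in the scope list workable.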



Plans for the future iceberg 0.11.0 release

2020-10-28 Thread OpenInx
Hi dev,

We expect to cut the Iceberg 0.10.0 release candidate this week, so I think
it's time to start planning the future Iceberg 0.11.0. I've created a Java
0.11.0 release milestone here [1].

I put the following issues into the newly created milestone:

1. Apache Flink rewrite actions in Apache Iceberg.

Running the Iceberg Flink sink in real production can create too many small
files because of frequent checkpoints. We have two approaches to handle the
small files:

a. Following the design of the current Spark rewrite actions, Flink will
provide similar rewrite actions that run as a batch job. This is suitable for
periodically compacting a whole table or whole partitions, because this kind
of rewrite compacts many large files and may consume lots of bandwidth.
JunZheng and I are currently working on this issue, and we've extracted the
base rewrite actions shared between the Spark and Flink modules. The next step
is to implement the rewrite actions in the Flink module.

b. Compact small files inside the Flink streaming job when sinking into
Iceberg tables. That means we will provide a new rewrite operator chained
after the current IcebergFilesCommitter. Once an Iceberg transaction has been
committed, the newly introduced rewrite operator will check whether a small
compaction is needed. These actions choose only a few tiny files (maybe
several KB or MB; I think we could provide a configurable threshold) to
rewrite, which keeps the cost minimal and the compaction efficient. Currently,
simonsssu from Tencent has provided a WIP PR here [2].
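The threshold-based selection described above could be sketched roughly like
this. Everything here (DataFileInfo, the parameter names) is hypothetical
illustration, not the Iceberg API or the WIP PR's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of what the post-commit rewrite operator might do: pick only files
// below a configurable size threshold, and skip compaction entirely unless
// enough small files have accumulated to make it worthwhile.
public class SmallFileSelector {

    // Stand-in for the metadata of a committed data file.
    record DataFileInfo(String path, long sizeInBytes) {}

    // Returns the files worth compacting, or an empty list when a compaction
    // would not pay off yet.
    static List<DataFileInfo> selectSmallFiles(List<DataFileInfo> committed,
                                               long smallFileThresholdBytes,
                                               int minInputFiles) {
        List<DataFileInfo> small = committed.stream()
                .filter(f -> f.sizeInBytes() < smallFileThresholdBytes)
                .collect(Collectors.toList());
        return small.size() >= minInputFiles ? small : List.of();
    }

    public static void main(String[] args) {
        List<DataFileInfo> files = List.of(
                new DataFileInfo("a.parquet", 10_000),              // 10 KB
                new DataFileInfo("b.parquet", 50_000),              // 50 KB
                new DataFileInfo("c.parquet", 512 * 1024 * 1024L)); // 512 MB
        // 1 MB threshold; need at least 2 small files to bother compacting.
        List<DataFileInfo> picked = selectSmallFiles(files, 1024 * 1024, 2);
        System.out.println(picked.size()); // 2: the large file is left alone
    }
}
```

Because only a handful of tiny files are rewritten per transaction, the cost
stays bounded even with frequent checkpoints.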


2. Allow writing CDC or UPSERT records from Flink streaming jobs.

We've almost implemented the row-level delete feature on the Iceberg master
branch, but we still lack the integration with compute engines (to be precise,
Spark/Flink can read the expected records if someone has written the deletes
correctly, but the write path is not available). I am preparing a patch for
sinking CDC into Iceberg via a Flink streaming job here [3]; I think it will
be ready in the next few weeks.
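For readers unfamiliar with the semantics the CDC sink has to provide, here is
a small model of folding a change stream into table state by primary key. This
only models the semantics; the real write path would turn deletes into Iceberg
delete files rather than mutate a map, and all names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Model of CDC/UPSERT semantics: a stream of insert/update/delete changes
// keyed by primary key, folded into the latest visible table state.
public class CdcUpsertModel {

    enum RowKind { INSERT, UPDATE_AFTER, DELETE }

    record Change(RowKind kind, String key, String value) {}

    static Map<String, String> apply(Iterable<Change> changes) {
        Map<String, String> state = new LinkedHashMap<>();
        for (Change c : changes) {
            switch (c.kind()) {
                // An insert or update-after is an upsert on the key.
                case INSERT, UPDATE_AFTER -> state.put(c.key(), c.value());
                // A delete removes the row; in Iceberg this would become a
                // row-level delete file rather than an in-place mutation.
                case DELETE -> state.remove(c.key());
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> state = apply(List.of(
                new Change(RowKind.INSERT, "k1", "v1"),
                new Change(RowKind.UPDATE_AFTER, "k1", "v2"),
                new Change(RowKind.INSERT, "k2", "v1"),
                new Change(RowKind.DELETE, "k2", null)));
        System.out.println(state); // {k1=v2}
    }
}
```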

3. Apache Flink streaming reader.

We've prepared a POC version in our internal Alibaba branch, but have not yet
contributed it to Apache Iceberg. I think it's worth completing that in the
coming days.
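The core loop a streaming reader needs is incremental planning: remember the
last consumed snapshot and, on each poll, plan only the files appended by
newer snapshots. A minimal sketch, with Snapshot as a hypothetical stand-in
for the table's snapshot history rather than the Iceberg API:

```java
import java.util.List;

// Sketch of incremental scan planning for a streaming reader.
public class IncrementalScanSketch {

    // Stand-in for a table snapshot and the data files it appended.
    record Snapshot(long id, List<String> addedFiles) {}

    // Files added strictly after lastConsumedId, in snapshot order.
    static List<String> planIncremental(List<Snapshot> history, long lastConsumedId) {
        return history.stream()
                .filter(s -> s.id() > lastConsumedId)
                .flatMap(s -> s.addedFiles().stream())
                .toList();
    }

    public static void main(String[] args) {
        List<Snapshot> history = List.of(
                new Snapshot(1, List.of("f1")),
                new Snapshot(2, List.of("f2", "f3")),
                new Snapshot(3, List.of("f4")));
        System.out.println(planIncremental(history, 1)); // [f2, f3, f4]
        System.out.println(planIncremental(history, 3)); // [] (caught up)
    }
}
```

The reader would persist the last consumed snapshot id in Flink checkpoint
state so a restart resumes from the right position.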


The above are the issues that I think are worth merging before Iceberg 0.11.0.
But I'm not quite sure what the plan is for these items:

1. I know @Anton Okolnychyi is working on Spark SQL extensions for Iceberg; I
guess there's a high probability of getting that in? [4]

2. @Steven Wu from Netflix is working on a Flink source based on the new
FLIP-27 interface; any thoughts? [5]

3. How about the Spark row-delete integration work?



[1] https://github.com/apache/iceberg/milestone/12
[2] https://github.com/apache/iceberg/pull/1669/files
[3] https://github.com/apache/iceberg/pull/1663
[4] https://github.com/apache/iceberg/milestone/11
[5] https://github.com/apache/iceberg/issues/1626