Re: Plans for the future iceberg 0.11.0 release
I agree with OpenInx that the FLIP-27 Flink source is unlikely to make the November release schedule, so we should postpone it to 0.12.0.

On Mon, Nov 2, 2020 at 5:23 PM OpenInx wrote:

> Hi Ryan,
>
> Got your plan! If we tentatively plan to release 0.11.0 in November, then for Flink I think we could finish the rewrite actions and the Flink streaming reader first.
>
> The Flink CDC integration work and FLIP-27 would need more work, so it's good not to block the 0.11.0 release on them. But we could still make good progress in a separate PR.
>
> Thanks.
Re: Plans for the future iceberg 0.11.0 release
Hi Ryan,

Got your plan! If we tentatively plan to release 0.11.0 in November, then for Flink I think we could finish the rewrite actions and the Flink streaming reader first.

The Flink CDC integration work and FLIP-27 would need more work, so it's good not to block the 0.11.0 release on them. But we could still make good progress in a separate PR.

Thanks.

On Tue, Nov 3, 2020 at 7:01 AM Ryan Blue wrote:

> Thanks for starting the 0.11.0 milestone! In the last sync, we talked about having a release in November to make the new catalogs and possibly S3FileIO available, so those should tentatively go on the 0.11.0 list as well. I say tentatively because I'm in favor of releasing when features are ready and trying not to block at this stage in the project.
>
> In addition, I think we can make some progress on Hive integration. There is a PR to create tables using Hive DDL without needing to pass a JSON-serialized schema that would be good to get in, and I think it would be good to get the basic write path committed as well.
Re: Plans for the future iceberg 0.11.0 release
Thanks for starting the 0.11.0 milestone! In the last sync, we talked about having a release in November to make the new catalogs and possibly S3FileIO available, so those should tentatively go on the 0.11.0 list as well. I say tentatively because I'm in favor of releasing when features are ready and trying not to block at this stage in the project.

In addition, I think we can make some progress on Hive integration. There is a PR to create tables using Hive DDL without needing to pass a JSON-serialized schema that would be good to get in, and I think it would be good to get the basic write path committed as well.

On Sun, Nov 1, 2020 at 5:57 PM OpenInx wrote:

> Thanks for the context on FLIP-27, Steven!
>
> I will take a look at the patches under issue 1626.
Re: Plans for the future iceberg 0.11.0 release
Thanks for the context on FLIP-27, Steven!

I will take a look at the patches under issue 1626.

On Sat, Oct 31, 2020 at 2:03 AM Steven Wu wrote:

> OpenInx, thanks a lot for kicking off the discussion. Looks like my previous reply didn't reach the mailing list.
>
> > flink source based on the new FLIP-27 interface
>
> Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source. I have updated the issue [1] with the following scope:
>
>    - Support both static/batch and continuous/streaming enumeration modes.
>    - Support only the simple assigner, with no ordering/locality guarantee when handing out split assignments, but make the interface flexible enough to plug in different assigners (like an event-time alignment assigner or a locality-aware assigner).
>    - It will have @Experimental status, as nobody has run FLIP-27 sources in production today. The Flink 1.12.0 release (ETA end of Nov) will have the first set of sources (Kafka and file) implemented with the FLIP-27 source framework. We still need to gain more production experience.
>
> [1] https://github.com/apache/iceberg/issues/1626
Re: Plans for the future iceberg 0.11.0 release
OpenInx, thanks a lot for kicking off the discussion. Looks like my previous reply didn't reach the mailing list.

> flink source based on the new FLIP-27 interface

Yes, we shall target the 0.11.0 release for the FLIP-27 Flink source. I have updated the issue [1] with the following scope:

   - Support both static/batch and continuous/streaming enumeration modes.
   - Support only the simple assigner, with no ordering/locality guarantee when handing out split assignments, but make the interface flexible enough to plug in different assigners (like an event-time alignment assigner or a locality-aware assigner).
   - It will have @Experimental status, as nobody has run FLIP-27 sources in production today. The Flink 1.12.0 release (ETA end of Nov) will have the first set of sources (Kafka and file) implemented with the FLIP-27 source framework. We still need to gain more production experience.

[1] https://github.com/apache/iceberg/issues/1626

On Wed, Oct 28, 2020 at 12:15 AM OpenInx wrote:

> Hi dev,
>
> As we know, we will be happy to cut the Iceberg 0.10.0 release candidate this week. I think now may be the time to plan for the future Iceberg 0.11.0 release, so I created a Java 0.11.0 Release milestone here [1].
>
> I put the following issues into the newly created milestone:
>
> 1. Apache Flink rewrite actions in Apache Iceberg.
>
> It's possible that we run into the too-many-small-files problem when running the Iceberg Flink sink in production, because of frequent checkpoints. We have two approaches to handle the small files:
>
> a. Following the design of the current Spark rewrite actions, Flink will provide similar rewrite actions that run in a batch job. This is suitable for periodically triggering compaction of a whole table or whole partitions, because this kind of rewrite compacts many large files and may consume a lot of bandwidth. Currently, JunZheng and I are working on this issue, and we've extracted the base rewrite actions shared between the Spark module and the Flink module. The next step would be implementing the rewrite actions in the Flink module.
>
> b. Compact the small files in the Flink streaming job while sinking into Iceberg tables. That means we will provide a new rewrite operator chained to the current IcebergFilesCommitter. Once an Iceberg transaction has been committed, the newly introduced rewrite operator will check whether a small compaction is needed. These actions choose only a few tiny files (maybe several KB or MB; I think we could provide a configurable threshold) to rewrite, which can be achieved at minimal cost and with higher compaction efficiency. Currently, simonsssu from Tencent has provided a WIP PR here [2].
>
> 2. Allow writing CDC or UPSERT records from Flink streaming jobs.
>
> We've almost implemented the row-level delete feature in the Iceberg master branch, but we still lack the integration with compute engines (to be precise, Spark/Flink can read the expected records if someone has deleted the rows correctly, but the write path is not available). I am preparing the patch for sinking CDC into Iceberg from a Flink streaming job here [3]; I think it will be ready in the next few weeks.
>
> 3. Apache Flink streaming reader.
>
> We've prepared a POC version in our Alibaba internal branch, but have not yet contributed it to Apache Iceberg. I think it's worth accomplishing that in the following days.
>
> The above are the issues that I think are worth merging before Iceberg 0.11.0. But I'm not quite sure what the plan is for the following:
>
> 1. I know @Anton Okolnychyi is working on spark-sql extensions for Iceberg. I guess there's a high probability of getting that in? [4]
>
> 2. @Steven Wu from Netflix is working on a Flink source based on the new FLIP-27 interface. Thoughts? [5]
>
> 3. How about the Spark row-delete integration work?
>
> [1]. https://github.com/apache/iceberg/milestone/12
> [2]. https://github.com/apache/iceberg/pull/1669/files
> [3]. https://github.com/apache/iceberg/pull/1663
> [4]. https://github.com/apache/iceberg/milestone/11
> [5]. https://github.com/apache/iceberg/issues/1626
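
For readers following the FLIP-27 scope in this message, the "simple assigner behind a pluggable interface" idea can be sketched roughly as below. This is a language-neutral illustration written in Python, not the Flink or Iceberg API: the names `SplitAssigner`, `SimpleSplitAssigner`, `add_splits`, and `next_split` are hypothetical, and splits are represented as plain strings.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, Iterable, Optional


class SplitAssigner(ABC):
    """Hypothetical plug-in point: decides which pending split a reader gets next."""

    @abstractmethod
    def add_splits(self, splits: Iterable[str]) -> None:
        """Register newly discovered splits (batch or continuous enumeration)."""

    @abstractmethod
    def next_split(self, reader_host: Optional[str] = None) -> Optional[str]:
        """Return a split for the requesting reader, or None if nothing is pending."""


class SimpleSplitAssigner(SplitAssigner):
    """Hands out whatever split is pending.

    It ignores reader_host and promises no ordering or locality; callers
    must not rely on the order in which splits come back.
    """

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def add_splits(self, splits: Iterable[str]) -> None:
        self._pending.extend(splits)

    def next_split(self, reader_host: Optional[str] = None) -> Optional[str]:
        return self._pending.popleft() if self._pending else None
```

An event-time alignment or locality-aware assigner would be another subclass of the same interface, for example one that uses `reader_host` to prefer splits stored near the requesting reader.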
Plans for the future iceberg 0.11.0 release
Hi dev,

As we know, we will be happy to cut the Iceberg 0.10.0 release candidate this week. I think now may be the time to plan for the future Iceberg 0.11.0 release, so I created a Java 0.11.0 Release milestone here [1].

I put the following issues into the newly created milestone:

1. Apache Flink rewrite actions in Apache Iceberg.

It's possible that we run into the too-many-small-files problem when running the Iceberg Flink sink in production, because of frequent checkpoints. We have two approaches to handle the small files:

a. Following the design of the current Spark rewrite actions, Flink will provide similar rewrite actions that run in a batch job. This is suitable for periodically triggering compaction of a whole table or whole partitions, because this kind of rewrite compacts many large files and may consume a lot of bandwidth. Currently, JunZheng and I are working on this issue, and we've extracted the base rewrite actions shared between the Spark module and the Flink module. The next step would be implementing the rewrite actions in the Flink module.

b. Compact the small files in the Flink streaming job while sinking into Iceberg tables. That means we will provide a new rewrite operator chained to the current IcebergFilesCommitter. Once an Iceberg transaction has been committed, the newly introduced rewrite operator will check whether a small compaction is needed. These actions choose only a few tiny files (maybe several KB or MB; I think we could provide a configurable threshold) to rewrite, which can be achieved at minimal cost and with higher compaction efficiency. Currently, simonsssu from Tencent has provided a WIP PR here [2].

2. Allow writing CDC or UPSERT records from Flink streaming jobs.

We've almost implemented the row-level delete feature in the Iceberg master branch, but we still lack the integration with compute engines (to be precise, Spark/Flink can read the expected records if someone has deleted the rows correctly, but the write path is not available). I am preparing the patch for sinking CDC into Iceberg from a Flink streaming job here [3]; I think it will be ready in the next few weeks.

3. Apache Flink streaming reader.

We've prepared a POC version in our Alibaba internal branch, but have not yet contributed it to Apache Iceberg. I think it's worth accomplishing that in the following days.

The above are the issues that I think are worth merging before Iceberg 0.11.0. But I'm not quite sure what the plan is for the following:

1. I know @Anton Okolnychyi is working on spark-sql extensions for Iceberg. I guess there's a high probability of getting that in? [4]

2. @Steven Wu from Netflix is working on a Flink source based on the new FLIP-27 interface. Thoughts? [5]

3. How about the Spark row-delete integration work?

[1]. https://github.com/apache/iceberg/milestone/12
[2]. https://github.com/apache/iceberg/pull/1669/files
[3]. https://github.com/apache/iceberg/pull/1663
[4]. https://github.com/apache/iceberg/milestone/11
[5]. https://github.com/apache/iceberg/issues/1626
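
To make option 1(b) in this message a bit more concrete, here is a minimal sketch of the check a post-commit rewrite operator might perform: pick only the files below a configurable size threshold, and skip the compaction when too few qualify. It is plain Python for illustration, not code from the WIP PR [2]; `DataFile`, `select_small_files`, `threshold_bytes`, and `min_files` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a committed data file's metadata."""
    path: str
    size_bytes: int


def select_small_files(
    files: Sequence[DataFile],
    threshold_bytes: int,
    min_files: int = 2,
) -> List[DataFile]:
    """Return the files worth compacting after a commit.

    Only files strictly smaller than threshold_bytes are candidates.
    If fewer than min_files qualify, return an empty list, since
    rewriting a single small file would not reduce the file count.
    """
    small = [f for f in files if f.size_bytes < threshold_bytes]
    return small if len(small) >= min_files else []
```

With an 8 MB threshold, for example, a 10 KB file and a 2 MB file would be selected together, while a 512 MB file in the same commit would be left alone.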
