[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files
[ https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8210: - Status: In Progress (was: Open) > [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log > files > -- > > Key: HUDI-8210 > URL: https://issues.apache.org/jira/browse/HUDI-8210 > Project: Apache Hudi > Issue Type: New Feature > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8211) Map table versions to log block and log format versions
[ https://issues.apache.org/jira/browse/HUDI-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8211: - Status: In Progress (was: Open) > Map table versions to log block and log format versions > --- > > Key: HUDI-8211 > URL: https://issues.apache.org/jira/browse/HUDI-8211 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8211) Map table versions to log block and log format versions
[ https://issues.apache.org/jira/browse/HUDI-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8211: - Fix Version/s: 1.0.0 > Map table versions to log block and log format versions > --- > > Key: HUDI-8211 > URL: https://issues.apache.org/jira/browse/HUDI-8211 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
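The mapping HUDI-8211 calls for could take a shape like the following sketch. The class name, method names, and version constants are all hypothetical illustrations, not the actual Hudi implementation:

```java
// Hypothetical sketch of HUDI-8211: resolve the log format and log block
// versions a writer should emit from the table version it is writing.
// The specific version numbers below are illustrative, not from Hudi.
public class LogFormatVersions {

    /** Log block version to write for a given hoodie.table.version. */
    public static int logBlockVersionFor(int tableVersion) {
        // table version 6 (0.14.x) and earlier keep the pre-1.0 block layout;
        // table versions 7/8 (1.0) move to a newer layout
        return tableVersion <= 6 ? 2 : 3;
    }

    /** Log format version to write for a given hoodie.table.version. */
    public static int logFormatVersionFor(int tableVersion) {
        return tableVersion <= 6 ? 1 : 2;
    }
}
```

With such a mapping in place, the log writer only needs the table version from table properties to pick compatible block and format versions.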
[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files
[ https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8210: - Sprint: Hudi 1.0 Sprint 2024/09/16-22 > [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log > files > -- > > Key: HUDI-8210 > URL: https://issues.apache.org/jira/browse/HUDI-8210 > Project: Apache Hudi > Issue Type: New Feature > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8211) Map table versions to log block and log format versions
Vinoth Chandar created HUDI-8211: Summary: Map table versions to log block and log format versions Key: HUDI-8211 URL: https://issues.apache.org/jira/browse/HUDI-8211 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar Assignee: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files
[ https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8210: - Story Points: 8 > [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log > files > -- > > Key: HUDI-8210 > URL: https://issues.apache.org/jira/browse/HUDI-8210 > Project: Apache Hudi > Issue Type: New Feature > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files
Vinoth Chandar created HUDI-8210: Summary: [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files Key: HUDI-8210 URL: https://issues.apache.org/jira/browse/HUDI-8210 Project: Apache Hudi Issue Type: New Feature Components: storage-management Reporter: Vinoth Chandar Assignee: Vinoth Chandar Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7227) Enable completion time for File Group Reader
[ https://issues.apache.org/jira/browse/HUDI-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7227: Assignee: Ethan Guo (was: Vinoth Chandar) > Enable completion time for File Group Reader > > > Key: HUDI-7227 > URL: https://issues.apache.org/jira/browse/HUDI-7227 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > For all query types, we should enable completion time semantics safely: > # Snapshot/RO (MOR, COW) > # Incremental (MOR, COW) > # TimeTravel (MOR, COW) > # CDC (MOR, COW) > # Bootstrap -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-8129) Bring back table version 6 archived timeline
[ https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-8129: Assignee: Balaji Varadarajan > Bring back table version 6 archived timeline > > > Key: HUDI-8129 > URL: https://issues.apache.org/jira/browse/HUDI-8129 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-8128) Bring back table version 6 active timeline
[ https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-8128: Assignee: Balaji Varadarajan > Bring back table version 6 active timeline > -- > > Key: HUDI-8128 > URL: https://issues.apache.org/jira/browse/HUDI-8128 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Balaji Varadarajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-8127) Introduce active timeline implementation based on timeline layout version
[ https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-8127: Assignee: Balaji Varadarajan > Introduce active timeline implementation based on timeline layout version > - > > Key: HUDI-8127 > URL: https://issues.apache.org/jira/browse/HUDI-8127 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Balaji Varadarajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Description: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce the storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as the writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling-upgraded easily, at a different cadence, without the need for any tight coordination. Additionally, the reader should have the ability to "dynamically" deduce the table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, which should be able to manage the tables as well. Queries should gracefully fail during table version switches and eventually start succeeding when the writer completes switching. Writers/table services should fail if working with an unsupported table version, without which one cannot start switching writers to the new version (this may still need a minor release on the last 2-3 table versions?) h3. High level approach: We need to introduce table-version-aware reading/writing inside the core layers of Hudi, as well as query engines like Spark/Flink. To this effect: We need a HoodieStorageFormat abstraction that can cover the following layers. *Table Properties* All table properties need to be table version aware. *Timeline* Timeline already has a timeline layout version, which can be extended to write older and newer timelines. The ArchivedTimeline can be old style or LSM style, depending on the table version. 
Typically, we have only written the timeline with the latest schema; we may have to version the .avsc files themselves and write using the table-version-specific Avro class? Also we need a flexible mapping from action names used in each table version, to handle things like replacecommit to cluster. {*}TBD{*}: whether or not completion-time-based changes can be retained, assuming instant file creation timestamp. h4. *FileSystemView* Need to support file slice grouping based on the old/new timeline + file naming combination. Also need abstractions for how files are named in each table version, to handle things like log file name changes. h4. *WriteHandle* This layer may or may not need changes, as base files don't really change, and the log format is already versioned (see next point). However, it's prudent to ensure we have a mechanism for this, since there could be different ways of encoding records or footers etc. (e.g., HFile k-v pairs, ...) h4. Metadata table Encoding of k-v pairs, their schemas, etc.; which partitions are supported in what table versions. *TBD* to see if any of the code around the recent simplification needs to be undone. h4. LogFormat Reader/Writer The blocks and the format itself are versioned, but not yet tied to the table version overall. So we need these links so that the reader can, e.g., decide how to read/assemble log file scanning? FileGroupReader abstractions may need to change as well. h4. Table Service This should largely be independent and built on the layers above. But some cases exist, depending on whether the code relies on format specifics for generating plans or execution. This may be a mix of code changes, acceptable behavior changes and format-specific specializations. TBD: to see if behavior like writing rollback block needs to be was: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). 
This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer com
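The HoodieStorageFormat abstraction described above might be sketched as follows. The method names and layout-version values are invented for illustration and do not reflect actual Hudi APIs:

```java
// Hypothetical sketch of a per-table-version storage format abstraction:
// one implementation per supported table version, covering things like the
// timeline layout. All names and values here are illustrative only.
interface HoodieStorageFormat {
    int tableVersion();

    /** Timeline layout version the active/archived timeline should use. */
    int timelineLayoutVersion();
}

class StorageFormats {
    /** Resolve a format from hoodie.table.version (e.g. read from table properties). */
    static HoodieStorageFormat forTableVersion(int tableVersion) {
        return new HoodieStorageFormat() {
            @Override public int tableVersion() { return tableVersion; }
            @Override public int timelineLayoutVersion() {
                // illustrative: 1.0 tables (version 8) use a newer timeline layout
                return tableVersion >= 8 ? 2 : 1;
            }
        };
    }
}
```

The point of the design is that readers and writers resolve a format once, at table-open time, and the rest of the code paths stay version-agnostic.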
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Status: In Progress (was: Open) > RFC for backwards compatible writer mode in Hudi 1.0 > > > Key: HUDI-8076 > URL: https://issues.apache.org/jira/browse/HUDI-8076 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > *(!) Work in Progress.* > h3. Basic Idea: > Introduce support for Hudi writer code to produce the storage format for the last > 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue > reading the table even when writers are upgraded, as long as the writer produces > storage compatible with the latest table version the reader can read. The > readers can then be rolling-upgraded easily, at a different cadence, without > the need for any tight coordination. Additionally, the reader should have the > ability to "dynamically" deduce the table version based on table properties, such > that when the writer is switched to the latest table version, subsequent > reads will just adapt and read it as the latest table version. > Operators still need to ensure all readers have the latest binary that > supports a given table version, before switching the writer to that version. > Special consideration to table services, as reader/writer processes, which > should be able to manage the tables as well. Queries should gracefully fail > during table version switches and eventually start succeeding when the writer > completes switching. Writers/table services should fail if working with an > unsupported table version, without which one cannot start switching writers > to the new version (this may still need a minor release on the last 2-3 table > versions?) > h3. High level approach: > We need to introduce table-version-aware reading/writing inside the core > layers of Hudi, as well as query engines like Spark/Flink. 
> To this effect: We need a HoodieStorageFormat abstraction that can cover the > following layers. > *Table Properties* > All table properties need to be table version aware. > *Timeline* > Timeline already has a timeline layout version, which can be extended to > write older and newer timelines. The ArchivedTimeline can be old style or LSM > style, depending on the table version. Typically, we have only written the timeline > with the latest schema; we may have to version the .avsc files themselves and > write using the table-version-specific Avro class? Also we need a > flexible mapping from action names used in each table version, to handle > things like replacecommit to cluster. > {*}TBD{*}: whether or not completion-time-based changes can be retained, > assuming instant file creation timestamp. > h4. *FileSystemView* > Need to support file slice grouping based on the old/new timeline + file naming > combination. Also need abstractions for how files are named in each table > version, to handle things like log file name changes. > h4. *WriteHandle* > This layer may or may not need changes, as base files don't really change, > and the log format is already versioned (see next point). However, it's prudent > to ensure we have a mechanism for this, since there could be different ways > of encoding records or footers etc. (e.g., HFile k-v pairs, ...) > h4. Metadata table > Encoding of k-v pairs, their schemas, etc.; which partitions are supported in > what table versions. *TBD* to see if any of the code around the recent > simplification needs to be undone. > h4. LogFormat Reader/Writer > The blocks and the format itself are versioned, but not yet tied to the table > version overall. So we need these links so that the reader can, e.g., decide > how to read/assemble log file scanning? FileGroupReader abstractions may need > to change as well. > h4. Table Service > This should largely be independent and built on the layers above. 
But some cases > exist, depending on whether the code relies on format specifics for > generating plans or execution. This may be a mix of code changes, acceptable > behavior changes and format-specific specializations. > TBD: to see if behavior like writing rollback block needs to be -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format
[ https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8132: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Implement support for writer to produce table version specific log > blocks/format > > > Key: HUDI-8132 > URL: https://issues.apache.org/jira/browse/HUDI-8132 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format
[ https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8132: - Epic Link: HUDI-7856 > Implement support for writer to produce table version specific log > blocks/format > > > Key: HUDI-8132 > URL: https://issues.apache.org/jira/browse/HUDI-8132 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions
[ https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8131: - Fix Version/s: 1.0.0 > Associate log format and block versions to table versions > - > > Key: HUDI-8131 > URL: https://issues.apache.org/jira/browse/HUDI-8131 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions
[ https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8131: - Epic Link: HUDI-7856 > Associate log format and block versions to table versions > - > > Key: HUDI-8131 > URL: https://issues.apache.org/jira/browse/HUDI-8131 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions
[ https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8131: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Associate log format and block versions to table versions > - > > Key: HUDI-8131 > URL: https://issues.apache.org/jira/browse/HUDI-8131 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation
[ https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8130: - Fix Version/s: 1.0.0 > Introduce table version specific APIs for file name generation > -- > > Key: HUDI-8130 > URL: https://issues.apache.org/jira/browse/HUDI-8130 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation
[ https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8130: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Introduce table version specific APIs for file name generation > -- > > Key: HUDI-8130 > URL: https://issues.apache.org/jira/browse/HUDI-8130 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation
[ https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8130: - Epic Link: HUDI-7856 > Introduce table version specific APIs for file name generation > -- > > Key: HUDI-8130 > URL: https://issues.apache.org/jira/browse/HUDI-8130 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
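A table-version-specific file naming API like the one HUDI-8130 proposes could be sketched as below. The interface, class names, and the exact name pattern are illustrative assumptions, not the precise Hudi formats:

```java
// Hypothetical sketch of HUDI-8130: a per-table-version file name generator,
// so callers never hardcode a naming scheme. The pattern below loosely mimics
// 0.x-style dot-prefixed log names but is illustrative, not the exact format.
interface FileNameGenerator {
    String logFileName(String fileId, String baseInstantTime, int logVersion, String writeToken);
}

/** Illustrative generator for table version 6 (0.x-style log file names). */
class TableVersionSixNames implements FileNameGenerator {
    @Override
    public String logFileName(String fileId, String baseInstantTime, int logVersion, String writeToken) {
        // e.g. ".<fileId>_<baseInstant>.log.<version>_<writeToken>" (assumed shape)
        return "." + fileId + "_" + baseInstantTime + ".log." + logVersion + "_" + writeToken;
    }
}
```

A 1.0-style implementation would sit behind the same interface, letting the FileSystemView group file slices without caring which table version produced the names.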
[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format
[ https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8132: - Fix Version/s: 1.0.0 > Implement support for writer to produce table version specific log > blocks/format > > > Key: HUDI-8132 > URL: https://issues.apache.org/jira/browse/HUDI-8132 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8127) Introduce active timeline implementation based on timeline layout version
[ https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8127: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Introduce active timeline implementation based on timeline layout version > - > > Key: HUDI-8127 > URL: https://issues.apache.org/jira/browse/HUDI-8127 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8128) Bring back table version 6 active timeline
[ https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8128: - Epic Link: HUDI-7856 > Bring back table version 6 active timeline > -- > > Key: HUDI-8128 > URL: https://issues.apache.org/jira/browse/HUDI-8128 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline
[ https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8129: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Bring back table version 6 archived timeline > > > Key: HUDI-8129 > URL: https://issues.apache.org/jira/browse/HUDI-8129 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8127) Introduce active timeline implementation based on timeline layout version
[ https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8127: - Epic Link: HUDI-7856 > Introduce active timeline implementation based on timeline layout version > - > > Key: HUDI-8127 > URL: https://issues.apache.org/jira/browse/HUDI-8127 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8128) Bring back table version 6 active timeline
[ https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8128: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Bring back table version 6 active timeline > -- > > Key: HUDI-8128 > URL: https://issues.apache.org/jira/browse/HUDI-8128 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline
[ https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8129: - Fix Version/s: 1.0.0 > Bring back table version 6 archived timeline > > > Key: HUDI-8129 > URL: https://issues.apache.org/jira/browse/HUDI-8129 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline
[ https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8129: - Epic Link: HUDI-7856 > Bring back table version 6 archived timeline > > > Key: HUDI-8129 > URL: https://issues.apache.org/jira/browse/HUDI-8129 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs
[ https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8103: - Parent: (was: HUDI-8076) Issue Type: Improvement (was: Sub-task) > Introduce table version specific grouping of table configs > -- > > Key: HUDI-8103 > URL: https://issues.apache.org/jira/browse/HUDI-8103 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > Based on the value of hoodie.table.version, we need to write a different set > of default table configs, and potentially different default values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
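The behavior described in HUDI-8103 (different default table configs keyed off hoodie.table.version) could look roughly like this sketch; apart from hoodie.table.version itself, the config key and values are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of HUDI-8103: pick default table configs based on
// hoodie.table.version. The second key and its values are illustrative,
// not actual Hudi config names.
class TableConfigDefaults {
    static Map<String, String> forTableVersion(int tableVersion) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("hoodie.table.version", String.valueOf(tableVersion));
        // illustrative: a newer table version flips a feature default on
        defaults.put("hoodie.example.new.format.enabled",
                     tableVersion >= 8 ? "true" : "false");
        return defaults;
    }
}
```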
[jira] [Assigned] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader
[ https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7888: Assignee: (was: Jonathan Vexler) > Throw meaningful error when reading partial update or DV written in 1.x from > 0.16.0 reader > -- > > Key: HUDI-7888 > URL: https://issues.apache.org/jira/browse/HUDI-7888 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core >Reporter: sivabalan narayanan >Priority: Major > Fix For: 1.0.0 > > > If a 0.16.x reader is used to read a 1.x table with partial updates/merges > enabled, we need to throw a meaningful error to the end user. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7971) Test and Certify 0.14.x tables are readable in 1.x Hudi reader
[ https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7971: Assignee: (was: Lokesh Jain) > Test and Certify 0.14.x tables are readable in 1.x Hudi reader > --- > > Key: HUDI-7971 > URL: https://issues.apache.org/jira/browse/HUDI-7971 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Priority: Major > Fix For: 1.0.0 > > > Lets ensure 1.x reader is fully compatible w/ reading any of 0.14.x to 0.16.x > tables > > Readers : 1.x > # Spark SQL > # Spark Datasource > # Trino/Presto > # Hive > # Flink > Writer: 0.16 > Table State: > * COW > ** few write commits > ** Pending clustering > ** Completed Clustering > ** Failed writes with no rollbacks > ** Insert overwrite table/partition > ** Savepoint for Time-travel query > * MOR > ** Same as COW > ** Pending and completed async compaction (with log-files and no base file) > ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload) > ** Log block formats - DELETE, rollback block > Other knobs: > # Metadata enabled/disabled (all combinations) > # Column Stats enabled/disabled and data-skipping enabled/disabled > # RLI enabled with eq/IN queries > # Non-Partitioned dataset (all combinations) > # CDC Reads > # Incremental Reads > # Time-travel query > > What to test ? > # Query Results Correctness > # Performance : See the benefit of > # Partition Pruning > # Metadata table - col stats, RLI, > > Corner Case Testing: > > # Schema Evolution with different file-groups having different generation of > schema > # Dynamic Partition Pruning > # Does Column Projection work correctly for log files reading -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs
[ https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8103: - Epic Link: HUDI-7856 > Introduce table version specific grouping of table configs > -- > > Key: HUDI-8103 > URL: https://issues.apache.org/jira/browse/HUDI-8103 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > based on the value of hoodie.table.version, we need to write different set of > default table configs, and potentially different default values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
[ https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7972: Assignee: (was: sivabalan narayanan) > Add fallback for deletion vector in 0.16.x reader while reading 1.x tables > -- > > Key: HUDI-7972 > URL: https://issues.apache.org/jira/browse/HUDI-7972 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core >Reporter: sivabalan narayanan >Priority: Major > Labels: 1.0-migration > Fix For: 1.0.0 > > > If 0.16.x reader is used to read a 1.x table with deletion vector, we should > fallback to using key based merges instead of position based merges. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
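The fallback described in HUDI-7972 amounts to a merge-mode decision: a reader that cannot honor 1.x deletion vectors (position-based deletes) must merge log records against base records by record key instead. A minimal sketch, with invented names and an assumed version cutoff, could look like:

```java
// Hypothetical sketch of the deletion-vector fallback: if the reader's table
// version predates position-based merging, always use key-based merges, even
// when the log blocks carry record positions. Names are illustrative only.
public class MergeStrategySelector {
    public enum MergeMode { POSITION_BASED, KEY_BASED }

    public static MergeMode select(int readerTableVersion, boolean logHasPositions) {
        // Assumption for the sketch: position-based merging exists from
        // table version 8 onward; older readers cannot apply it.
        if (readerTableVersion >= 8 && logHasPositions) {
            return MergeMode.POSITION_BASED;
        }
        return MergeMode.KEY_BASED;
    }
}
```

Key-based merging is slower but always correct, which is why it works as a compatibility fallback rather than an error.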
[jira] [Assigned] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x
[ https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7887: Assignee: (was: Jonathan Vexler) > Any log format header types changes need to be ported to 0.16.0 from 1.x > > > Key: HUDI-7887 > URL: https://issues.apache.org/jira/browse/HUDI-7887 > Project: Apache Hudi > Issue Type: Sub-task > Components: reader-core >Reporter: sivabalan narayanan >Priority: Major > > We wanted to support reading 1.x tables in 0.16.0 reader. > > Port any new log header metadata types introduced in 1.x to 0.16.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0
[ https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7886: Assignee: (was: Lokesh Jain) > Make metadata payload from 1.x readable in 0.16.0 > - > > Key: HUDI-7886 > URL: https://issues.apache.org/jira/browse/HUDI-7886 > Project: Apache Hudi > Issue Type: Sub-task > Components: metadata >Reporter: sivabalan narayanan >Priority: Major > > We wanted to support reading 1.x tables in 0.16.0 reader. > > So, lets port over all metadata payload schema changes to 0.16.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7882) RFC-78 for Bridge Release
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877561#comment-17877561 ] Vinoth Chandar commented on HUDI-7882: -- I think we are discontinuing this approach over https://issues.apache.org/jira/browse/HUDI-8076 > RFC-78 for Bridge Release > -- > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We have 4 major goals w/ this umbrella ticket. > a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for > all query types. > b. 0.16.x should be capable of reading 1.x tables for most features > c. Upgrade 0.16.x to 1.x > d. Downgrade 1.x to 0.16.0. > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. 
MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 6. Tests > 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 7. Doc changes > 7.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7882) RFC-78 for Bridge Release
[ https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7882: - Summary: RFC-78 for Bridge Release (was: Umbrella ticket for 1.x tables and 0.16.x compatibility) > RFC-78 for Bridge Release > -- > > Key: HUDI-7882 > URL: https://issues.apache.org/jira/browse/HUDI-7882 > Project: Apache Hudi > Issue Type: Improvement > Components: reader-core >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > We have 4 major goals w/ this umbrella ticket. > a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for > all query types. > b. 0.16.x should be capable of reading 1.x tables for most features > c. Upgrade 0.16.x to 1.x > d. Downgrade 1.x to 0.16.0. > > > We wanted to support reading 1.x tables in 0.16.0 release. So, creating this > umbrella ticket to track all of them. > > RFC in progress: [https://github.com/apache/hudi/pull/11514] > > Changes required to be ported: > 0. Creating 0.16.0 branch > 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. > > 1. Timeline > 1.a Hoodie instant parsing should be able to read 1.x instants. > https://issues.apache.org/jira/browse/HUDI-7883 Sagar. > 1.b Commit metadata parsing is able to handle both json and avro formats. > Scope might be non-trivial. https://issues.apache.org/jira/browse/HUDI-7866 > Siva. > 1.c HoodieDefaultTimeline able to read both timelines based on table version. > https://issues.apache.org/jira/browse/HUDI-7884 Siva. > 1.d Reading LSM timeline using 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7890 Siva. > 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901 > > 2. Table property changes > 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885 > https://issues.apache.org/jira/browse/HUDI-7865 LJ > > 3. 
MDT table changes > 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ > 3.b MDT payload schema changes. > https://issues.apache.org/jira/browse/HUDI-7886 LJ > > 4. Log format changes > 4.a All metadata header types porting > https://issues.apache.org/jira/browse/HUDI-7887 Jon > 4.b Meaningful error for incompatible features from 1.x > https://issues.apache.org/jira/browse/HUDI-7888 Jon > > 5. Log file slice or grouping detection compatibility > > 5. Tests > 5.a Tests to validate that 1.x tables can be read w/ 0.16.0 > https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. > > 6 Doc changes > 6.a Call out unsupported features in 0.16.0 reader when reading 1.x tables. > https://issues.apache.org/jira/browse/HUDI-7889 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format
Vinoth Chandar created HUDI-8132: Summary: Implement support for writer to produce table version specific log blocks/format Key: HUDI-8132 URL: https://issues.apache.org/jira/browse/HUDI-8132 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8131) Associate log format and block versions to table versions
Vinoth Chandar created HUDI-8131: Summary: Associate log format and block versions to table versions Key: HUDI-8131 URL: https://issues.apache.org/jira/browse/HUDI-8131 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
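The association HUDI-8131 asks for is essentially a lookup from table version to the log block/format version the writer should emit. The sketch below illustrates the shape of that mapping; the concrete version numbers are assumptions for illustration, not the authoritative mapping.

```java
// Illustrative mapping from hoodie.table.version to the log block version a
// writer should produce. The numbers here are placeholders, not Hudi's real
// format-version constants.
public class LogFormatVersionMap {
    public static int logBlockVersionFor(int tableVersion) {
        switch (tableVersion) {
            case 6:
                return 2;  // 0.14.x-era log block version (assumed)
            case 7:
            case 8:
                return 3;  // 1.0-era log block version (assumed)
            default:
                throw new IllegalArgumentException(
                    "Unsupported table version: " + tableVersion);
        }
    }
}
```

Failing fast on unsupported versions matches the umbrella goal that writers/table services should refuse to operate on table versions they do not understand.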
[jira] [Created] (HUDI-8130) Introduce table version specific APIs for file name generation
Vinoth Chandar created HUDI-8130: Summary: Introduce table version specific APIs for file name generation Key: HUDI-8130 URL: https://issues.apache.org/jira/browse/HUDI-8130 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
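HUDI-8130's version-specific file name generation can be sketched as below. The naming pattern and the rule that version 6 embeds one instant while version 8 embeds another are simplified assumptions for illustration; the real Hudi file-name schemes carry more components.

```java
// Sketch of table-version-aware log file naming. Assumption for illustration:
// older table versions embed one instant time in the log file name and newer
// versions embed a different one, so the generator must branch on version.
public class LogFileNameGenerator {
    public static String logFileName(int tableVersion, String fileId,
                                     String baseInstant, String deltaInstant,
                                     int logVersion, String writeToken) {
        // Pick which instant the name embeds based on table version (assumed rule).
        String instant = tableVersion >= 8 ? deltaInstant : baseInstant;
        // Simplified pattern: ".{fileId}_{instant}.log.{version}_{writeToken}"
        return "." + fileId + "_" + instant + ".log." + logVersion + "_" + writeToken;
    }
}
```

Putting this behind a version-keyed API lets FileSystemView group file slices correctly whichever naming scheme a file group was written under.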
[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline
[ https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8129: - Summary: Bring back table version 6 archived timeline (was: Bring back table version 6 active timeline) > Bring back table version 6 archived timeline > > > Key: HUDI-8129 > URL: https://issues.apache.org/jira/browse/HUDI-8129 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8129) Bring back table version 6 active timeline
Vinoth Chandar created HUDI-8129: Summary: Bring back table version 6 active timeline Key: HUDI-8129 URL: https://issues.apache.org/jira/browse/HUDI-8129 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8128) Bring back table version 6 active timeline
Vinoth Chandar created HUDI-8128: Summary: Bring back table version 6 active timeline Key: HUDI-8128 URL: https://issues.apache.org/jira/browse/HUDI-8128 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8127) Introduce active timeline implementation based on timeline layout version
Vinoth Chandar created HUDI-8127: Summary: Introduce active timeline implementation based on timeline layout version Key: HUDI-8127 URL: https://issues.apache.org/jira/browse/HUDI-8127 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinoth Chandar -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs
[ https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8103: - Description: based on the value of hoodie.table.version, we need to write different set of default table configs, and potentially different default values. > Introduce table version specific grouping of table configs > -- > > Key: HUDI-8103 > URL: https://issues.apache.org/jira/browse/HUDI-8103 > Project: Apache Hudi > Issue Type: Sub-task > Components: core >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > based on the value of hoodie.table.version, we need to write different set of > default table configs, and potentially different default values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-8103) Introduce table version specific grouping of table configs
Vinoth Chandar created HUDI-8103: Summary: Introduce table version specific grouping of table configs Key: HUDI-8103 URL: https://issues.apache.org/jira/browse/HUDI-8103 Project: Apache Hudi Issue Type: Sub-task Components: core Reporter: Vinoth Chandar Assignee: Vinoth Chandar Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Description: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer completes switching. Writers/table services should fail if working with an unsupported table version, without which one cannot start switching writers to new version (this may still need a minor release on the last 2-3 table versions?) h3. High level approach: We need to introduce table version aware reading/writing inside the core layers of Hudi, as well as query engines like Spark/Flink. To this effect : We need a HoodieStorageFormat abstraction that can cover the following layers . *Table Properties* All table properties need to be table version aware. *Timeline* Timeline already has a timeline layout version, which can be extended to write older and newer timeline. The ArchivedTimeline can be old style or LSM style, depending on table version. 
Typically, we have only written the timeline with the latest schema; we may have to version the .avsc files themselves and write using a table version specific Avro class? Also we need a flexible mapping from action names used in each table version, to handle things like replacecommit to cluster. {*}TBD{*}: whether or not completion time based changes can be retained, assuming instant file creation timestamp. h4. *FileSystemView* Need to support file slice grouping based on old/new timeline + file naming combination. Also need abstractions for how files are named in each table version, to handle things like log file name changes. h4. *WriteHandle* This layer may or may not need changes, as base files don't really change, and the log format is already versioned (see next point). However, it's prudent to ensure we have a mechanism for this, since there could be different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) h4. Metadata table Encoding of k-v pairs, their schemas etc., and which partitions are supported in what table versions. *TBD* to see if any of the code around recent simplification needs to be undone. h4. LogFormat Reader/Writer The blocks and the format itself are versioned, but this is not yet tied to the table version overall. So we need these links so that the reader can, for example, decide how to read/assemble log file scans. FileGroupReader abstractions may need to change as well. h4. Table Service This should largely be independent and built on layers above, but some cases exist, depending on whether the code relies on format specifics for generating plans or execution. This may be a mix of code changes, acceptable behavior changes and format specific specializations. TBD: to see if behavior like writing rollback block needs to be was: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). 
This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able to manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer completes switching.
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Description: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer completes switching. Writers/table services should fail if working with an unsupported table version, without which one cannot start switching writers to new version (this may still need a minor release on the last 2-3 table versions?) h3. High level approach: We need to introduce table version aware reading/writing inside the core layers of Hudi, as well as query engines like Spark/Flink. To this effect : We need a HoodieStorageFormat abstraction that can cover the following layers . *Table Properties* ** All table properties need to be table version aware. *Timeline* Timeline already has a timeline layout version, which can be extended to write older and newer timeline. 
The ArchivedTimeline can be old style or LSM style, depending on table version. Typically, we have only written timeline with latest schema, we may have to version the .avsc files themselves and write using the table version specific avro specific class? Also we need a flexible mapping from action names used in each table version, to handle things like replacecommit to cluster. {*}TBD{*}: whether or not completion time based changes can be retained, assuming instant file creation timestamp. h4. *FileSystemView* Need to support file slice grouping based on old/new timeline + file naming combination. Also need abstractions for how files are named in each table version, to handle things like log file name changes. h4. *WriteHandle* This layer may or may not need changes, as base files don't really change. and log format is already versioned (see next point). However, it's prudent to ensure we have a mechanism for this, since there could be different ways of encoding records or footers etc (e.g HFile k-v pairs, ...) h4. Metadata table Encoding of k-v pairs, their schemas etc.. which partitions are supported in what table versions. *TBD* to see if any of the code around recent simplification needs to be undone. h4. LogFormat Reader/Writer the blogs and format itself is version, but its not yet tied to the table version overall. So we need these links so that the reader can for eg decide how to read/assemble log file scanning? FileGroupReader abstractions may need to change as well. h4. Table Service This should largely be independent and built on layers above. but some cases exist, depending on whether the code relies on format specifics for generating plans or execution. This may be a mix of code changes, acceptable behavior changes and format specific specializations. TBD: to see if behavior like writing rollback block needs to be was: *(!) Work in Progress.* h3. 
Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Description: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer completes switching. Writers/table services should fail if working with an unsupported table version, without which one cannot start switching writers to new version (this may still need a minor release on the last 2-3 table versions?) h3. High level approach: We need to introduce table version aware reading/writing inside the core layers of Hudi, as well as query engines like Spark/Flink. To this effect : We need a HoodieStorageFormat abstraction that can cover the following layers . Timeline : Timeline already has a timeline layout version, which can be extended to write older and newer timeline. The ArchivedTimeline can be old style or LSM style, depending on table version. 
{*}TBD{*}: whether or not completion time based changes can be retained, assuming instant file creation timestamp. WriteHandle : This layer may or may not need changes, as base files don't really change, and the log format is already versioned (see next point). However, it's prudent to ensure we have a mechanism for this, since there could be different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) Metadata table: Encoding of k-v pairs, their schemas etc., and which partitions are supported in what table versions. *TBD* to see if any of the code around recent simplification needs to be undone. LogFormat Reader/Writer : The blocks and the format itself are versioned, but this is not yet tied to the table version overall. So we need these links so that the reader can, for example, decide how to read/assemble log file scans. FileGroupReader abstractions may need to change as well. was: *(!) Work in Progress.* h3. Basic Idea: Introduce support for Hudi writer code to produce storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling upgraded easily, at different cadence without need for any tight co-ordination. Additionally, the reader should have ability to "dynamically" deduce table version based on table properties, such that when the writer is switched to the latest table version, subsequent reads will just adapt and read it as the latest table version. Operators still need to ensure all readers have the latest binary that supports a given table version, before switching the writer to that version. Special consideration to table services, as reader/writer processes, that should be able to manage the tables as well. Queries should gracefully fail during table version switches and start eventually succeeding when writer completes switching. 
Writers/table services should fail if working with an unsupported table version, without which one cannot start switching writers to new version (this may still need a minor release on the last 2-3 table versions?) h3. High level approach: We need to introduce table version aware reading/writing inside the core layers of Hudi, as well as query engines like Spark/Flink. To this effect : We need a HoodieStorageFormat abstraction that can cover the following layers . # Timeline : Timeline already has a timeline layout version, which can be extended to write older and newer timeline. The ArchivedTimeline can be old style or LSM style, depending on table version. {*}TBD{*}: whether or not completion time based changes can be retained, assuming instant file creation timestamp. # WriteHandle : This layer may or may not need changes, as base files don't really change. and log format is already versioned (see next point). However, it's p
[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-8076: - Description: *(!) Work in Progress.*
h3. Basic Idea:
Introduce support for the Hudi writer code to produce the storage format for the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue reading the table even when writers are upgraded, as long as the writer produces storage compatible with the latest table version the reader can read. The readers can then be rolling-upgraded easily, at different cadences, without the need for any tight coordination. Additionally, the reader should have the ability to "dynamically" deduce the table version from table properties, so that when the writer is switched to the latest table version, subsequent reads simply adapt and read it as the latest table version.
Operators still need to ensure all readers have the latest binary that supports a given table version before switching the writer to that version. Special consideration goes to table services, which, as reader/writer processes, should be able to manage the tables as well. Queries should fail gracefully during table version switches and eventually start succeeding once the writer completes the switch. Writers/table services should fail when working with an unsupported table version, without which one cannot start switching writers to the new version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach:
We need to introduce table-version-aware reading/writing inside the core layers of Hudi, as well as in query engines like Spark/Flink. To this effect, we need a HoodieStorageFormat abstraction that can cover the following layers.
# Timeline: The timeline already has a timeline layout version, which can be extended to write both older and newer timelines. The ArchivedTimeline can be old style or LSM style, depending on the table version. {*}TBD{*}: whether or not completion-time-based changes can be retained, assuming the instant file creation timestamp.
# WriteHandle: This layer may or may not need changes, as base files don't really change, and the log format is already versioned (see next point). However, it's prudent to ensure we have a mechanism for this, since there could be different ways of encoding records, footers, etc. (e.g. HFile k-v pairs, ...)
# Metadata table: Encoding of k-v pairs, their schemas, etc., and which partitions are supported in which table versions. *TBD* to see if any of the code around the recent simplification needs to be undone.
# LogFormat Reader/Writer: The blocks and the format itself are versioned, but this is not yet tied to the overall table version. We need these links so that the reader can, e.g., decide how to read/assemble log files during scanning. The FileGroupReader abstractions may need to change as well.
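The high-level approach above amounts to a version-dispatch table: each supported table version pins its timeline layout, archived-timeline style, and log format version, and a writer fails fast on anything else. A minimal sketch, assuming hypothetical names (`format_for`, the profile fields) and illustrative version mappings, none of which are actual Hudi APIs:

```python
# Illustrative sketch only: the profile values below are assumptions for
# discussion, not Hudi's real version mappings.
SUPPORTED_TABLE_VERSIONS = {
    # table version -> storage-format profile the writer must produce
    6: {"archived_timeline": "legacy", "timeline_layout": "v1", "log_format": "v1"},
    8: {"archived_timeline": "LSM", "timeline_layout": "v2", "log_format": "v2"},
}

def format_for(table_version):
    """Resolve the storage-format profile for a table version, failing fast
    on versions this writer binary does not support."""
    if table_version not in SUPPORTED_TABLE_VERSIONS:
        raise ValueError(
            f"Unsupported table version {table_version}; "
            f"supported: {sorted(SUPPORTED_TABLE_VERSIONS)}"
        )
    return SUPPORTED_TABLE_VERSIONS[table_version]
```

The fail-fast branch models the requirement that writers/table services refuse to operate on an unsupported table version, rather than silently producing an incompatible format.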
[jira] [Commented] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874995#comment-17874995 ] Vinoth Chandar commented on HUDI-8076: -- I think we can reuse RFC-78, unless we plan to support both a bridge release and writing older table versions. > RFC for backwards compatible writer mode in Hudi 1.0 > > > Key: HUDI-8076 > URL: https://issues.apache.org/jira/browse/HUDI-8076 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0
[ https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-8076: Assignee: Vinoth Chandar > RFC for backwards compatible writer mode in Hudi 1.0 > > > Key: HUDI-8076 > URL: https://issues.apache.org/jira/browse/HUDI-8076 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7585) Avoid reading log files for resolving schema for _hoodie_operation field
[ https://issues.apache.org/jira/browse/HUDI-7585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7585: Assignee: Danny Chen > Avoid reading log files for resolving schema for _hoodie_operation field > > > Key: HUDI-7585 > URL: https://issues.apache.org/jira/browse/HUDI-7585 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6713) Redesign CDC workload to include partition column for partition pruning
[ https://issues.apache.org/jira/browse/HUDI-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-6713: Assignee: Danny Chen (was: Vinoth Chandar) > Redesign CDC workload to include partition column for partition pruning > --- > > Key: HUDI-6713 > URL: https://issues.apache.org/jira/browse/HUDI-6713 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Aditya Goenka >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > Currently our cdc columns have all the columns nested under before and after. > Redesign schema to pull up partition column outside to support partition > pruning on read side. > > Github issue - [https://github.com/apache/hudi/issues/9426] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6778) Track schema in metadata table
[ https://issues.apache.org/jira/browse/HUDI-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6778: - Status: In Progress (was: Open) > Track schema in metadata table > -- > > Key: HUDI-6778 > URL: https://issues.apache.org/jira/browse/HUDI-6778 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates
[ https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844069#comment-17844069 ] Vinoth Chandar commented on HUDI-7234: -- This is to be handled in 1.1.0, along with partial update encoding across records. > Handle both inserts and updates in log blocks for partial updates > - > > Key: HUDI-7234 > URL: https://issues.apache.org/jira/browse/HUDI-7234 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > Inserts can be written to log blocks, e.g., Flink. We need to handle such > cases for partial updates, i.e. a mix of inserts and partial updates in the same > data block. -- This message was sent by Atlassian Jira (v8.20.10#820010)
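The issue above is that a single log data block can carry both brand-new records (inserts) and partial updates (only the changed fields present), so merging must treat a key missing from the base as an insert and otherwise patch only the supplied fields. A minimal sketch of that merge rule, with hypothetical names and record shapes (not Hudi's implementation):

```python
# Illustrative sketch: `base` maps record key -> full record (dict of fields);
# `block` is an ordered list of (key, fields) entries from one log data block.
def merge_block(base: dict, block: list) -> dict:
    """Merge a mixed insert/partial-update data block into base records."""
    merged = {k: dict(v) for k, v in base.items()}  # copy; don't mutate base
    for key, fields in block:
        if key not in merged:
            merged[key] = dict(fields)       # insert: take the record as-is
        else:
            merged[key].update(fields)       # partial update: patch only the
                                             # fields present in the block
    return merged
```

Ordering matters: later entries in the block win for the fields they carry, mirroring log-block replay order.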
[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates
[ https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7234: - Status: Open (was: In Progress) > Handle both inserts and updates in log blocks for partial updates > - > > Key: HUDI-7234 > URL: https://issues.apache.org/jira/browse/HUDI-7234 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > Inserts can be written to log blocks, e.g., Flink. We need to handle such > case for partial updates i.e mix of inserts and partial updates to the same > data block. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates
[ https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7234: - Fix Version/s: 1.1.0 (was: 1.0.0) > Handle both inserts and updates in log blocks for partial updates > - > > Key: HUDI-7234 > URL: https://issues.apache.org/jira/browse/HUDI-7234 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.1.0 > > > Inserts can be written to log blocks, e.g., Flink. We need to handle such > case for partial updates i.e mix of inserts and partial updates to the same > data block. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)
[ https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7541: - Fix Version/s: 1.1.0 (was: 1.0.0) > Ensure extensibility to new indexes - vectors, search and other formats (CLP, > unstructured data) > > > Key: HUDI-7541 > URL: https://issues.apache.org/jira/browse/HUDI-7541 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)
[ https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844068#comment-17844068 ] Vinoth Chandar commented on HUDI-7541: -- Punting this to 1.1 > Ensure extensibility to new indexes - vectors, search and other formats (CLP, > unstructured data) > > > Key: HUDI-7541 > URL: https://issues.apache.org/jira/browse/HUDI-7541 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.1.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types
[ https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar closed HUDI-7679. Resolution: Duplicate > Ensure extensibility to unstructured data, logs (CLP), vectors, other index > types > - > > Key: HUDI-7679 > URL: https://issues.apache.org/jira/browse/HUDI-7679 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Fix Version/s: (was: 1.1.0) > Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > For the sake of more consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51-based CDC feature, which is a > debezium-style change log (currently supported for CoW for Spark/Flink). > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource > Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the > debezium-style output is not what Flink needs, e.g.)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to be incorporated to support existing users/usage of > changelog. The CDC format is more generalized in the database world. It offers > advantages like not requiring further downstream processing to, say, stitch > together +U and -U to update a downstream table (e.g. a field that > changed is a key in a downstream table, so we need both +U and -U to compute > the updates). 
> > (A) Introduce a new "changelog" output mode for CDC queries, which generates the > I,+U,-U,D format that changelog needs (this can be constructed easily by > processing the output of the CDC query as follows) > * when before is `null`, emit I > * when after is `null`, emit D > * when both are non-null, emit two records, +U and -U > (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop > publishing to the _hoodie_operation field > # this means anyone querying this field using a snapshot query will break. > # we will bring this back in 1.1 etc., based on user feedback, as a > hidden field in the FlinkCatalog. > (C) To support backwards compatibility, we fall back to reading > `_hoodie_operation` in 0.X tables. > For CDC reads, we first use the CDC log if it's available for that file > slice. If not, and the base file schema already has {{_hoodie_operation}}, we > fall back to reading {{_hoodie_operation}} from the base file if > mode=OP_KEY_ONLY. Throw an error for other modes. > (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that > have `_hoodie_operation` published. > This is already completed for Spark, so others should be easy to do. > > (E) We need to complete a review of the CDC schema > ts - should be completion time or instant time? > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
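The three bullet rules under proposal (A) can be sketched as a small converter from a debezium-style CDC record (before/after images) to the changelog rows (I, +U, -U, D) a Flink-style consumer needs. The function and record shapes are illustrative, not Hudi's actual CDC schema:

```python
# Sketch of rule (A): derive changelog ops from a CDC before/after pair.
def cdc_to_changelog(before, after):
    """Map one CDC record to a list of (op, row) changelog entries."""
    if before is None and after is None:
        raise ValueError("CDC record must have at least one of before/after")
    if before is None:
        return [("I", after)]                    # insert: no prior image
    if after is None:
        return [("D", before)]                   # delete: no new image
    return [("-U", before), ("+U", after)]       # update: retract old, emit new
```

This is the "further downstream processing" the description mentions: a consumer that needs both -U and +U (e.g. when a changed field is a key in a downstream table) can derive them locally from the CDC output.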
[jira] [Resolved] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types
[ https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved HUDI-7679. -- > Ensure extensibility to unstructured data, logs (CLP), vectors, other index > types > - > > Key: HUDI-7679 > URL: https://issues.apache.org/jira/browse/HUDI-7679 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates
[ https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7234: - Status: In Progress (was: Open) > Handle both inserts and updates in log blocks for partial updates > - > > Key: HUDI-7234 > URL: https://issues.apache.org/jira/browse/HUDI-7234 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > Inserts can be written to log blocks, e.g., Flink. We need to handle such > case for partial updates i.e mix of inserts and partial updates to the same > data block. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)
[ https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7541: - Status: In Progress (was: Open) > Ensure extensibility to new indexes - vectors, search and other formats (CLP, > unstructured data) > > > Key: HUDI-7541 > URL: https://issues.apache.org/jira/browse/HUDI-7541 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types
[ https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7679: - Status: In Progress (was: Open) > Ensure extensibility to unstructured data, logs (CLP), vectors, other index > types > - > > Key: HUDI-7679 > URL: https://issues.apache.org/jira/browse/HUDI-7679 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Fix Version/s: 1.1.0 > Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.1.0, 1.0.0 > > > For the sake of consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a > Debezium-style change log (currently supported for CoW for Spark/Flink). > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource > Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the > Debezium-style output is not what Flink needs, e.g.)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to be incorporated to support existing users/usage of > changelog. The CDC format is more generalized in the database world. It offers > advantages like not requiring further downstream processing to, say, stitch > together +U and -U to update a downstream table. E.g., a field that > changed is a key in a downstream table, so we need both +U and -U to compute > the updates. 
> > (A) Introduce a new "changelog" output mode for CDC queries, which generates > the I,+U,-U,D format that changelog needs (this can be constructed easily by > processing the output of a CDC query as follows) > * when before is `null`, emit I > * when after is `null`, emit D > * when both are non-null, emit two records, +U and -U > (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop > publishing to the _hoodie_operation field > # this means anyone querying this field using a snapshot query will break. > # we will bring this back in 1.1 etc., based on user feedback, as a > hidden field in the FlinkCatalog. > (C) To support backward compatibility, we fall back to reading > `_hoodie_operation` in 0.X tables. > For CDC reads, we first use the CDC log if it is available for that file > slice. If not, and the base file schema already has {{_hoodie_operation}}, we > fall back to reading {{_hoodie_operation}} from the base file if > mode=OP_KEY_ONLY. Throw an error for other modes. > (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that > have `_hoodie_operation` published. > This is already completed for Spark, so others should be easy to do. > > (E) We need to complete a review of the CDC schema: > ts - should it be completion time or instant time? > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
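The conversion rules in (A) above are simple enough to sketch. The following is a hypothetical illustration (not Hudi's actual API; the function name and record shapes are made up) of deriving the proposed "changelog" output mode from the before/after images a CDC query emits:

```python
# Hypothetical sketch of rule (A): map one CDC row (before/after images)
# to changelog records. Not Hudi code; names are illustrative.

def cdc_to_changelog(before, after):
    """Return the changelog records (op, row) for one CDC before/after pair."""
    if before is None:
        return [("I", after)]                    # no before image -> insert
    if after is None:
        return [("D", before)]                   # no after image -> delete
    return [("-U", before), ("+U", after)]       # update -> retract old, add new
```

For example, a CDC row with `before=None` yields a single `I` record, while an update yields the `-U`/`+U` pair that streaming consumers such as Flink expect.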
[jira] [Resolved] (HUDI-7540) Check for gaps on storing inserts on log files
[ https://issues.apache.org/jira/browse/HUDI-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved HUDI-7540. -- > Check for gaps on storing inserts on log files > -- > > Key: HUDI-7540 > URL: https://issues.apache.org/jira/browse/HUDI-7540 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates
[ https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7234: - Description: Inserts can be written to log blocks, e.g., Flink. We need to handle such a case for partial updates, i.e., a mix of inserts and partial updates to the same data block. (was: Inserts can be written to log blocks, e.g., Flink. We need to handle such a case for partial updates.) > Handle both inserts and updates in log blocks for partial updates > - > > Key: HUDI-7234 > URL: https://issues.apache.org/jira/browse/HUDI-7234 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > Inserts can be written to log blocks, e.g., Flink. We need to handle such > a case for partial updates, i.e., a mix of inserts and partial updates to the > same data block. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7540) Check for gaps on storing inserts on log files
[ https://issues.apache.org/jira/browse/HUDI-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar closed HUDI-7540. Resolution: Invalid > Check for gaps on storing inserts on log files > -- > > Key: HUDI-7540 > URL: https://issues.apache.org/jira/browse/HUDI-7540 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7229) Enable partial updates for CDC work payload
[ https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7229: - Description: OLTP workloads on upstream databases often update/delete/insert different columns in the table on each operation. Currently, Hudi can only support partial updates in cases where the same columns are being mutated in a given write to Hudi (e.g., Spark SQL ETLs with MIT or UPDATE statements). Here, we explore what it takes to support a smarter storage format that encodes only the changed columns into the log, along with the different implementations. h2. Goals # Enable partial update functionality for all existing and potential future CDC workloads without huge modification or duplication. # Performance parity with current full-record updates or partial updates across the same set of columns. # Exhibit a reduction in storage costs by storing only the changed columns. # Should also result in computation cost reductions by scanning/processing less data. # Should not affect the scalability of the existing ingestion system. The number of files generated for partial updates should not increase dramatically. was:DMS, Debezium, etc. > Enable partial updates for CDC work payload > --- > > Key: HUDI-7229 > URL: https://issues.apache.org/jira/browse/HUDI-7229 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Vinoth Chandar >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0 > > > OLTP workloads on upstream databases often update/delete/insert different > columns in the table on each operation. Currently, Hudi can only support > partial updates in cases where the same columns are being mutated in a given > write to Hudi (e.g., Spark SQL ETLs with MIT or UPDATE statements). Here, we > explore what it takes to support a smarter storage format that encodes > only the changed columns into the log, along with the different implementations. > h2. 
Goals > # Enable partial update functionality for all existing and potential future > CDC workloads without huge modification or duplication. > # Performance parity with current full-record updates or partial updates > across the same set of columns. > # Exhibit a reduction in storage costs by storing only the changed columns. > # Should also result in computation cost reductions by scanning/processing > less data. > # Should not affect the scalability of the existing ingestion system. > The number of files generated for partial updates should not increase > dramatically. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
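The core idea behind the storage savings described above — a log entry carrying only the changed columns, merged onto the full record at read time — can be sketched as follows. This is a hypothetical illustration, not Hudi's implementation; the function name and record shapes are made up:

```python
# Hypothetical sketch of partial-update merging: a log entry stores only
# the changed columns; the reader overlays it on the current full record.
# Not Hudi code; names are illustrative.

def apply_partial_update(current, partial):
    """Merge a partial-update record (changed columns only) onto the full record."""
    merged = dict(current)       # start from the current full record
    merged.update(partial)       # overwrite only the columns the log entry carries
    return merged
```

Storing `{"b": 9}` instead of the full row is where the storage and scan-cost reductions in the goals come from: unchanged columns are neither written nor reprocessed.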
[jira] [Commented] (HUDI-7229) Enable partial updates for CDC work payload
[ https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843110#comment-17843110 ] Vinoth Chandar commented on HUDI-7229: -- Punting this to 1.1 # [1.1] Implement support on top of data blocks. ## we need to pass changed-columns information and the operation all the way to the write handles, using a field in HoodieRecord ## ... # [1.1] Implement support on top of cdc data blocks. ## we can track similar bitmaps for cdc data blocks as well ## we need to extend the new file group reader to also merge base and cdc blocks (not just base and data blocks). > Enable partial updates for CDC work payload > --- > > Key: HUDI-7229 > URL: https://issues.apache.org/jira/browse/HUDI-7229 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Vinoth Chandar >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0 > > > OLTP workloads on upstream databases often update/delete/insert different > columns in the table on each operation. Currently, Hudi can only support > partial updates in cases where the same columns are being mutated in a given > write to Hudi (e.g., Spark SQL ETLs with MIT or UPDATE statements). Here, we > explore what it takes to support a smarter storage format that encodes > only the changed columns into the log, along with the different implementations. > h2. Goals > # Enable partial update functionality for all existing and potential future > CDC workloads without huge modification or duplication. > # Performance parity with current full-record updates or partial updates > across the same set of columns. > # Exhibit a reduction in storage costs by storing only the changed columns. > # Should also result in computation cost reductions by scanning/processing > less data. > # Should not affect the scalability of the existing ingestion system. > The number of files generated for partial updates should not increase > dramatically. 
> -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7229) Enable partial updates for CDC work payload
[ https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7229: - Fix Version/s: 1.1.0 (was: 1.0.0) > Enable partial updates for CDC work payload > --- > > Key: HUDI-7229 > URL: https://issues.apache.org/jira/browse/HUDI-7229 > Project: Apache Hudi > Issue Type: Task >Reporter: Lin Liu >Assignee: Vinoth Chandar >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0 > > > DMS, Debezium, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7671) Make Hudi timeline backward compatible
[ https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843109#comment-17843109 ] Vinoth Chandar commented on HUDI-7671: -- balaji - this may be a dupe. > Make Hudi timeline backward compatible > -- > > Key: HUDI-7671 > URL: https://issues.apache.org/jira/browse/HUDI-7671 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Balaji Varadarajan >Priority: Major > Labels: compatibility > Fix For: 1.0.0 > > > Since the 1.x release, the timeline metadata file name has changed to include the > completion time; we need to keep compatibility with 0.x branches/releases. > 0.x meta file name pattern: ${instant_time}.action[.state] > 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]. > In the 1.x release, while deciphering the Hudi instant from the metadata files, if > there is no completion time, use the file modification time as the > completion time instead. > The modification time follows the OCC concurrency control semantics if the > files were not moved around. > Caution: if the table is a MOR table and the files were moved in history > from an old folder to the current folder, the reader view may present a wrong > result set because the completion times are completely the same for all the > alive instants. -- This message was sent by Atlassian Jira (v8.20.10#820010)
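The compatibility rule in HUDI-7671 — parse both file name patterns and fall back to the file modification time when no completion time is embedded — can be sketched as below. This is a hypothetical illustration, not Hudi's parser; the regex, function, and field names are made up for this sketch:

```python
# Hypothetical sketch of the 0.x/1.x timeline file name compatibility rule.
# 0.x: ${instant_time}.action[.state]
# 1.x: ${instant_time}_${completion_time}.action[.state]
# Not Hudi code; names are illustrative.
import re

_INSTANT_PATTERN = re.compile(r"^(\d+)(?:_(\d+))?\.(\w+)(?:\.(\w+))?$")

def parse_instant(file_name, file_mod_time):
    """Parse a timeline meta file name; fall back to file_mod_time for 0.x files."""
    m = _INSTANT_PATTERN.match(file_name)
    if not m:
        raise ValueError(f"not a timeline meta file: {file_name}")
    instant_time, completion_time, action, state = m.groups()
    return {
        "instant_time": instant_time,
        # 0.x files carry no completion time, so use the modification time,
        # which follows OCC semantics as long as the files were never moved
        "completion_time": completion_time or file_mod_time,
        "action": action,
        "state": state or "completed",
    }
```

Note the caveat from the ticket still applies: if files were copied into place, their modification times collapse to the same value and this fallback yields a wrong ordering.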
[jira] [Updated] (HUDI-7671) Make Hudi timeline backward compatible
[ https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7671: - Epic Link: HUDI-6242 > Make Hudi timeline backward compatible > -- > > Key: HUDI-7671 > URL: https://issues.apache.org/jira/browse/HUDI-7671 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: compatibility > Fix For: 1.0.0 > > > Since the 1.x release, the timeline metadata file name has changed to include the > completion time; we need to keep compatibility with 0.x branches/releases. > 0.x meta file name pattern: ${instant_time}.action[.state] > 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]. > In the 1.x release, while deciphering the Hudi instant from the metadata files, if > there is no completion time, use the file modification time as the > completion time instead. > The modification time follows the OCC concurrency control semantics if the > files were not moved around. > Caution: if the table is a MOR table and the files were moved in history > from an old folder to the current folder, the reader view may present a wrong > result set because the completion times are completely the same for all the > alive instants. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7671) Make Hudi timeline backward compatible
[ https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7671: Assignee: Balaji Varadarajan (was: Danny Chen) > Make Hudi timeline backward compatible > -- > > Key: HUDI-7671 > URL: https://issues.apache.org/jira/browse/HUDI-7671 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Balaji Varadarajan >Priority: Major > Labels: compatibility > Fix For: 1.0.0 > > > Since the 1.x release, the timeline metadata file name has changed to include the > completion time; we need to keep compatibility with 0.x branches/releases. > 0.x meta file name pattern: ${instant_time}.action[.state] > 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state]. > In the 1.x release, while deciphering the Hudi instant from the metadata files, if > there is no completion time, use the file modification time as the > completion time instead. > The modification time follows the OCC concurrency control semantics if the > files were not moved around. > Caution: if the table is a MOR table and the files were moved in history > from an old folder to the current folder, the reader view may present a wrong > result set because the completion times are completely the same for all the > alive instants. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7678) Finalize the Merger APIs and make a plan for moving over all existing built-in, custom payloads.
[ https://issues.apache.org/jira/browse/HUDI-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7678: - Description: With the move towards making partial updates a first-class citizen that does not need any special payloads/merges, we need to move the CDC payloads to all be transformers in the Hudi Streamer and SQL write paths, along with migration instructions for users. # Partial update has been implemented for the Spark SQL source as follows: ## Configuration {{ hoodie.write.partial.update.schema }} is used for partial updates. ## {{ExpressionPayload}} creates the writer schema based on the configuration. ## {{HoodieAppendHandle}} creates the log file based on the configuration and the corresponding partial schema. ## Currently this handle assumes these records are all update records. ## We need to understand if ExpressionPayload/SQL Merger is needed going forward. # For DeltaStreamer, our goal is to remove all silo CDC payloads, e.g., Debezium or AWSDMS, and to provide CDC data as {{InternalRow}} type. Therefore, ## The {{transformer}} in DeltaStreamer prepares the data according to the types of the sources. ## Initially, it's okay to just support full-row updates/deletes/... # Audit that all of them properly combine I/U/D into data and delete blocks, such that U-after-D and D-after-U scenarios are handled as expected. was: With the move towards making partial updates a first-class citizen that does not need any special payloads/merges, we need to move the CDC payloads to all be transformers in the Hudi Streamer and SQL write paths, along with migration instructions for users. # Partial update has been implemented for the Spark SQL source as follows: ## Configuration {{ hoodie.write.partial.update.schema }} is used for partial updates. ## {{ExpressionPayload}} creates the writer schema based on the configuration. ## {{HoodieAppendHandle}} creates the log file based on the configuration and the corresponding partial schema. 
## Currently this handle assumes these records are all update records. ## We need to understand if ExpressionPayload/SQL Merger is needed going forward. # For DeltaStreamer, our goal is to remove all silo CDC payloads, e.g., Debezium or AWSDMS, and to provide CDC data as {{InternalRow}} type. Therefore, ## The {{transformer}} in DeltaStreamer prepares the data according to the types of the sources. ## Initially, it's okay to just support full-row updates/deletes/... > Finalize the Merger APIs and make a plan for moving over all existing > built-in, custom payloads. > > > Key: HUDI-7678 > URL: https://issues.apache.org/jira/browse/HUDI-7678 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > With the move towards making partial updates a first-class citizen that does > not need any special payloads/merges, we need to move the CDC payloads to all > be transformers in the Hudi Streamer and SQL write paths, along with migration > instructions for users. > # Partial update has been implemented for the Spark SQL source as follows: > ## Configuration {{ hoodie.write.partial.update.schema }} is used for > partial updates. > ## {{ExpressionPayload}} creates the writer schema based on the > configuration. > ## {{HoodieAppendHandle}} creates the log file based on the configuration > and the corresponding partial schema. > ## Currently this handle assumes these records are all update records. > ## We need to understand if ExpressionPayload/SQL Merger is needed going > forward. > # For DeltaStreamer, our goal is to remove all silo CDC payloads, e.g., > Debezium or AWSDMS, and to provide CDC data as {{InternalRow}} type. > Therefore, > ## The {{transformer}} in DeltaStreamer prepares the data according to the > types of the sources. > ## Initially, it's okay to just support full-row updates/deletes/... 
> # Audit that all of them properly combine I/U/D into data and delete > blocks, such that U-after-D and D-after-U scenarios are handled as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
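The audit point above — combining I/U/D into data and delete blocks so that U-after-D and D-after-U resolve correctly — amounts to letting the latest operation per key win. A hypothetical sketch (not Hudi code; function and block names are made up):

```python
# Hypothetical sketch of combining an ordered stream of I/U/D operations
# into a data block (surviving inserts/updates) and a delete block.
# The latest operation per key decides its fate, so U-after-D revives a
# key and D-after-U tombstones it. Not Hudi code; names are illustrative.

def split_into_blocks(ops):
    """ops: list of (key, op, record) in arrival order; op in {'I', 'U', 'D'}."""
    latest = {}
    for key, op, record in ops:
        latest[key] = (op, record)   # a later op for the same key overrides earlier ones
    data_block = {k: rec for k, (op, rec) in latest.items() if op in ("I", "U")}
    delete_block = [k for k, (op, _) in latest.items() if op == "D"]
    return data_block, delete_block
```

With this rule, a key that is deleted and then updated lands in the data block with its final value, while a key whose last operation is a delete lands only in the delete block.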
[jira] [Updated] (HUDI-7678) Finalize the Merger APIs and make a plan for moving over all existing built-in, custom payloads.
[ https://issues.apache.org/jira/browse/HUDI-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7678: - Description: With the move towards making partial updates a first-class citizen that does not need any special payloads/merges, we need to move the CDC payloads to all be transformers in the Hudi Streamer and SQL write paths, along with migration instructions for users. # Partial update has been implemented for the Spark SQL source as follows: ## Configuration {{ hoodie.write.partial.update.schema }} is used for partial updates. ## {{ExpressionPayload}} creates the writer schema based on the configuration. ## {{HoodieAppendHandle}} creates the log file based on the configuration and the corresponding partial schema. ## Currently this handle assumes these records are all update records. ## We need to understand if ExpressionPayload/SQL Merger is needed going forward. # For DeltaStreamer, our goal is to remove all silo CDC payloads, e.g., Debezium or AWSDMS, and to provide CDC data as {{InternalRow}} type. Therefore, ## The {{transformer}} in DeltaStreamer prepares the data according to the types of the sources. ## Initially, it's okay to just support full-row updates/deletes/... > Finalize the Merger APIs and make a plan for moving over all existing > built-in, custom payloads. > > > Key: HUDI-7678 > URL: https://issues.apache.org/jira/browse/HUDI-7678 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Fix For: 1.0.0 > > > With the move towards making partial updates a first-class citizen that does > not need any special payloads/merges, we need to move the CDC payloads to all > be transformers in the Hudi Streamer and SQL write paths, along with migration > instructions for users. > # Partial update has been implemented for the Spark SQL source as follows: > ## Configuration {{ hoodie.write.partial.update.schema }} is used for > partial updates. 
> ## {{ExpressionPayload}} creates the writer schema based on the > configuration. > ## {{HoodieAppendHandle}} creates the log file based on the configuration > and the corresponding partial schema. > ## Currently this handle assumes these records are all update records. > ## We need to understand if ExpressionPayload/SQL Merger is needed going > forward. > # For DeltaStreamer, our goal is to remove all silo CDC payloads, e.g., > Debezium or AWSDMS, and to provide CDC data as {{InternalRow}} type. > Therefore, > ## The {{transformer}} in DeltaStreamer prepares the data according to the > types of the sources. > ## Initially, it's okay to just support full-row updates/deletes/... -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7665) Rolling upgrade of 1.0
[ https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7665: - Description: We need to update the table version due to the format changes in 1.0. | * Plan to get 1.x readers to read 0.x tables * Rollout plan for users * [Writer side] Migrating log files, timeline, metadata * Migrating table properties including key generators, payloads. * Supporting all the existing payloads or migrating them * Call out breaking changes * Call out behavior changes| was:We need to update the table version due to the format changes in 1.0. > Rolling upgrade of 1.0 > --- > > Key: HUDI-7665 > URL: https://issues.apache.org/jira/browse/HUDI-7665 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Balaji Varadarajan >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > We need to update the table version due to the format changes in 1.0. > | * Plan to get 1.x readers to read 0.x tables > * Rollout plan for users > * [Writer side] Migrating log files, timeline, metadata > * Migrating table properties including key generators, payloads. > * Supporting all the existing payloads or migrating them > * Call out breaking changes > * Call out behavior changes| -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7665) Rolling upgrade of 1.0
[ https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-7665: Assignee: Balaji Varadarajan > Rolling upgrade of 1.0 > --- > > Key: HUDI-7665 > URL: https://issues.apache.org/jira/browse/HUDI-7665 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Balaji Varadarajan >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > We need to update the table version due to the format changes in 1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7665) Rolling upgrade of 1.0
[ https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7665: - Summary: Rolling upgrade of 1.0 (was: Upgrade Table Version) > Rolling upgrade of 1.0 > --- > > Key: HUDI-7665 > URL: https://issues.apache.org/jira/browse/HUDI-7665 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > We need to update the table version due to the format changes in 1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Description: For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW for Spark/Flink). |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the Debezium-style output is not what Flink needs, e.g.)| |Changelog|Yes|low|low|Yes| This proposal is to converge onto "CDC" as the path going forward, with the following changes to be incorporated to support existing users/usage of changelog. The CDC format is more generalized in the database world. It offers advantages like not requiring further downstream processing to, say, stitch together +U and -U to update a downstream table. E.g., a field that changed is a key in a downstream table, so we need both +U and -U to compute the updates. (A) Introduce a new "changelog" output mode for CDC queries, which generates the I,+U,-U,D format that changelog needs (this can be constructed easily by processing the output of a CDC query as follows) * when before is `null`, emit I * when after is `null`, emit D * when both are non-null, emit two records, +U and -U (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop publishing to the _hoodie_operation field # this means anyone querying this field using a snapshot query will break. # we will bring this back in 1.1 etc., based on user feedback, as a hidden field in the FlinkCatalog. (C) To support backward compatibility, we fall back to reading `_hoodie_operation` in 0.X tables. For CDC reads, we first use the CDC log if it is available for that file slice. 
If not, and the base file schema already has {{_hoodie_operation}}, we fall back to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. Throw an error for other modes. (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that have `_hoodie_operation` published. This is already completed for Spark, so others should be easy to do. (E) We need to complete a review of the CDC schema: ts - should it be completion time or instant time? was: For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a Debezium-style change log (currently supported for CoW for Spark/Flink). |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the Debezium-style output is not what Flink needs, e.g.)| |Changelog|Yes|low|low|Yes| This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporate. 
> Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > For the sake of consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a > Debezium-style change log (currently supported for CoW for Spark/Flink). > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource > Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the > Debezium-style output is not what Flink needs, e.g.)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to be incorporated to support existing users/usage of > changelog. The CDC format is more generalized in the database world. It offers > advantages like not requiring further downstream processing to, say, stitch > together +U and -U to update a downstream table. E.g., a field that > changed is a key in a downstream table, so we need both +U and -U to compute > the updates. > > (A) Introduce a new "changelog" output mode for CDC queries, which generates > the I,+U,-U,D format that changelog needs (this can be constructed easily by > processing the output of a CDC query as follows) > * when before is `null`, emit I > * when after is `null`, emit D > * when both are non-null, emit two records, +U and -U > (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop > publishing to _hoodi
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Reviewers: Danny Chen, Ethan Guo > Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > For the sake of consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a > Debezium-style change log (currently supported for CoW for Spark/Flink). > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource > Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the > Debezium-style output is not what Flink needs, e.g.)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to be incorporated to support existing users/usage of > changelog. The CDC format is more generalized in the database world. It offers > advantages like not requiring further downstream processing to, say, stitch > together +U and -U to update a downstream table. E.g., a field that > changed is a key in a downstream table, so we need both +U and -U to compute > the updates. 
> > (A) Introduce a new "changelog" output mode for CDC queries, which generates > the I,+U,-U,D format that changelog needs (this can be constructed easily by > processing the output of a CDC query as follows) > * when before is `null`, emit I > * when after is `null`, emit D > * when both are non-null, emit two records, +U and -U > (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop > publishing to the _hoodie_operation field > # this means anyone querying this field using a snapshot query will break. > # we will bring this back in 1.1 etc., based on user feedback, as a > hidden field in the FlinkCatalog. > (C) To support backward compatibility, we fall back to reading > `_hoodie_operation` in 0.X tables. > For CDC reads, we first use the CDC log if it is available for that file > slice. If not, and the base file schema already has {{_hoodie_operation}}, we > fall back to reading {{_hoodie_operation}} from the base file if > mode=OP_KEY_ONLY. Throw an error for other modes. > (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that > have `_hoodie_operation` published. > This is already completed for Spark, so others should be easy to do. > > (E) We need to complete a review of the CDC schema: > ts - should it be completion time or instant time? > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Description: For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a debezium-style change log (currently supported for CoW for Spark/Flink) |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)| |Changelog|Yes|low|low|Yes| This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporate. was: For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a debezium-style change log (currently supported for CoW for Spark/Flink) |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)| |Changelog|Yes|low|low|Yes| This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporate. 
> Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > For the sake of consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a > debezium-style change log (currently supported for CoW for Spark/Flink) > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to incorporate.
[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)
[ https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7538: - Description: For the sake of consistency, we need to consolidate the changelog mode (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a debezium-style change log (currently supported for CoW for Spark/Flink) |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)| |Changelog|Yes|low|low|Yes| This proposal is to converge onto "CDC" as the path going forward, with the following changes to incorporate. > Consolidate the CDC Formats (changelog format, RFC-51) > -- > > Key: HUDI-7538 > URL: https://issues.apache.org/jira/browse/HUDI-7538 > Project: Apache Hudi > Issue Type: Improvement > Components: storage-management >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > For the sake of consistency, we need to consolidate the changelog mode > (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a > debezium-style change log (currently supported for CoW for Spark/Flink) > > |Format Name|CDC Source Required|Resource Cost(writer)|Resource Cost(reader)|Friendly to Streaming| > |CDC|*No*|low/high|low/high (based on logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)| > |Changelog|Yes|low|low|Yes| > This proposal is to converge onto "CDC" as the path going forward, with the > following changes to incorporate.
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6712: - Status: Open (was: Patch Available) > Implement optimized keyed lookup on parquet files > - > > Key: HUDI-6712 > URL: https://issues.apache.org/jira/browse/HUDI-6712 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Parquet performs poorly when performing a lookup of specific records, based > on a single key lookup column. > e.g: select * from parquet where key in ("a", "b", "c") (SQL) > e.g: List lookup(parquetFile, Set keys) (code) > Let's implement a reader that is optimized for this pattern, by scanning > the least amount of data. > Requirements: > 1. Need to support multiple values for the same key. > 2. Can assume the file is sorted by the key/lookup field. > 3. Should handle non-existence of keys. > 4. Should leverage parquet metadata (bloom filters, column index, ... ) to > minimize data read. > 5. Must make the minimum number of RPC calls to cloud storage.
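A minimal sketch of the lookup pattern described above, modeling parquet row groups as sorted chunks carrying min/max statistics (a pure-Python stand-in; the real reader would consult parquet footers, bloom filters and the column index, and batch its range reads to keep RPC counts low):

```python
# Illustrative sketch, not the actual Hudi reader: keyed lookup over a file
# sorted by the key column (requirement 2). Each "row group" is a sorted
# list of (key, value) rows; its first/last keys stand in for the min/max
# statistics parquet keeps per row group (requirement 4). Only groups whose
# [min, max] range can contain a requested key are scanned.

def keyed_lookup(row_groups, keys):
    """Return all rows matching any key. Supports multiple values per key
    (requirement 1) and silently skips absent keys (requirement 3)."""
    results = []
    for group in row_groups:
        if not group:
            continue
        lo, hi = group[0][0], group[-1][0]          # stand-in for rg stats
        wanted = {k for k in keys if lo <= k <= hi}
        if not wanted:
            continue                                 # pruned: no read issued
        results.extend(row for row in group if row[0] in wanted)
    return results
```

Because the file is sorted, each pruned row group costs nothing, and a production version could further binary-search within a scanned group instead of filtering linearly.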
[jira] [Updated] (HUDI-6700) Archiving should be time based, not this min-max and not per instant. Lets treat it like a log (Phase 2)
[ https://issues.apache.org/jira/browse/HUDI-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6700: - Status: Open (was: In Progress) > Archiving should be time based, not this min-max and not per instant. Lets > treat it like a log (Phase 2) > > > Key: HUDI-6700 > URL: https://issues.apache.org/jira/browse/HUDI-6700 > Project: Apache Hudi > Issue Type: Bug >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > >
[jira] [Comment Edited] (HUDI-1045) Support updates during clustering
[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466 ] Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:26 PM: --- h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format If we truly wish to achieve independent operation of C and W, while minimizing the amount of work redone on the writer side, we need to introduce a notion of "pointer data blocks" (name TBD) and a new "retain" command block in Hudi's log format. *Pointer data blocks* A pointer data block just keeps pointers to other blocks in different file groups. {code:java} pointer data block { [ {fg1, logfile X, ..}, {fg2, logfile Y, ..}, {fg3, logfile Z, ..} ] } {code} In this approach, instead of redistributing the records from file groups f1,f2,f3 to f4,f5 - we will * log pointer data blocks to f4, f5, pointing back to the new log files W wrote to file groups f1,f2,f3 (X,Y,Z in the example above) * log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3 (with logfile Z) to indicate to the cleaner service that these log files are supposed to be retained for later, so they can be skipped. * When f4, f5 are compacted, these log files will be reconciled into new file slices in f4, f5 * When the file slices in f4 and f5 are cleaned, the pointers are followed and the retained log files in f1,f2,f3 can also be cleaned. (note: need some reference counting in case f4 is cleaned while f5 is not) Note that there is a 1:N relationship between pointer data blocks and log files. E.g. both f4 and f5's pointer data blocks will be pointing to all three files X,Y,Z. A snapshot read of the file groups f4, f5 needs to carefully filter records to avoid exposing duplicate records to the reader. * merges the log files pointed to by pointer data blocks based on keys, i.e. only delete/update records in the base file for keys that match from the pointed log files. 
* inserts need to be handled with care and need to be distinguishable from an update, e.g. a record in the pointed log files for f4 can either be an insert or an update to a record that is now clustered into f5. * the storage format also needs to store a bitmap indicating which records are inserts vs updates in the pointed-to log files. Once such a list is available, the reader of f4/f5 can split the inserts amongst them by using a hash mod of the insert key, e.g. f4 will get inserts 0, 2, 4, 6 ... in log files X,Y,Z, while f5 will get inserts 1,3,5,7 in log files X,Y,Z h3. Pros: * Truly non-blocking, C and W can go at their own pace and complete without much additional overhead, since the pointer blocks are pretty light to add. * Works even for high-throughput streaming scenarios where there is not much time for the writer to be reconciling (e.g. Flink) h3. Cons: * More merge costs on the read/query side (although this can be very tolerable for the same cases approach 1 works well for) * More complexity and new storage changes. was (Author: vc): h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format if we truly wish to achieve, independent operations of C and W, while minimizing the amount of work redone on the writer side, we need to introduce a notion of "pointer data blocks" (name TBD) in Hudi's log format. *Pointer data blocks* A pointer data block just keeps pointers to other blocks in a different file groups. {code:java} pointer data block { [ {fg1, logfileX, ..}, {fg2, logfileY, ..}, {fg3, logfileX, ..} ] } {code} In this approach, instead of redistributing the records from > Support updates during clustering > - > > Key: HUDI-1045 > URL: https://issues.apache.org/jira/browse/HUDI-1045 > Project: Apache Hudi > Issue Type: Task > Components: clustering, table-service >Reporter: leesf >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > h4. 
We need to allow a writer w writing to file groups f1, f2, f3, > concurrently while a clustering service C reclusters them into f4, f5. > Goals > * Writes can be either updates, deletes or inserts. > * Either clustering C or the writer W can finish first > * Both W and C need to be able to complete their actions without much > redoing of work. > * The number of output file groups for C can be higher or lower than input > file groups. > * Need to work across and be oblivious to whether the writers are operating > in OCC or NBCC modes > * Needs to interplay well with cleaning and compaction services. > h4. Non-goals > * Strictly the sort order achieved by clustering, in face of updates (e.g > updates change clustering field values, causing output clustering file groups > to be not fully sorted by those fields)
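The insert-splitting idea in the comment above (f4 takes inserts 0, 2, 4, 6 while f5 takes 1, 3, 5, 7) amounts to a deterministic partition of the pointed-to inserts across the N output file groups, so each insert surfaces in exactly one of them. A hedged Python sketch, modeling the "index mod N" example (a hash of the record key works the same way; names and shapes are illustrative, not Hudi's storage format):

```python
# Sketch: each reader of the N output file groups (f4, f5, ...) keeps only
# the inserts whose position mod N matches its own index, so the union over
# all readers covers every insert exactly once and no duplicates are exposed.

def inserts_for_reader(pointed_inserts, reader_index, num_readers):
    """pointed_inserts: inserts recovered from the pointed-to log files
    (X, Y, Z), enumerated in a stable order shared by all readers."""
    return [rec for i, rec in enumerate(pointed_inserts)
            if i % num_readers == reader_index]
```

The only requirement is that every reader enumerate the pointed-to inserts in the same stable order, which the proposed insert/update bitmap in the log block would provide.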
[jira] [Comment Edited] (HUDI-1045) Support updates during clustering
[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466 ] Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:01 PM: --- h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format if we truly wish to achieve independent operation of C and W, while minimizing the amount of work redone on the writer side, we need to introduce a notion of "pointer data blocks" (name TBD) in Hudi's log format. *Pointer data blocks* A pointer data block just keeps pointers to other blocks in different file groups. {code:java} pointer data block { [ {fg1, logfileX, ..}, {fg2, logfileY, ..}, {fg3, logfileX, ..} ] } {code} In this approach, instead of redistributing the records from was (Author: vc): h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format if we truly wish to achieve independent operation of C and W, without > Support updates during clustering > - > > Key: HUDI-1045 > URL: https://issues.apache.org/jira/browse/HUDI-1045 > Project: Apache Hudi > Issue Type: Task > Components: clustering, table-service >Reporter: leesf >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > h4. We need to allow a writer w writing to file groups f1, f2, f3, > concurrently while a clustering service C reclusters them into f4, f5. > Goals > * Writes can be either updates, deletes or inserts. > * Either clustering C or the writer W can finish first > * Both W and C need to be able to complete their actions without much > redoing of work. > * The number of output file groups for C can be higher or lower than input > file groups. > * Need to work across and be oblivious to whether the writers are operating > in OCC or NBCC modes > * Needs to interplay well with cleaning and compaction services. > h4. 
Non-goals > * Strictly the sort order achieved by clustering, in face of updates (e.g > updates change clustering field values, causing output clustering file groups > to be not fully sorted by those fields)
[jira] [Updated] (HUDI-1045) Support updates during clustering
[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-1045: - Description: h4. We need to allow a writer w writing to file groups f1, f2, f3, concurrently while a clustering service C reclusters them into f4, f5. Goals * Writes can be either updates, deletes or inserts. * Either clustering C or the writer W can finish first * Both W and C need to be able to complete their actions without much redoing of work. * The number of output file groups for C can be higher or lower than input file groups. * Need to work across and be oblivious to whether the writers are operating in OCC or NBCC modes * Needs to interplay well with cleaning and compaction services. h4. Non-goals * Strictly the sort order achieved by clustering, in face of updates (e.g updates change clustering field values, causing output clustering file groups to be not fully sorted by those fields) was: We need to allow a writer w writing to file groups f1, f2, f3, concurrently while a clustering service C reclusters them into f4, f5. * Writes can be either updates, deletes or inserts. * Either clustering C or the writer W can finish first * Both W and C need to be able to complete their actions without much redoing of work. * The number of output file groups for C can be higher or lower than input file groups. * Need to work across and be oblivious to whether the writers are operating in OCC or NBCC modes * Needs to interplay well with cleaning and compaction services. > Support updates during clustering > - > > Key: HUDI-1045 > URL: https://issues.apache.org/jira/browse/HUDI-1045 > Project: Apache Hudi > Issue Type: Task > Components: clustering, table-service >Reporter: leesf >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > h4. We need to allow a writer w writing to file groups f1, f2, f3, > concurrently while a clustering service C reclusters them into f4, f5. 
> Goals > * Writes can be either updates, deletes or inserts. > * Either clustering C or the writer W can finish first > * Both W and C need to be able to complete their actions without much > redoing of work. > * The number of output file groups for C can be higher or lower than input > file groups. > * Need to work across and be oblivious to whether the writers are operating > in OCC or NBCC modes > * Needs to interplay well with cleaning and compaction services. > h4. Non-goals > * Strictly the sort order achieved by clustering, in face of updates (e.g > updates change clustering field values, causing output clustering file groups > to be not fully sorted by those fields)
[jira] [Comment Edited] (HUDI-1045) Support updates during clustering
[ https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465 ] Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM: --- h2. [WIP] Approach 1 : Redistribute records from the conflicting file groups Within the finalize section (done within a table-level distributed lock), we could either have W or C perform the following. {code:java} W { - identify the file groups that have been clustered concurrently by C - Read out all records written by W, into these conflicting file groups - Redistribute records based on the new record distribution from C - finalize W } {code} {code:java} C { - identify the file groups that have been written to concurrently by W - Read out all records written by such W, into the conflicting file groups - Redistribute records based on the new record distribution from C - finalize C } {code} h3. Pros: # Simple to understand/debug, no storage format changes. # Could work well for cases where the overlap between C and W is rather small. # No extra read amplification for queries, W/C absorbs the cost. {*}Cons{*}: # Can be pretty wasteful with continuous writers or with high overlap between C and W, forcing the entire write to be redone effectively (same as a writer failing and retrying like today) # Particularly more expensive for CoW, where W has paid the cost of merging columnar base files with incoming records. was (Author: vc): h2. [WIP] Approach 1 : Redistribute records from the conflicting file groups Within the finalize section (done within a table level distributed lock), we could either have W or C perform the following . 
{code:java} W { - identify the file groups that have been clustered concurrently by C - Read out all records written by W, into these conflicting file groups - Redistribute records based on new records distribution based on C - finalize W } {code} {code:java} C { - identify the file groups that have been written to concurrently by W. - Read out all records written by such W, into conflicting file groups - Redistribute records based on new records distribution, based on C - finalize C } {code} h3. Pros: # Simple to understand/debug, no storage format changes. # Could work well for cases where # Absorbs any read amplification. h3. Cons: # sort order may be disturbed from the re-distribution of keys. # Can be pretty wasteful, if > Support updates during clustering > - > > Key: HUDI-1045 > URL: https://issues.apache.org/jira/browse/HUDI-1045 > Project: Apache Hudi > Issue Type: Task > Components: clustering, table-service >Reporter: leesf >Assignee: Vinoth Chandar >Priority: Blocker > Fix For: 1.0.0 > > > We need to allow a writer w writing to file groups f1, f2, f3, concurrently > while a clustering service C reclusters them into f4, f5. > * Writes can be either updates, deletes or inserts. > * Either clustering C or the writer W can finish first > * Both W and C need to be able to complete their actions without much > redoing of work. > * The number of output file groups for C can be higher or lower than input > file groups. > * Need to work across and be oblivious to whether the writers are operating > in OCC or NBCC modes > * Needs to interplay well with cleaning and compaction services.
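The finalize steps in the W and C pseudocode blocks above reduce to a set intersection (find the file groups touched by both) followed by a redistribution of the conflicting records across C's output file groups. A hedged Python sketch with illustrative shapes (plain sets and lists stand in for Hudi's actual file-group and record metadata):

```python
# Sketch of the conflict-detection and redistribution steps in Approach 1.
# Not Hudi code: file groups are plain strings, records plain values.

def conflicting_file_groups(writer_groups, clustering_inputs):
    """File groups both written to by W and reclustered by C; only records
    landing in these need to be read back and redistributed."""
    return set(writer_groups) & set(clustering_inputs)

def redistribute(records, output_groups, key_fn=hash):
    """Reassign the conflicting records across C's output file groups,
    using key_fn to pick a deterministic target group per record."""
    plan = {g: [] for g in output_groups}
    for rec in records:
        plan[output_groups[key_fn(rec) % len(output_groups)]].append(rec)
    return plan
```

In the scenario from the issue description, `conflicting_file_groups({"f1", "f2", "f3"}, clustering_inputs)` identifies the overlap, after which whichever of W or C finalizes last redistributes those records into f4 and f5 under the table-level lock.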