[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files

2024-09-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8210:
-
Status: In Progress  (was: Open)

> [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log 
> files
> --
>
> Key: HUDI-8210
> URL: https://issues.apache.org/jira/browse/HUDI-8210
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8211) Map table versions to log block and log format versions

2024-09-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8211:
-
Status: In Progress  (was: Open)

> Map table versions to log block and log format versions
> ---
>
> Key: HUDI-8211
> URL: https://issues.apache.org/jira/browse/HUDI-8211
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8211) Map table versions to log block and log format versions

2024-09-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8211:
-
Fix Version/s: 1.0.0

> Map table versions to log block and log format versions
> ---
>
> Key: HUDI-8211
> URL: https://issues.apache.org/jira/browse/HUDI-8211
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files

2024-09-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8210:
-
Sprint: Hudi 1.0 Sprint 2024/09/16-22

> [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log 
> files
> --
>
> Key: HUDI-8210
> URL: https://issues.apache.org/jira/browse/HUDI-8210
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-8211) Map table versions to log block and log format versions

2024-09-17 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8211:


 Summary: Map table versions to log block and log format versions
 Key: HUDI-8211
 URL: https://issues.apache.org/jira/browse/HUDI-8211
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar








[jira] [Updated] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files

2024-09-17 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8210:
-
Story Points: 8

> [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log 
> files
> --
>
> Key: HUDI-8210
> URL: https://issues.apache.org/jira/browse/HUDI-8210
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-8210) [UMBRELLA] Making 1.0 writer produce both table version 6 & 8 compatible log files

2024-09-17 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8210:


 Summary: [UMBRELLA] Making 1.0 writer produce both table version 6 
& 8 compatible log files
 Key: HUDI-8210
 URL: https://issues.apache.org/jira/browse/HUDI-8210
 Project: Apache Hudi
  Issue Type: New Feature
  Components: storage-management
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 1.0.0








[jira] [Assigned] (HUDI-7227) Enable completion time for File Group Reader

2024-09-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7227:


Assignee: Ethan Guo  (was: Vinoth Chandar)

> Enable completion time for File Group Reader
> 
>
> Key: HUDI-7227
> URL: https://issues.apache.org/jira/browse/HUDI-7227
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lin Liu
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> For all query types, we should enable completion time semantics safely:
>  # Snapshot/RO (MOR, COW)
>  # Incremental (MOR, COW)
>  # TimeTravel (MOR, COW)
>  # CDC (MOR, COW)
>  # Bootstrap





[jira] [Assigned] (HUDI-8129) Bring back table version 6 archived timeline

2024-08-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-8129:


Assignee: Balaji Varadarajan

> Bring back table version 6 archived timeline
> 
>
> Key: HUDI-8129
> URL: https://issues.apache.org/jira/browse/HUDI-8129
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Assigned] (HUDI-8128) Bring back table version 6 active timeline

2024-08-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-8128:


Assignee: Balaji Varadarajan

> Bring back table version 6 active timeline
> --
>
> Key: HUDI-8128
> URL: https://issues.apache.org/jira/browse/HUDI-8128
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
>






[jira] [Assigned] (HUDI-8127) Introduce active timeline implementation based on timeline layout version

2024-08-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-8127:


Assignee: Balaji Varadarajan

> Introduce active timeline implementation based on timeline layout version
> -
>
> Key: HUDI-8127
> URL: https://issues.apache.org/jira/browse/HUDI-8127
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
>






[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch. Writers/table services should fail if working with an
unsupported table version, without which one cannot start switching writers to
the new version (this may still need a minor release on the last 2-3 table
versions?).
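As an illustration of the last point, a minimal sketch of the kind of guard a
writer or table service could apply before operating on a table; the class
name, method, and the supported-version set are assumptions for illustration,
not the actual Hudi API.

{code:java}
// Sketch only: fail fast when this writer binary cannot produce the table's
// version, forcing operators to upgrade binaries before switching versions.
import java.util.Set;

public class TableVersionGuard {
    // Assumption: this binary can write table versions 6 and 8.
    private static final Set<Integer> SUPPORTED_WRITER_VERSIONS = Set.of(6, 8);

    public static void checkWritable(int tableVersion) {
        if (!SUPPORTED_WRITER_VERSIONS.contains(tableVersion)) {
            throw new IllegalStateException(
                "Writer does not support table version " + tableVersion
                    + "; supported: " + SUPPORTED_WRITER_VERSIONS);
        }
    }
}
{code}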
h3. High level approach: 

We need to introduce table version aware reading/writing inside the core layers
of Hudi, as well as in query engines like Spark/Flink.

To this effect, we need a HoodieStorageFormat abstraction that can cover the
following layers.
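One possible shape for such an abstraction, sketched purely for illustration;
only the HoodieStorageFormat name comes from this description, and every member
below is an assumption.

{code:java}
// Sketch only: a per-table-version strategy object that the layers below
// (configs, timeline, file naming, log format) could hang off of.
public interface HoodieStorageFormat {
    int tableVersion();                        // e.g. 6 or 8
    int timelineLayoutVersion();               // timeline layout tied to the table version
    int logFormatVersion();                    // log block/format version (see LogFormat below)
    String logFileName(String fileId, String instantTime,
                       int logVersion, String writeToken);
    String actionName(String canonicalAction); // e.g. "cluster" -> "replacecommit" on v6
}
{code}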

*Table Properties* 

All table properties need to be table version aware. 
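For instance, writing version-specific defaults (cf. HUDI-8103) could look
roughly like the sketch below; apart from hoodie.table.version itself, the
config keys and per-version values are hypothetical.

{code:java}
// Sketch only: pick default table configs based on hoodie.table.version.
import java.util.Map;
import java.util.Properties;

public class VersionedTableConfig {
    public static Properties defaultsFor(int tableVersion) {
        Properties props = new Properties();
        props.setProperty("hoodie.table.version", String.valueOf(tableVersion));
        // Hypothetical per-version defaults; newer versions may flip these.
        Map<String, String> extras = tableVersion >= 8
            ? Map.of("hoodie.timeline.layout.version", "2")
            : Map.of("hoodie.timeline.layout.version", "1");
        extras.forEach(props::setProperty);
        return props;
    }
}
{code}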

*Timeline* 

The timeline already has a timeline layout version, which can be extended to
write both older and newer timelines. The ArchivedTimeline can be old style or
LSM style, depending on the table version. Typically, we have only written the
timeline with the latest schema; we may have to version the .avsc files
themselves and write using table version specific Avro classes? Also, we need a
flexible mapping from the action names used in each table version, to handle
renames like replacecommit to cluster, as sketched below.
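A minimal sketch of that action-name mapping; the assumption that version 8
writes "clustering" where version 6 wrote "replacecommit" is illustrative only.

{code:java}
// Sketch only: translate a canonical action name into the name a given
// table version actually writes on the timeline.
import java.util.Map;

public class ActionNameMapper {
    private static final Map<Integer, Map<String, String>> BY_VERSION = Map.of(
        6, Map.of("cluster", "replacecommit"),
        8, Map.of("cluster", "clustering"));

    public static String forVersion(int tableVersion, String canonicalAction) {
        return BY_VERSION.getOrDefault(tableVersion, Map.of())
            .getOrDefault(canonicalAction, canonicalAction);
    }
}
{code}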

{*}TBD{*}: whether completion time based changes can be retained, assuming the
instant file creation timestamp is used.
h4. *FileSystemView*

Need to support file slice grouping based on the old/new timeline + file naming
combination. We also need abstractions for how files are named in each table
version, to handle things like log file name changes, as sketched below.
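A sketch of table version specific file naming (cf. HUDI-8130). The exact
patterns are assumptions: the idea shown is that older versions embed the file
slice's base instant in the log file name, while newer versions embed the log's
own delta instant time.

{code:java}
// Sketch only: generate a log file name per the table version's convention.
public class LogFileNamer {
    public static String logFileName(int tableVersion, String fileId, String baseInstant,
                                     String deltaInstant, int logVersion, String writeToken) {
        // Assumption: v6 and older key log files by the base instant; newer
        // versions key them by the delta commit's own instant time.
        String instant = tableVersion <= 6 ? baseInstant : deltaInstant;
        return "." + fileId + "_" + instant + ".log." + logVersion + "_" + writeToken;
    }
}
{code}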
h4. *WriteHandle*

This layer may or may not need changes, as base files don't really change, and
the log format is already versioned (see next point). However, it's prudent to
ensure we have a mechanism for this, since there could be different ways of
encoding records or footers etc. (e.g. HFile k-v pairs, ...)
h4. Metadata table 

Encoding of k-v pairs, their schemas, etc., and which partitions are supported
in which table versions. *TBD* to see if any of the code around the recent
simplification needs to be undone.
h4. LogFormat Reader/Writer

The blocks and the format itself are versioned, but they are not yet tied to
the overall table version. So we need these links, so that the reader can, for
example, decide how to read/assemble log files during scanning (see the sketch
below). FileGroupReader abstractions may need to change as well.
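The missing link could be as simple as a static mapping (cf. HUDI-8211 and
HUDI-8131); the concrete log block version numbers below are assumptions.

{code:java}
// Sketch only: associate each table version with the log block version it writes.
import java.util.Map;

public class LogVersionMapping {
    // Hypothetical mapping: table version -> log block version produced by it.
    private static final Map<Integer, Integer> TABLE_TO_LOG_BLOCK = Map.of(6, 2, 8, 3);

    public static int logBlockVersion(int tableVersion) {
        Integer v = TABLE_TO_LOG_BLOCK.get(tableVersion);
        if (v == null) {
            throw new IllegalArgumentException("Unknown table version: " + tableVersion);
        }
        return v;
    }
}
{code}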
h4. Table Service

This should largely be independent and built on the layers above, but some
cases exist, depending on whether the code relies on format specifics for
generating plans or execution. This may be a mix of code changes, acceptable
behavior changes, and format specific specializations.

TBD: to see if behavior like writing rollback block needs to be 

  was:
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch.

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch. Writers/table services should fail if working with an
unsupported table version, without which one cannot start switching writers to
the new version (this may still need a minor release on the last 2-3 table
versions?).
h3. High level approach:

We need to introduce table version aware reading/writing inside the core layers
of Hudi, as well as in query engines like Spark/Flink.

To this effect, we need a HoodieStorageFormat abstraction that can cover the
following layers.

*Table Properties*

All table properties need to be table version aware.

*Timeline*

The timeline already has a timeline layout version, which can be extended to
write both older and newer timelines. The ArchivedTimeline can be old style or
LSM style, depending on the table version. Typically, we have only written the
timeline with the latest schema; we may have to version the .avsc files
themselves and write using table version specific Avro classes? Also, we need a
flexible mapping from the action names used in each table version, to handle
renames like replacecommit to cluster.

{*}TBD{*}: whether completion time based changes can be retained, assuming the
instant file creation timestamp is used.
h4. *FileSystemView*

Need to support file slice grouping based on the old/new timeline + file naming
combination. We also need abstractions for how files are named in each table
version, to handle things like log file name changes.
h4. *WriteHandle*

This layer may or may not need changes, as base files don't really change, and
the log format is already versioned (see next point). However, it's prudent to
ensure we have a mechanism for this, since there could be different ways of
encoding records or footers etc. (e.g. HFile k-v pairs, ...)
h4. Metadata table

Encoding of k-v pairs, their schemas, etc., and which partitions are supported
in which table versions. *TBD* to see if any of the code around the recent
simplification needs to be undone.
h4. LogFormat Reader/Writer

The blocks and the format itself are versioned, but they are not yet tied to
the overall table version. So we need these links, so that the reader can, for
example, decide how to read/assemble log files during scanning. FileGroupReader
abstractions may need to change as well.
h4. Table Service

This should largely be independent and built on the layers above, but some
cases exist, depending on whether the code relies on format specifics for
generating plans or execution. This may be a mix of code changes, acceptable
behavior changes, and format specific specializations.

TBD: to see if behavior like writing rollback block needs to be 

  was:
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch.

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Status: In Progress  (was: Open)

> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> *(!) Work in Progress.* 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for
> the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
> continue reading the table even when writers are upgraded, as long as the
> writer produces storage compatible with the latest table version the reader
> can read. The readers can then be rolling-upgraded easily, at different
> cadences, without any tight coordination. Additionally, the reader should
> have the ability to "dynamically" deduce the table version based on table
> properties, such that when the writer is switched to the latest table
> version, subsequent reads will just adapt and read it as the latest table
> version.
> Operators still need to ensure all readers have the latest binary that
> supports a given table version, before switching the writer to that version.
> Special consideration goes to table services, which, as reader/writer
> processes, should be able to manage the tables as well. Queries should
> gracefully fail during table version switches and eventually start succeeding
> once the writer completes the switch. Writers/table services should fail if
> working with an unsupported table version, without which one cannot start
> switching writers to the new version (this may still need a minor release on
> the last 2-3 table versions?).
> h3. High level approach:
> We need to introduce table version aware reading/writing inside the core
> layers of Hudi, as well as in query engines like Spark/Flink.
> To this effect, we need a HoodieStorageFormat abstraction that can cover the
> following layers.
> *Table Properties*
> All table properties need to be table version aware.
> *Timeline*
> The timeline already has a timeline layout version, which can be extended to
> write both older and newer timelines. The ArchivedTimeline can be old style
> or LSM style, depending on the table version. Typically, we have only written
> the timeline with the latest schema; we may have to version the .avsc files
> themselves and write using table version specific Avro classes? Also, we need
> a flexible mapping from the action names used in each table version, to
> handle renames like replacecommit to cluster.
> {*}TBD{*}: whether completion time based changes can be retained, assuming
> the instant file creation timestamp is used.
> h4. *FileSystemView*
> Need to support file slice grouping based on the old/new timeline + file
> naming combination. We also need abstractions for how files are named in each
> table version, to handle things like log file name changes.
> h4. *WriteHandle*
> This layer may or may not need changes, as base files don't really change,
> and the log format is already versioned (see next point). However, it's
> prudent to ensure we have a mechanism for this, since there could be
> different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...)
> h4. Metadata table
> Encoding of k-v pairs, their schemas, etc., and which partitions are
> supported in which table versions. *TBD* to see if any of the code around the
> recent simplification needs to be undone.
> h4. LogFormat Reader/Writer
> The blocks and the format itself are versioned, but they are not yet tied to
> the overall table version. So we need these links, so that the reader can,
> for example, decide how to read/assemble log files during scanning.
> FileGroupReader abstractions may need to change as well.
> h4. Table Service
> This should largely be independent and built on the layers above, but some
> cases exist, depending on whether the code relies on format specifics for
> generating plans or execution. This may be a mix of code changes, acceptable
> behavior changes, and format specific specializations.
> TBD: to see if behavior like writing rollback block needs to be 





[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8132:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Implement support for writer to produce table version specific log 
> blocks/format
> 
>
> Key: HUDI-8132
> URL: https://issues.apache.org/jira/browse/HUDI-8132
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8132:
-
Epic Link: HUDI-7856

> Implement support for writer to produce table version specific log 
> blocks/format
> 
>
> Key: HUDI-8132
> URL: https://issues.apache.org/jira/browse/HUDI-8132
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8131:
-
Fix Version/s: 1.0.0

> Associate log format and block versions to table versions
> -
>
> Key: HUDI-8131
> URL: https://issues.apache.org/jira/browse/HUDI-8131
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8131:
-
Epic Link: HUDI-7856

> Associate log format and block versions to table versions
> -
>
> Key: HUDI-8131
> URL: https://issues.apache.org/jira/browse/HUDI-8131
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8131) Associate log format and block versions to table versions

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8131:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Associate log format and block versions to table versions
> -
>
> Key: HUDI-8131
> URL: https://issues.apache.org/jira/browse/HUDI-8131
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8130:
-
Fix Version/s: 1.0.0

> Introduce table version specific APIs for file name generation
> --
>
> Key: HUDI-8130
> URL: https://issues.apache.org/jira/browse/HUDI-8130
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8130:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Introduce table version specific APIs for file name generation
> --
>
> Key: HUDI-8130
> URL: https://issues.apache.org/jira/browse/HUDI-8130
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8130) Introduce table version specific APIs for file name generation

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8130:
-
Epic Link: HUDI-7856

> Introduce table version specific APIs for file name generation
> --
>
> Key: HUDI-8130
> URL: https://issues.apache.org/jira/browse/HUDI-8130
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8132:
-
Fix Version/s: 1.0.0

> Implement support for writer to produce table version specific log 
> blocks/format
> 
>
> Key: HUDI-8132
> URL: https://issues.apache.org/jira/browse/HUDI-8132
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8127) Introduce active timeline implementation based on timeline layout version

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8127:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Introduce active timeline implementation based on timeline layout version
> -
>
> Key: HUDI-8127
> URL: https://issues.apache.org/jira/browse/HUDI-8127
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8128) Bring back table version 6 active timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8128:
-
Epic Link: HUDI-7856

> Bring back table version 6 active timeline
> --
>
> Key: HUDI-8128
> URL: https://issues.apache.org/jira/browse/HUDI-8128
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8129:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Bring back table version 6 archived timeline
> 
>
> Key: HUDI-8129
> URL: https://issues.apache.org/jira/browse/HUDI-8129
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8127) Introduce active timeline implementation based on timeline layout version

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8127:
-
Epic Link: HUDI-7856

> Introduce active timeline implementation based on timeline layout version
> -
>
> Key: HUDI-8127
> URL: https://issues.apache.org/jira/browse/HUDI-8127
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8128) Bring back table version 6 active timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8128:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Bring back table version 6 active timeline
> --
>
> Key: HUDI-8128
> URL: https://issues.apache.org/jira/browse/HUDI-8128
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8129:
-
Fix Version/s: 1.0.0

> Bring back table version 6 archived timeline
> 
>
> Key: HUDI-8129
> URL: https://issues.apache.org/jira/browse/HUDI-8129
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8129:
-
Epic Link: HUDI-7856

> Bring back table version 6 archived timeline
> 
>
> Key: HUDI-8129
> URL: https://issues.apache.org/jira/browse/HUDI-8129
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8103:
-
Parent: (was: HUDI-8076)
Issue Type: Improvement  (was: Sub-task)

> Introduce table version specific grouping of table configs
> --
>
> Key: HUDI-8103
> URL: https://issues.apache.org/jira/browse/HUDI-8103
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> Based on the value of hoodie.table.version, we need to write a different set 
> of default table configs, and potentially different default values. 





[jira] [Assigned] (HUDI-7888) Throw meaningful error when reading partial update or DV written in 1.x from 0.16.0 reader

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7888:


Assignee: (was: Jonathan Vexler)

> Throw meaningful error when reading partial update or DV written in 1.x from 
> 0.16.0 reader
> --
>
> Key: HUDI-7888
> URL: https://issues.apache.org/jira/browse/HUDI-7888
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> If the 0.16.x reader is used to read a 1.x table with partial updates/merges 
> enabled, we need to throw a meaningful error to the end user. 
>  
>  





[jira] [Assigned] (HUDI-7971) Test and Certify 0.14.x tables are readable in 1.x Hudi reader

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7971:


Assignee: (was: Lokesh Jain)

> Test and Certify 0.14.x tables are readable in 1.x Hudi reader 
> ---
>
> Key: HUDI-7971
> URL: https://issues.apache.org/jira/browse/HUDI-7971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 1.0.0
>
>
> Let's ensure the 1.x reader is fully compatible with reading any of the 
> 0.14.x to 0.16.x tables.
>  
> Readers: 1.x
>  # Spark SQL
>  # Spark Datasource
>  # Trino/Presto
>  # Hive
>  # Flink
> Writer: 0.16
> Table State:
>  * COW
>  ** few write commits 
>  ** Pending clustering
>  ** Completed Clustering
>  ** Failed writes with no rollbacks
>  ** Insert overwrite table/partition
>  ** Savepoint for Time-travel query
>  * MOR
>  ** Same as COW
>  ** Pending and completed async compaction (with log-files and no base file)
>  ** Custom Payloads (for MOR snapshot queries) (e:g SQL Expression Payload)
>  ** Log block formats - DELETE, rollback block
> Other knobs:
>  # Metadata enabled/disabled (all combinations)
>  # Column Stats enabled/disabled and data-skipping enabled/disabled
>  # RLI enabled with eq/IN queries
>  # Non-Partitioned dataset (all combinations)
>  # CDC Reads 
>  # Incremental Reads
>  # Time-travel query
>  
> What to test?
>  # Query Results Correctness
>  # Performance: See the benefit of 
>  # Partition Pruning
>  # Metadata table - col stats, RLI
>  
> Corner Case Testing:
>  
>  # Schema Evolution with different file groups having different generations 
> of schema
>  # Dynamic Partition Pruning
>  # Does Column Projection work correctly for log file reading?





[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8103:
-
Epic Link: HUDI-7856

> Introduce table version specific grouping of table configs
> --
>
> Key: HUDI-8103
> URL: https://issues.apache.org/jira/browse/HUDI-8103
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> Based on the value of hoodie.table.version, we need to write a different set 
> of default table configs, and potentially different default values. 





[jira] [Assigned] (HUDI-7972) Add fallback for deletion vector in 0.16.x reader while reading 1.x tables

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7972:


Assignee: (was: sivabalan narayanan)

> Add fallback for deletion vector in 0.16.x reader while reading 1.x tables
> --
>
> Key: HUDI-7972
> URL: https://issues.apache.org/jira/browse/HUDI-7972
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: 1.0-migration
> Fix For: 1.0.0
>
>
> If the 0.16.x reader is used to read a 1.x table with a deletion vector, we 
> should fall back to key based merges instead of position based merges. 
>  





[jira] [Assigned] (HUDI-7887) Any log format header types changes need to be ported to 0.16.0 from 1.x

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7887:


Assignee: (was: Jonathan Vexler)

> Any log format header types changes need to be ported to 0.16.0 from 1.x
> 
>
> Key: HUDI-7887
> URL: https://issues.apache.org/jira/browse/HUDI-7887
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in the 0.16.0 reader.
>  
> Port any new log header metadata types introduced in 1.x to 0.16.0.





[jira] [Assigned] (HUDI-7886) Make metadata payload from 1.x readable in 0.16.0

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7886:


Assignee: (was: Lokesh Jain)

> Make metadata payload from 1.x readable in 0.16.0
> -
>
> Key: HUDI-7886
> URL: https://issues.apache.org/jira/browse/HUDI-7886
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>
> We wanted to support reading 1.x tables in the 0.16.0 reader.
>  
> So, let's port over all metadata payload schema changes to 0.16.0.





[jira] [Commented] (HUDI-7882) RFC-78 for Bridge Release

2024-08-28 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877561#comment-17877561
 ] 

Vinoth Chandar commented on HUDI-7882:
--

I think we are discontinuing this approach in favor of 
https://issues.apache.org/jira/browse/HUDI-8076 

> RFC-78 for Bridge Release 
> --
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We have 4 major goals w/ this umbrella ticket. 
> a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for 
> all query types. 
> b. 0.16.x should be capable of reading 1.x tables for most features
> c. Upgrade 0.16.x to 1.x 
> d. Downgrade 1.x to 0.16.0. 
>  
>  
> We wanted to support reading 1.x tables in the 0.16.0 release. So, creating 
> this umbrella ticket to track all of them.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both JSON and Avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x 
> tables. https://issues.apache.org/jira/browse/HUDI-7889 





[jira] [Updated] (HUDI-7882) RFC-78 for Bridge Release

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7882:
-
Summary: RFC-78 for Bridge Release   (was: Umbrella ticket for 1.x tables 
and 0.16.x compatibility)

> RFC-78 for Bridge Release 
> --
>
> Key: HUDI-7882
> URL: https://issues.apache.org/jira/browse/HUDI-7882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0, 1.0.0
>
>
> We have 4 major goals w/ this umbrella ticket. 
> a. 1.x reader should be capable of reading any of 0.14.x to 0.16.x tables for 
> all query types. 
> b. 0.16.x should be capable of reading 1.x tables for most features
> c. Upgrade 0.16.x to 1.x 
> d. Downgrade 1.x to 0.16.0. 
>  
>  
> We wanted to support reading 1.x tables in the 0.16.0 release. So, creating 
> this umbrella ticket to track all of them.
>  
> RFC in progress: [https://github.com/apache/hudi/pull/11514] 
>  
> Changes required to be ported: 
> 0. Creating 0.16.0 branch
> 0.a https://issues.apache.org/jira/browse/HUDI-7860 Completed. 
>  
> 1. Timeline 
> 1.a Hoodie instant parsing should be able to read 1.x instants. 
> https://issues.apache.org/jira/browse/HUDI-7883 Sagar. 
> 1.b Commit metadata parsing is able to handle both JSON and Avro formats. 
> Scope might be non-trivial.  https://issues.apache.org/jira/browse/HUDI-7866  
> Siva.
> 1.c HoodieDefaultTimeline able to read both timelines based on table version. 
>  https://issues.apache.org/jira/browse/HUDI-7884 Siva.
> 1.d Reading LSM timeline using 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7890 Siva. 
> 1.e Ensure 1.0 MDT timeline is readable by 0.16 - HUDI-7901
>  
> 2. Table property changes 
> 2.a Table property changes https://issues.apache.org/jira/browse/HUDI-7885  
> https://issues.apache.org/jira/browse/HUDI-7865 LJ
>  
> 3. MDT table changes
> 3.a record positions to RLI https://issues.apache.org/jira/browse/HUDI-7877 LJ
> 3.b MDT payload schema changes. 
> https://issues.apache.org/jira/browse/HUDI-7886 LJ
>  
> 4. Log format changes
> 4.a All metadata header types porting 
> https://issues.apache.org/jira/browse/HUDI-7887 Jon
> 4.b Meaningful error for incompatible features from 1.x 
> https://issues.apache.org/jira/browse/HUDI-7888 Jon
>  
> 5. Log file slice or grouping detection compatibility 
>  
> 6. Tests 
> 6.a Tests to validate that 1.x tables can be read w/ 0.16.0 
> https://issues.apache.org/jira/browse/HUDI-7896 Siva and Sagar. 
>  
> 7. Doc changes 
> 7.a Call out unsupported features in the 0.16.0 reader when reading 1.x 
> tables. https://issues.apache.org/jira/browse/HUDI-7889 





[jira] [Created] (HUDI-8132) Implement support for writer to produce table version specific log blocks/format

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8132:


 Summary: Implement support for writer to produce table version 
specific log blocks/format
 Key: HUDI-8132
 URL: https://issues.apache.org/jira/browse/HUDI-8132
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Created] (HUDI-8131) Associate log format and block versions to table versions

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8131:


 Summary: Associate log format and block versions to table versions
 Key: HUDI-8131
 URL: https://issues.apache.org/jira/browse/HUDI-8131
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Created] (HUDI-8130) Introduce table version specific APIs for file name generation

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8130:


 Summary: Introduce table version specific APIs for file name 
generation
 Key: HUDI-8130
 URL: https://issues.apache.org/jira/browse/HUDI-8130
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Updated] (HUDI-8129) Bring back table version 6 archived timeline

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8129:
-
Summary: Bring back table version 6 archived timeline  (was: Bring back 
table version 6 active timeline)

> Bring back table version 6 archived timeline
> 
>
> Key: HUDI-8129
> URL: https://issues.apache.org/jira/browse/HUDI-8129
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Priority: Major
>






[jira] [Created] (HUDI-8129) Bring back table version 6 active timeline

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8129:


 Summary: Bring back table version 6 active timeline
 Key: HUDI-8129
 URL: https://issues.apache.org/jira/browse/HUDI-8129
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Created] (HUDI-8128) Bring back table version 6 active timeline

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8128:


 Summary: Bring back table version 6 active timeline
 Key: HUDI-8128
 URL: https://issues.apache.org/jira/browse/HUDI-8128
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Created] (HUDI-8127) Introduce active timeline implementation based on timeline layout version

2024-08-28 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8127:


 Summary: Introduce active timeline implementation based on 
timeline layout version
 Key: HUDI-8127
 URL: https://issues.apache.org/jira/browse/HUDI-8127
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinoth Chandar








[jira] [Updated] (HUDI-8103) Introduce table version specific grouping of table configs

2024-08-28 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8103:
-
Description: Based on the value of hoodie.table.version, we need to write a 
different set of default table configs, and potentially different default 
values. 

> Introduce table version specific grouping of table configs
> --
>
> Key: HUDI-8103
> URL: https://issues.apache.org/jira/browse/HUDI-8103
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> Based on the value of hoodie.table.version, we need to write a different set 
> of default table configs, and potentially different default values. 





[jira] [Created] (HUDI-8103) Introduce table version specific grouping of table configs

2024-08-19 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-8103:


 Summary: Introduce table version specific grouping of table configs
 Key: HUDI-8103
 URL: https://issues.apache.org/jira/browse/HUDI-8103
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 1.0.0








[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch. Writers/table services should fail if working with an
unsupported table version, without which one cannot start switching writers to
the new version (this may still need a minor release on the last 2-3 table
versions?).
h3. High level approach:

We need to introduce table version aware reading/writing inside the core layers
of Hudi, as well as in query engines like Spark/Flink.

To this effect, we need a HoodieStorageFormat abstraction that can cover the
following layers.

*Table Properties*

All table properties need to be table version aware.

*Timeline*

The timeline already has a timeline layout version, which can be extended to
write both older and newer timelines. The ArchivedTimeline can be old style or
LSM style, depending on the table version. Typically, we have only written the
timeline with the latest schema; we may have to version the .avsc files
themselves and write using table version specific Avro classes? Also, we need a
flexible mapping from the action names used in each table version, to handle
renames like replacecommit to cluster.

{*}TBD{*}: whether completion time based changes can be retained, assuming the
instant file creation timestamp is used.
h4. *FileSystemView*

Need to support file slice grouping based on the old/new timeline + file naming
combination. We also need abstractions for how files are named in each table
version, to handle things like log file name changes.
h4. *WriteHandle*

This layer may or may not need changes, as base files don't really change, and
the log format is already versioned (see next point). However, it's prudent to
ensure we have a mechanism for this, since there could be different ways of
encoding records or footers etc. (e.g. HFile k-v pairs, ...)
h4. Metadata table

Encoding of k-v pairs, their schemas, etc., and which partitions are supported
in which table versions. *TBD* to see if any of the code around the recent
simplification needs to be undone.
h4. LogFormat Reader/Writer

The blocks and the format itself are versioned, but they are not yet tied to
the overall table version. So we need these links, so that the reader can, for
example, decide how to read/assemble log files during scanning. FileGroupReader
abstractions may need to change as well.
h4. Table Service

This should largely be independent and built on the layers above, but some
cases exist, depending on whether the code relies on format specifics for
generating plans or execution. This may be a mix of code changes, acceptable
behavior changes, and format specific specializations.

TBD: to see if behavior like writing rollback block needs to be 

  was:
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for
the last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to
continue reading the table even when writers are upgraded, as long as the
writer produces storage compatible with the latest table version the reader can
read. The readers can then be rolling-upgraded easily, at different cadences,
without any tight coordination. Additionally, the reader should have the
ability to "dynamically" deduce the table version based on table properties,
such that when the writer is switched to the latest table version, subsequent
reads will just adapt and read it as the latest table version.

Operators still need to ensure all readers have the latest binary that supports
a given table version, before switching the writer to that version. Special
consideration goes to table services, which, as reader/writer processes, should
be able to manage the tables as well. Queries should gracefully fail during
table version switches and eventually start succeeding once the writer
completes the switch.

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for Hudi writer code to produce storage format for the last 
2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as writer produces 
storage compatible with the latest table version the reader can read. The 
readers can then be rolling upgraded easily, at different cadence without need 
for any tight co-ordination. Additionally, the reader should have ability to 
"dynamically" deduce table version based on table properties, such that when 
the writer is switched to the latest table version, subsequent reads will just 
adapt and read it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration to table services, as reader/writer processes, that should be 
able manage the tables as well. Queries should gracefully fail during table 
version switches and start eventually succeeding when writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to new version 
(this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
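For concreteness, here is a minimal sketch of what such a seam could look like; 
every type and method name below is an assumption for illustration, not the 
actual proposed API:

{code:java}
// Hypothetical shape of the HoodieStorageFormat seam; all names are assumed.
interface TimelineLayout {}
interface FileNamingPolicy {}
interface BlockEncoding {}

interface HoodieStorageFormat {
  int tableVersion();
  TimelineLayout timelineLayout();   // old vs. new timeline reading/writing
  FileNamingPolicy fileNaming();     // version-specific file/log naming
  BlockEncoding blockEncoding();     // version-specific record/footer encoding
}
{code}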

*Table Properties* 

All table properties need to be table version aware. 
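For illustration, "table version aware" could boil down to sniffing the version 
key out of hoodie.properties before choosing any read/write path. A minimal 
sketch (the key name and the fallback default are assumptions, not the actual 
HoodieTableConfig constants):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class TableVersionSniffer {
  // Assumed key name; the real constant lives in Hudi's table config.
  private static final String TABLE_VERSION_KEY = "hoodie.table.version";

  // Reads .hoodie/hoodie.properties and falls back to version 6 when absent.
  public static int deduceTableVersion(Path tableBasePath) throws IOException {
    Path propsFile = tableBasePath.resolve(".hoodie").resolve("hoodie.properties");
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(propsFile)) {
      props.load(in);
    }
    return Integer.parseInt(props.getProperty(TABLE_VERSION_KEY, "6"));
  }
}
{code}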



*Timeline* 

Timeline already has a timeline layout version, which can be extended to write 
the older and newer timeline. The ArchivedTimeline can be old-style or LSM-style, 
depending on the table version. Typically, we have only written the timeline with 
the latest schema; we may have to version the .avsc files themselves and write 
using the table-version-specific Avro class. We also need a flexible mapping from 
the action names used in each table version, to handle things like replacecommit 
vs. cluster. 

{*}TBD{*}: whether or not completion-time-based changes can be retained, assuming 
the instant file creation timestamp. 
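As a sketch of the flexible action-name mapping: only the replacecommit/cluster 
pair is taken from the text above, and the concrete action strings per version 
are assumptions:

{code:java}
import java.util.Map;

public final class ActionNameMapper {
  // Assumed: v6 timelines record clustering as "replacecommit", while a v8
  // timeline uses a dedicated "cluster" action; other actions pass through.
  private static final Map<Integer, Map<String, String>> LOGICAL_TO_TIMELINE =
      Map.of(6, Map.of("cluster", "replacecommit"),
             8, Map.of("cluster", "cluster"));

  public static String toTimelineAction(int tableVersion, String logicalAction) {
    return LOGICAL_TO_TIMELINE.getOrDefault(tableVersion, Map.of())
        .getOrDefault(logicalAction, logicalAction);
  }
}
{code}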
h4. *FileSystemView*

We need to support file slice grouping based on the old/new timeline + file 
naming combination. We also need abstractions for how files are named in each 
table version, to handle things like log file name changes.
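A rough sketch of such a naming abstraction follows; both name patterns below 
are placeholders, not the exact on-storage conventions:

{code:java}
public final class LogFileNamer {
  // Sketch only: the v6-style name is assumed to anchor the log file on the
  // base commit instant, while the 1.0-style name is assumed to use the delta
  // commit instant; the shared ".{fileId}_{instant}.log.{version}_{token}"
  // shape is a placeholder for the real patterns.
  public static String name(int tableVersion, String fileId, String baseInstant,
                            String deltaInstant, int logVersion, String writeToken) {
    String anchorInstant = (tableVersion <= 6) ? baseInstant : deltaInstant;
    return String.format(".%s_%s.log.%d_%s", fileId, anchorInstant, logVersion, writeToken);
  }
}
{code}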
h4. *WriteHandle*

This layer may or may not need changes, as base files don't really change and 
the log format is already versioned (see next point). However, it's prudent to 
ensure we have a mechanism for this, since there could be different ways of 
encoding records or footers etc. (e.g. HFile k-v pairs, ...) 
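The mechanism could be a small seam like the following, where every write handle 
asks for the encoding matching the table version it is producing (all names here 
are hypothetical, echoing the HoodieStorageFormat sketch above):

{code:java}
import java.util.List;
import java.util.Map;

// Hypothetical seam for version-specific record/footer encoding.
interface BlockEncoding {
  byte[] encodeRecords(List<byte[]> serializedRecords);
  Map<String, String> footerMetadata();
}

final class BlockEncodings {
  static BlockEncoding forTableVersion(int tableVersion) {
    // Assumption: v6 and v8 currently share the same data block encoding, so a
    // single implementation backs both until footers or HFile k-v layouts diverge.
    return new BlockEncoding() {
      @Override public byte[] encodeRecords(List<byte[]> recs) {
        int total = recs.stream().mapToInt(r -> r.length).sum();
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] r : recs) {
          System.arraycopy(r, 0, out, pos, r.length);
          pos += r.length;
        }
        return out;
      }
      @Override public Map<String, String> footerMetadata() {
        return Map.of();
      }
    };
  }
}
{code}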
h4. Metadata table 

Encoding of k-v pairs, their schemas etc., and which partitions are supported in 
which table versions. *TBD*: whether any of the code around the recent 
simplification needs to be undone. 
h4. LogFormat Reader/Writer

The log blocks and the log format itself are versioned, but they are not yet tied 
to the overall table version. So we need these links, so that the reader can, for 
example, decide how to read/assemble log files during scanning. FileGroupReader 
abstractions may need to change as well. 
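A minimal sketch of that missing link, mapping each table version to the log 
block/format version a reader should expect; the numeric log-format values are 
assumptions, not the real constants:

{code:java}
enum TableVersion {
  V6(6, 1),  // 0.14.x tables; log format version 1 is an assumed value
  V8(8, 2);  // 1.0 tables; log format version 2 is an assumed value

  final int versionCode;
  final int logFormatVersion;

  TableVersion(int versionCode, int logFormatVersion) {
    this.versionCode = versionCode;
    this.logFormatVersion = logFormatVersion;
  }

  // A FileGroupReader could resolve this once per table and use it to decide
  // how to read/assemble log files during scanning.
  static TableVersion fromCode(int code) {
    for (TableVersion v : values()) {
      if (v.versionCode == code) {
        return v;
      }
    }
    throw new IllegalArgumentException("Unsupported table version: " + code);
  }
}
{code}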
h4. Table Service

This should largely be independent and built on the layers above, but some cases 
exist, depending on whether the code relies on format specifics for generating 
plans or execution. This may be a mix of code changes, acceptable behavior 
changes, and format-specific specializations. 

TBD: whether behavior like writing the rollback block needs to be 

  was:
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as the writer produces 
storage compatible with the latest table version the reader can read. The readers 
can then be rolling-upgraded easily, at different cadences, without the need for 
any tight coordination. Additionally, the reader should have the ability to 
"dynamically" deduce the table version based on table properties, such that when 
the writer is switched to the latest table version, subsequent reads will just 
adapt and read it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching.

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as the writer produces 
storage compatible with the latest table version the reader can read. The readers 
can then be rolling-upgraded easily, at different cadences, without the need for 
any tight coordination. Additionally, the reader should have the ability to 
"dynamically" deduce the table version based on table properties, such that when 
the writer is switched to the latest table version, subsequent reads will just 
adapt and read it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 

Timeline : Timeline already has a timeline layout version, which can be extended 
to write the older and newer timeline. The ArchivedTimeline can be old-style or 
LSM-style, depending on the table version. 
{*}TBD{*}: whether or not completion-time-based changes can be retained, assuming 
the instant file creation timestamp. 

 

WriteHandle : This layer may or may not need changes, as base files don't really 
change and the log format is already versioned (see next point). However, it's 
prudent to ensure we have a mechanism for this, since there could be different 
ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 

Metadata table: Encoding of k-v pairs, their schemas etc., and which partitions 
are supported in which table versions. *TBD*: whether any of the code around the 
recent simplification needs to be undone. 

LogFormat Reader/Writer : The log blocks and the log format itself are versioned, 
but they are not yet tied to the overall table version. So we need these links, 
so that the reader can, for example, decide how to read/assemble log files during 
scanning. FileGroupReader abstractions may need to change as well. 

  was:
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as the writer produces 
storage compatible with the latest table version the reader can read. The readers 
can then be rolling-upgraded easily, at different cadences, without the need for 
any tight coordination. Additionally, the reader should have the ability to 
"dynamically" deduce the table version based on table properties, such that when 
the writer is switched to the latest table version, subsequent reads will just 
adapt and read it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. {*}TBD{*}: whether or 
not completion-time-based changes can be retained, assuming the instant file 
creation timestamp. 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
*(!) Work in Progress.* 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions (0.14/6, 1.0/7,8). This enables older readers to continue 
reading the table even when writers are upgraded, as long as the writer produces 
storage compatible with the latest table version the reader can read. The readers 
can then be rolling-upgraded easily, at different cadences, without the need for 
any tight coordination. Additionally, the reader should have the ability to 
"dynamically" deduce the table version based on table properties, such that when 
the writer is switched to the latest table version, subsequent reads will just 
adapt and read it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. {*}TBD{*}: whether or 
not completion-time-based changes can be retained, assuming the instant file 
creation timestamp. 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 
 # Metadata table: Encoding of k-v pairs, their schemas etc., and which 
partitions are supported in which table versions. *TBD*: whether any of the code 
around the recent simplification needs to be undone. 
 # LogFormat Reader/Writer : The log blocks and the log format itself are 
versioned, but they are not yet tied to the overall table version. So we need 
these links, so that the reader can, for example, decide how to read/assemble 
log files during scanning. FileGroupReader abstractions may need to change as 
well. 

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. {*}tbd{*}: whether or 
not completion-time-based changes 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. {*}tbd{*}: whether or 
not completion-time-based changes 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 
 # Metadata table: Encoding of k-v pairs, their schemas etc., and which 
partitions are supported in what table versions. *TBD*: to see if any of the 
code that

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. tbd: whether or not 
completion-time-based changes 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 
 # Metadata table: Encoding of k-v pairs, their schemas etc., and which 
partitions are supported in what table versions. TBD: to see if any of the code 
that


> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. Writers/table services should fail if working with an unsupported 
table version, without which one cannot start switching writers to the new 
version (this may still need a minor release on the last 2-3 table versions?)
h3. High level approach: 

We need to introduce table-version-aware reading/writing inside the core layers 
of Hudi, as well as in query engines like Spark/Flink. 

To this effect, we need a HoodieStorageFormat abstraction that can cover the 
following layers. 
 # Timeline : Timeline already has a timeline layout version, which can be 
extended to write the older and newer timeline. The ArchivedTimeline can be 
old-style or LSM-style, depending on the table version. tbd: whether or not 
completion-time-based changes 
 # WriteHandle : This layer may or may not need changes, as base files don't 
really change and the log format is already versioned (see next point). However, 
it's prudent to ensure we have a mechanism for this, since there could be 
different ways of encoding records or footers etc. (e.g. HFile k-v pairs, ...) 
 # Metadata table: Encoding of k-v pairs, their schemas etc., and which 
partitions are supported in what table versions. TBD: to see if any of the code 
that

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. 
h3. High level approach: 


> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> (!) Work in Progress. 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for 
> the last 2-3 table versions. This enables older readers to continue reading 
> the table even when writers are upgraded, as long as the writer produces 
> storage compatible with the latest table version the reader can read. The 
> readers can then be rolling-upgraded easily, at different cadences, without 
> the need for any tight coordination. Additionally, the reader should have the 
> ability to "dynamically" deduce the table version based on table properties, 
> such that when the writer is switched to the latest table version, subsequent 
> reads will just adapt and read it as the latest table version. 
> Operators still need to ensure all readers have the latest binary that 
> supports a given table version, before switching the writer to that version. 
> Special consideration goes to table services, as reader/writer processes, 
> which should be able to manage the tables as well.

[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to the latest table version, subsequent reads will just adapt and read 
it as the latest table version. 

Operators still need to ensure all readers have the latest binary that supports 
a given table version, before switching the writer to that version. Special 
consideration goes to table services, as reader/writer processes, which should 
be able to manage the tables as well. Queries should gracefully fail during 
table version switches and eventually start succeeding when the writer completes 
switching. 
h3. High level approach: 

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to 
h3. High level approach: 


> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> (!) Work in Progress. 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for 
> the last 2-3 table versions. This enables older readers to continue reading 
> the table even when writers are upgraded, as long as the writer produces 
> storage compatible with the latest table version the reader can read. The 
> readers can then be rolling-upgraded easily, at different cadences, without 
> the need for any tight coordination. Additionally, the reader should have the 
> ability to "dynamically" deduce the table version based on table properties, 
> such that when the writer is switched to the latest table version, subsequent 
> reads will just adapt and read it as the latest table version. 
> Operators still need to ensure all readers have the latest binary that 
> supports a given table version, before switching the writer to that version. 
> Special consideration goes to table services, as reader/writer processes, 
> which should be able to manage the tables as well. Queries should gracefully 
> fail during table version switches and eventually start succeeding when the 
> writer completes switching. 
> h3. High level approach: 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. The readers can then be 
rolling-upgraded easily, at different cadences, without the need for any tight 
coordination. Additionally, the reader should have the ability to "dynamically" 
deduce the table version based on table properties, such that when the writer is 
switched to 
h3. High level approach: 

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. Additionally, the reader 
should have the ability to "dynamically" deduce the table version 
h3. High level approach: 


> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> (!) Work in Progress. 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for 
> the last 2-3 table versions. This enables older readers to continue reading 
> the table even when writers are upgraded, as long as the writer produces 
> storage compatible with the latest table version the reader can read. The 
> readers can then be rolling-upgraded easily, at different cadences, without 
> the need for any tight coordination. Additionally, the reader should have the 
> ability to "dynamically" deduce the table version based on table properties, 
> such that when the writer is switched to 
> h3. High level approach: 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded, as long as the writer produces storage compatible 
with the latest table version the reader can read. Additionally, the reader 
should have the ability to "dynamically" deduce the table version 
h3. High level approach: 

  was:
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded. 
h3. High level approach: 


> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> (!) Work in Progress. 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for 
> the last 2-3 table versions. This enables older readers to continue reading 
> the table even when writers are upgraded, as long as the writer produces 
> storage compatible with the latest table version the reader can read. 
> Additionally, the reader should have the ability to "dynamically" deduce the 
> table version 
> h3. High level approach: 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-8076:
-
Description: 
(!) Work in Progress. 
h3. Basic Idea: 

Introduce support for the Hudi writer code to produce the storage format for the 
last 2-3 table versions. This enables older readers to continue reading the table 
even when writers are upgraded. 
h3. High level approach: 

> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> (!) Work in Progress. 
> h3. Basic Idea: 
> Introduce support for the Hudi writer code to produce the storage format for 
> the last 2-3 table versions. This enables older readers to continue reading 
> the table even when writers are upgraded. 
> h3. High level approach: 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874995#comment-17874995
 ] 

Vinoth Chandar commented on HUDI-8076:
--

I think we can reuse RFC-78, unless we plan to support both a bridge release and 
writing older table versions. 

 

> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-8076) RFC for backwards compatible writer mode in Hudi 1.0

2024-08-16 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-8076:


Assignee: Vinoth Chandar

> RFC for backwards compatible writer mode in Hudi 1.0
> 
>
> Key: HUDI-8076
> URL: https://issues.apache.org/jira/browse/HUDI-8076
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7585) Avoid reading log files for resolving schema for _hoodie_operation field

2024-05-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7585:


Assignee: Danny Chen

> Avoid reading log files for resolving schema for _hoodie_operation field
> 
>
> Key: HUDI-7585
> URL: https://issues.apache.org/jira/browse/HUDI-7585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6713) Redesign CDC workload to include partition column for partition pruning

2024-05-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-6713:


Assignee: Danny Chen  (was: Vinoth Chandar)

> Redesign CDC workload to include partition column for partition pruning
> ---
>
> Key: HUDI-6713
> URL: https://issues.apache.org/jira/browse/HUDI-6713
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Aditya Goenka
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, our CDC columns have all the columns nested under before and after. 
> Redesign the schema to pull the partition column up to the top level, to 
> support partition pruning on the read side.
>  
> Github issue - [https://github.com/apache/hudi/issues/9426]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6778) Track schema in metadata table

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6778:
-
Status: In Progress  (was: Open)

> Track schema in metadata table
> --
>
> Key: HUDI-6778
> URL: https://issues.apache.org/jira/browse/HUDI-6778
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates

2024-05-06 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844069#comment-17844069
 ] 

Vinoth Chandar commented on HUDI-7234:
--

This is to be handled in 1.1.0, along with partial update encoding across records.

> Handle both inserts and updates in log blocks for partial updates
> -
>
> Key: HUDI-7234
> URL: https://issues.apache.org/jira/browse/HUDI-7234
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Inserts can be written to log blocks, e.g., by Flink. We need to handle such a 
> case for partial updates, i.e., a mix of inserts and partial updates in the 
> same data block. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7234:
-
Status: Open  (was: In Progress)

> Handle both inserts and updates in log blocks for partial updates
> -
>
> Key: HUDI-7234
> URL: https://issues.apache.org/jira/browse/HUDI-7234
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Inserts can be written to log blocks, e.g., by Flink. We need to handle such a 
> case for partial updates, i.e., a mix of inserts and partial updates in the 
> same data block. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7234:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Handle both inserts and updates in log blocks for partial updates
> -
>
> Key: HUDI-7234
> URL: https://issues.apache.org/jira/browse/HUDI-7234
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Inserts can be written to log blocks, e.g., by Flink. We need to handle such a 
> case for partial updates, i.e., a mix of inserts and partial updates in the 
> same data block. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7541:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Ensure extensibility to new indexes - vectors, search and other formats (CLP, 
> unstructured data)
> 
>
> Key: HUDI-7541
> URL: https://issues.apache.org/jira/browse/HUDI-7541
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)

2024-05-06 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844068#comment-17844068
 ] 

Vinoth Chandar commented on HUDI-7541:
--

Punting this to 1.1

> Ensure extensibility to new indexes - vectors, search and other formats (CLP, 
> unstructured data)
> 
>
> Key: HUDI-7541
> URL: https://issues.apache.org/jira/browse/HUDI-7541
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-7679.

Resolution: Duplicate

> Ensure extensibility to unstructured data, logs (CLP), vectors, other index 
> types
> -
>
> Key: HUDI-7679
> URL: https://issues.apache.org/jira/browse/HUDI-7679
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Fix Version/s: (was: 1.1.0)

> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is 
> a debezium-style changelog (currently supported for CoW for Spark/Flink). 
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
>  
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes incorporated to support existing users/usage of changelog. 
> The CDC format is more generalized in the database world. It offers advantages 
> like not requiring further downstream processing to, say, stitch together +U 
> and -U to update a downstream table; for example, if a field that changed is a 
> key in a downstream table, we need both +U and -U to compute the updates. 
>  
> (A) Introduce a new "changelog" output mode for CDC queries, which generates 
> the I,+U,-U,D format that changelog needs (this can be constructed easily by 
> processing the output of a CDC query as follows; a sketch appears after this 
> quoted description): 
>  * when before is `null`, emit I
>  * when after is `null`, emit D
>  * when both are non-null, emit two records, +U and -U
> (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
> publishing to the _hoodie_operation field. 
>  # This means anyone querying this field using a snapshot query will break.
>  # We will bring this back in 1.1 etc., based on user feedback, as a hidden 
> field in the FlinkCatalog.
> (C) To support backwards compatibility, we fall back to reading 
> `_hoodie_operation` in 0.X tables. 
> For CDC reads, we first use the CDC log if it is available for that file 
> slice. If not, and the base file schema already has {{_hoodie_operation}}, we 
> fall back to reading {{_hoodie_operation}} from the base file if 
> mode=OP_KEY_ONLY. Throw an error for other modes. 
> (D) Snapshot queries from Spark, Presto, Trino etc. all work with tables that 
> have `_hoodie_operation` published. 
> This is already completed for Spark, so others should be easy to do. 
>  
> (E) We need to complete a review of the CDC schema: 
> ts - should it be completion time or instant time?
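
For what it's worth, rule (A) above reduces to a few lines. A sketch follows; 
the record/field names are made up for illustration, and the output ordering of 
the two update records is an assumption:

{code:java}
import java.util.List;

final class CdcToChangelog {
  record CdcRow(Object before, Object after) {}
  record Event(String op, Object row) {}

  // Rule (A): before == null -> I, after == null -> D, both present -> +U/-U.
  static List<Event> toChangelog(CdcRow cdc) {
    if (cdc.before() == null) {
      return List.of(new Event("I", cdc.after()));
    }
    if (cdc.after() == null) {
      return List.of(new Event("D", cdc.before()));
    }
    // Assumed ordering: retraction (-U) before the new image (+U).
    return List.of(new Event("-U", cdc.before()), new Event("+U", cdc.after()));
  }
}
{code}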



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-7679.
--

> Ensure extensibility to unstructured data, logs (CLP), vectors, other index 
> types
> -
>
> Key: HUDI-7679
> URL: https://issues.apache.org/jira/browse/HUDI-7679
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7234:
-
Status: In Progress  (was: Open)

> Handle both inserts and updates in log blocks for partial updates
> -
>
> Key: HUDI-7234
> URL: https://issues.apache.org/jira/browse/HUDI-7234
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Inserts can be written to log blocks, e.g., by Flink. We need to handle such a 
> case for partial updates, i.e., a mix of inserts and partial updates in the 
> same data block. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7541) Ensure extensibility to new indexes - vectors, search and other formats (CLP, unstructured data)

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7541:
-
Status: In Progress  (was: Open)

> Ensure extensibility to new indexes - vectors, search and other formats (CLP, 
> unstructured data)
> 
>
> Key: HUDI-7541
> URL: https://issues.apache.org/jira/browse/HUDI-7541
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7679) Ensure extensibility to unstructured data, logs (CLP), vectors, other index types

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7679:
-
Status: In Progress  (was: Open)

> Ensure extensibility to unstructured data, logs (CLP), vectors, other index 
> types
> -
>
> Key: HUDI-7679
> URL: https://issues.apache.org/jira/browse/HUDI-7679
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-06 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Fix Version/s: 1.1.0

> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.1.0, 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is 
> a debezium-style changelog (currently supported for CoW for Spark/Flink). 
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
>  
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes incorporated to support existing users/usage of changelog. 
> The CDC format is more generalized in the database world. It offers advantages 
> like not requiring further downstream processing to, say, stitch together +U 
> and -U to update a downstream table; for example, if a field that changed is a 
> key in a downstream table, we need both +U and -U to compute the updates. 
>  
> (A) Introduce a new "changelog" output mode for CDC queries, which generates 
> the I,+U,-U,D format that changelog needs (this can be constructed easily by 
> processing the output of a CDC query as follows): 
>  * when before is `null`, emit I
>  * when after is `null`, emit D
>  * when both are non-null, emit two records, +U and -U
> (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
> publishing to the _hoodie_operation field. 
>  # This means anyone querying this field using a snapshot query will break.
>  # We will bring this back in 1.1 etc., based on user feedback, as a hidden 
> field in the FlinkCatalog.
> (C) To support backwards compatibility, we fall back to reading 
> `_hoodie_operation` in 0.X tables. 
> For CDC reads, we first use the CDC log if it is available for that file 
> slice. If not, and the base file schema already has {{_hoodie_operation}}, we 
> fall back to reading {{_hoodie_operation}} from the base file if 
> mode=OP_KEY_ONLY. Throw an error for other modes. 
> (D) Snapshot queries from Spark, Presto, Trino etc. all work with tables that 
> have `_hoodie_operation` published. 
> This is already completed for Spark, so others should be easy to do. 
>  
> (E) We need to complete a review of the CDC schema: 
> ts - should it be completion time or instant time?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-7540) Check for gaps on storing inserts on log files

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-7540.
--

> Check for gaps on storing inserts on log files
> --
>
> Key: HUDI-7540
> URL: https://issues.apache.org/jira/browse/HUDI-7540
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7234) Handle both inserts and updates in log blocks for partial updates

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7234:
-
Description: Inserts can be written to log blocks, e.g., by Flink. We need to 
handle such a case for partial updates, i.e., a mix of inserts and partial 
updates in the same data block.  (was: Inserts can be written to log blocks, 
e.g., by Flink. We need to handle such a case for partial updates.)

> Handle both inserts and updates in log blocks for partial updates
> -
>
> Key: HUDI-7234
> URL: https://issues.apache.org/jira/browse/HUDI-7234
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Inserts can be written to log blocks, e.g., by Flink. We need to handle such a 
> case for partial updates, i.e., a mix of inserts and partial updates in the 
> same data block. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7540) Check for gaps on storing inserts on log files

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-7540.

Resolution: Invalid

> Check for gaps on storing inserts on log files
> --
>
> Key: HUDI-7540
> URL: https://issues.apache.org/jira/browse/HUDI-7540
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7229) Enable partial updates for CDC work payload

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7229:
-
Description: 
OLTP workloads on upstream databases often update/delete/insert different 
columns of the table on each operation. Currently, Hudi can only support partial 
updates in cases where the same columns are being mutated in a given write to 
Hudi (e.g. Spark SQL ETLs with MIT or UPDATE statements). Here, we explore what 
it takes to support a smarter storage format that encodes only the changed 
columns into the log (a rough shape is sketched after the goals below), along 
with the different implementations.
h2. Goals
 # Enable partial update functionality for all existing and potential future 
CDC workloads without huge modification or duplication.
 # Performance parity with current full-record updates or partial updates 
across the same set of columns.
 # Exhibit a reduction in storage costs, by storing only the changed columns.
 # Should also result in computation cost reductions by scanning/processing 
less data.
 # Should not affect the scalability of the existing ingestion system. The 
number of files generated for partial updates should not increase dramatically.
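
One possible shape for such a format stores only the mutated columns plus a 
bitmap over the schema's field ordinals. This is entirely illustrative, not the 
proposed encoding:

{code:java}
import java.util.BitSet;
import java.util.Map;

final class PartialUpdateEntry {
  final String recordKey;
  final BitSet changedFields;         // bit i set => field ordinal i changed
  final Map<Integer, Object> values;  // ordinal -> new value, changed fields only

  PartialUpdateEntry(String recordKey, BitSet changedFields, Map<Integer, Object> values) {
    this.recordKey = recordKey;
    this.changedFields = changedFields;
    this.values = values;
  }

  // Merging applies only the changed ordinals onto the current row image.
  void applyTo(Object[] currentRow) {
    for (int i = changedFields.nextSetBit(0); i >= 0; i = changedFields.nextSetBit(i + 1)) {
      currentRow[i] = values.get(i);
    }
  }
}
{code}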

 

  was:DMS, Debezium, etc.


> Enable partial updates for CDC work payload
> ---
>
> Key: HUDI-7229
> URL: https://issues.apache.org/jira/browse/HUDI-7229
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lin Liu
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> OLTP workloads on upstream databases often update/delete/insert different 
> columns of the table on each operation. Currently, Hudi can only support 
> partial updates in cases where the same columns are being mutated in a given 
> write to Hudi (e.g. Spark SQL ETLs with MIT or UPDATE statements). Here, we 
> explore what it takes to support a smarter storage format that encodes only 
> the changed columns into the log, along with the different implementations.
> h2. Goals
>  # Enable partial update functionality for all existing and potential future 
> CDC workloads without huge modification or duplication.
>  # Performance parity with current full-record updates or partial updates 
> across the same set of columns.
>  # Exhibit a reduction in storage costs, by storing only the changed columns.
>  # Should also result in computation cost reductions by scanning/processing 
> less data.
>  # Should not affect the scalability of the existing ingestion system. The 
> number of files generated for partial updates should not increase 
> dramatically.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7229) Enable partial updates for CDC work payload

2024-05-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843110#comment-17843110
 ] 

Vinoth Chandar commented on HUDI-7229:
--

Punting this to 1.1. 

 # [1.1] Implement support on top of data blocks.
 ## We need to pass changed-column information and the operation all the way to 
write handles, using a field in HoodieRecord.
 ## ... 
 # [1.1] Implement support on top of CDC data blocks.
 ## We can track similar bitmaps for CDC data blocks as well.
 ## We need to extend the new file group reader to also merge base and CDC 
blocks (not just base and data blocks).

> Enable partial updates for CDC work payload
> ---
>
> Key: HUDI-7229
> URL: https://issues.apache.org/jira/browse/HUDI-7229
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lin Liu
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> OLTP workloads on upstream databases, often update/delete/insert different 
> columns in the table on each operation. Currently, Hudi can only supporting 
> partial updates in cases where the same columns are being mutated in a given 
> write to Hudi (e.g Spark SQL ETLs with MIT or Update statements). Here, we 
> explore what it takes to support a smarter storage format, that can only 
> encode the changed columns into log along with the different implementations.
> h2. Goals
>  # Enable partial update functionality for all existing and potential future 
> CDC workloads without huge modification or duplication.
>  # Performance parity with current full-record updates or partial updates 
> across the same set of columns
>  # Exhibit reduction in storage costs, by only storing the changed columns.
>  # Should also result in computation cost reductions by scanning/processing 
> less data.
>  # Should not affect the scalability of the existing ingestion system. 
> The number of files generated for partial update should not increase 
> dramatically.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7229) Enable partial updates for CDC work payload

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7229:
-
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Enable partial updates for CDC work payload
> ---
>
> Key: HUDI-7229
> URL: https://issues.apache.org/jira/browse/HUDI-7229
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lin Liu
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> DMS, Debezium, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7671) Make Hudi timeline backward compatible

2024-05-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843109#comment-17843109
 ] 

Vinoth Chandar commented on HUDI-7671:
--

balaji - this may be a dupe. 

> Make Hudi timeline backward compatible
> --
>
> Key: HUDI-7671
> URL: https://issues.apache.org/jira/browse/HUDI-7671
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: compatibility
> Fix For: 1.0.0
>
>
> Since the 1.x release, the timeline metadata file name has changed to include 
> the completion time, so we need to keep compatibility for 0.x branches/releases.
> 0.x meta file name pattern: ${instant_time}.action[.state]
> 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state].
> In the 1.x release, while deciphering a Hudi instant from the metadata files, if 
> there is no completion time, use the file modification time as the 
> completion time instead (see the sketch below).
> The modification time follows the OCC concurrency control semantics if the 
> files were not moved around.
> Caution: if the table is a MOR table and the files were moved in history 
> from an old folder to the current folder, the reader view may present a wrong 
> result set because the completion times are exactly the same for all the 
> live instants.
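> A minimal sketch of the fallback described above, assuming a hypothetical 
> parser (not the actual Hudi implementation):
> {code:java}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> class InstantNameParser {
>   // 1.x: ${instant_time}_${completion_time}.action[.state]
>   static final Pattern V1 = Pattern.compile("(\\d+)_(\\d+)\\.(\\w+)(?:\\.(\\w+))?");
>   // 0.x: ${instant_time}.action[.state]
>   static final Pattern V0 = Pattern.compile("(\\d+)\\.(\\w+)(?:\\.(\\w+))?");
> 
>   // Returns {instantTime, completionTime}, falling back to the file
>   // modification time when the name carries no completion time.
>   static String[] instantAndCompletionTime(String fileName, long modificationTime) {
>     Matcher m1 = V1.matcher(fileName);
>     if (m1.matches()) {
>       return new String[] {m1.group(1), m1.group(2)};
>     }
>     Matcher m0 = V0.matcher(fileName);
>     if (m0.matches()) {
>       return new String[] {m0.group(1), String.valueOf(modificationTime)};
>     }
>     throw new IllegalArgumentException("Unrecognized timeline file: " + fileName);
>   }
> }
> {code}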



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7671) Make Hudi timeline backward compatible

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7671:
-
Epic Link: HUDI-6242

> Make Hudi timeline backward compatible
> --
>
> Key: HUDI-7671
> URL: https://issues.apache.org/jira/browse/HUDI-7671
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: compatibility
> Fix For: 1.0.0
>
>
> Since the 1.x release, the timeline metadata file name has changed to include 
> the completion time, so we need to keep compatibility for 0.x branches/releases.
> 0.x meta file name pattern: ${instant_time}.action[.state]
> 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state].
> In the 1.x release, while deciphering a Hudi instant from the metadata files, if 
> there is no completion time, use the file modification time as the 
> completion time instead.
> The modification time follows the OCC concurrency control semantics if the 
> files were not moved around.
> Caution: if the table is a MOR table and the files were moved in history 
> from an old folder to the current folder, the reader view may present a wrong 
> result set because the completion times are exactly the same for all the 
> live instants.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7671) Make Hudi timeline backward compatible

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7671:


Assignee: Balaji Varadarajan  (was: Danny Chen)

> Make Hudi timeline backward compatible
> --
>
> Key: HUDI-7671
> URL: https://issues.apache.org/jira/browse/HUDI-7671
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: compatibility
> Fix For: 1.0.0
>
>
> Since the 1.x release, the timeline metadata file name has changed to include 
> the completion time, so we need to keep compatibility for 0.x branches/releases.
> 0.x meta file name pattern: ${instant_time}.action[.state]
> 1.x meta file name pattern: ${instant_time}_${completion_time}.action[.state].
> In the 1.x release, while deciphering a Hudi instant from the metadata files, if 
> there is no completion time, use the file modification time as the 
> completion time instead.
> The modification time follows the OCC concurrency control semantics if the 
> files were not moved around.
> Caution: if the table is a MOR table and the files were moved in history 
> from an old folder to the current folder, the reader view may present a wrong 
> result set because the completion times are exactly the same for all the 
> live instants.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7678) Finalize the Merger APIs and make a plan for moving over all existing built-in, custom payloads.

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7678:
-
Description: 
With the move towards making partial updates a first-class citizen that does 
not need any special payloads/merges, we need to move the CDC payloads to all 
be transformers in the Hudi Streamer and SQL write paths, along with migration 
instructions for users. 
 # partial update has been implemented for the Spark SQL source as follows 
(see the sketch after this list):
 ## Configuration {{hoodie.write.partial.update.schema}} is used for partial 
update.
 ## {{ExpressionPayload}} creates the writer schema based on the configuration.
 ## {{HoodieAppendHandle}} creates the log file based on the configuration and 
the corresponding partial schema.
 ## Currently this handle assumes these records are all update records.
 ## We need to understand if ExpressionPayload/SQL Merger is needed going 
forward. 
 # For DeltaStreamer, our goal is to remove all siloed CDC payloads, e.g., 
Debezium or AWSDMS, and to provide CDC data as the {{InternalRow}} type. Therefore,
 ## The {{transformer}} in DeltaStreamer prepares the data according to the 
types of the sources.
 ## Initially, it's okay to just support full row updates/deletes/... 
 # Audit that all of them properly combine I/U/D into data and delete blocks, 
such that U-after-D and D-after-U scenarios are handled as expected.
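
As a rough illustration of step 1 above, the partial writer schema could be 
derived from the table schema roughly as follows (a sketch only; retaining the 
{{_hoodie_}} meta columns and the helper names are assumptions, not the actual 
implementation):
{code:java}
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.avro.Schema;

final class PartialSchemaUtil {
  // Keep only the changed columns (plus meta columns) from the table schema.
  static Schema partialSchema(Schema tableSchema, Set<String> changedColumns) {
    List<Schema.Field> fields = tableSchema.getFields().stream()
        .filter(f -> changedColumns.contains(f.name()) || f.name().startsWith("_hoodie_"))
        .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
        .collect(Collectors.toList());
    return Schema.createRecord(tableSchema.getName(), tableSchema.getDoc(),
        tableSchema.getNamespace(), false, fields);
  }
}
{code}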

  was:
With the move towards making partial updates a first-class citizen that does 
not need any special payloads/merges, we need to move the CDC payloads to all 
be transformers in the Hudi Streamer and SQL write paths, along with migration 
instructions for users. 


 # partial update has been implemented for the Spark SQL source as follows:
 ## Configuration {{hoodie.write.partial.update.schema}} is used for partial 
update.
 ## {{ExpressionPayload}} creates the writer schema based on the configuration.
 ## {{HoodieAppendHandle}} creates the log file based on the configuration and 
the corresponding partial schema.
 ## Currently this handle assumes these records are all update records.
 ## We need to understand if ExpressionPayload/SQL Merger is needed going 
forward. 
 # For DeltaStreamer, our goal is to remove all siloed CDC payloads, e.g., 
Debezium or AWSDMS, and to provide CDC data as the {{InternalRow}} type. Therefore,
 ## The {{transformer}} in DeltaStreamer prepares the data according to the 
types of the sources.
 ## Initially, it's okay to just support full row updates/deletes/... 


> Finalize the Merger APIs and make a plan for moving over all existing 
> built-in, custom payloads.
> 
>
> Key: HUDI-7678
> URL: https://issues.apache.org/jira/browse/HUDI-7678
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> With the move towards making partial updates a first-class citizen that does 
> not need any special payloads/merges, we need to move the CDC payloads to all 
> be transformers in the Hudi Streamer and SQL write paths, along with migration 
> instructions for users. 
>  # partial update has been implemented for the Spark SQL source as follows:
>  ## Configuration {{hoodie.write.partial.update.schema}} is used for 
> partial update.
>  ## {{ExpressionPayload}} creates the writer schema based on the 
> configuration.
>  ## {{HoodieAppendHandle}} creates the log file based on the configuration 
> and the corresponding partial schema.
>  ## Currently this handle assumes these records are all update records.
>  ## We need to understand if ExpressionPayload/SQL Merger is needed going 
> forward. 
>  # For DeltaStreamer, our goal is to remove all siloed CDC payloads, e.g., 
> Debezium or AWSDMS, and to provide CDC data as the {{InternalRow}} type. 
> Therefore,
>  ## The {{transformer}} in DeltaStreamer prepares the data according to the 
> types of the sources.
>  ## Initially, it's okay to just support full row updates/deletes/... 
>  # Audit that all of them properly combine I/U/D into data and delete 
> blocks, such that U-after-D and D-after-U scenarios are handled as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7678) Finalize the Merger APIs and make a plan for moving over all existing built-in, custom payloads.

2024-05-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7678:
-
Description: 
With the move towards making partial updates a first-class citizen that does 
not need any special payloads/merges, we need to move the CDC payloads to all 
be transformers in the Hudi Streamer and SQL write paths, along with migration 
instructions for users. 


 # partial update has been implemented for the Spark SQL source as follows:
 ## Configuration {{hoodie.write.partial.update.schema}} is used for partial 
update.
 ## {{ExpressionPayload}} creates the writer schema based on the configuration.
 ## {{HoodieAppendHandle}} creates the log file based on the configuration and 
the corresponding partial schema.
 ## Currently this handle assumes these records are all update records.
 ## We need to understand if ExpressionPayload/SQL Merger is needed going 
forward. 
 # For DeltaStreamer, our goal is to remove all siloed CDC payloads, e.g., 
Debezium or AWSDMS, and to provide CDC data as the {{InternalRow}} type. Therefore,
 ## The {{transformer}} in DeltaStreamer prepares the data according to the 
types of the sources.
 ## Initially, it's okay to just support full row updates/deletes/... 

> Finalize the Merger APIs and make a plan for moving over all existing 
> built-in, custom payloads.
> 
>
> Key: HUDI-7678
> URL: https://issues.apache.org/jira/browse/HUDI-7678
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 1.0.0
>
>
> With the move towards making partial updates a first-class citizen that does 
> not need any special payloads/merges, we need to move the CDC payloads to all 
> be transformers in the Hudi Streamer and SQL write paths, along with migration 
> instructions for users. 
>  # partial update has been implemented for the Spark SQL source as follows:
>  ## Configuration {{hoodie.write.partial.update.schema}} is used for 
> partial update.
>  ## {{ExpressionPayload}} creates the writer schema based on the 
> configuration.
>  ## {{HoodieAppendHandle}} creates the log file based on the configuration 
> and the corresponding partial schema.
>  ## Currently this handle assumes these records are all update records.
>  ## We need to understand if ExpressionPayload/SQL Merger is needed going 
> forward. 
>  # For DeltaStreamer, our goal is to remove all siloed CDC payloads, e.g., 
> Debezium or AWSDMS, and to provide CDC data as the {{InternalRow}} type. 
> Therefore,
>  ## The {{transformer}} in DeltaStreamer prepares the data according to the 
> types of the sources.
>  ## Initially, it's okay to just support full row updates/deletes/... 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7665) Rolling upgrade of 1.0

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7665:
-
Description: 
We need to update the table version due to the format changes in 1.0.


| * Plan to get 1.x readers to read 0.x tables
 * Rollout plan for users
 *  [Writer side] Migrating log files, timeline, metadata
 * Migrating table properties including key generators, payloads.
 * Supporting all the existing payloads or migrating them
 * Call out breaking changes
 * Call out behavior changes|

  was:We need to update the table version due to the format changes in 1.0.


> Rolling upgrade of 1.0 
> ---
>
> Key: HUDI-7665
> URL: https://issues.apache.org/jira/browse/HUDI-7665
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> We need to update the table version due to the format changes in 1.0.
> | * Plan to get 1.x readers to read 0.x tables
>  * Rollout plan for users
>  *  [Writer side] Migrating log files, timeline, metadata
>  * Migrating table properties including key generators, payloads.
>  * Supporting all the existing payloads or migrating them
>  * Call out breaking changes
>  * Call out behavior changes|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7665) Rolling upgrade of 1.0

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-7665:


Assignee: Balaji Varadarajan

> Rolling upgrade of 1.0 
> ---
>
> Key: HUDI-7665
> URL: https://issues.apache.org/jira/browse/HUDI-7665
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> We need to update the table version due to the format changes in 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7665) Rolling upgrade of 1.0

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7665:
-
Summary: Rolling upgrade of 1.0   (was: Upgrade Table Version)

> Rolling upgrade of 1.0 
> ---
>
> Key: HUDI-7665
> URL: https://issues.apache.org/jira/browse/HUDI-7665
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> We need to update the table version due to the format changes in 1.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Description: 
For the sake of more consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
Debezium-style change log (currently supported for CoW for Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource 
Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge onto "CDC" as the path going forward, with the 
following changes incorporated to support existing users/usage of 
changelog. The CDC format is more generalized in the database world. It offers 
advantages like not requiring further downstream processing to, say, stitch 
together +U and -U to update a downstream table. For example, a field that changed 
may be a key in a downstream table, so we need both +U and -U to compute the 
updates. 

 

(A) Introduce a new "changelog" output mode for CDC queries, which generates 
the I/+U/-U/D format that changelog needs (this can be constructed easily by 
processing the output of a CDC query as follows; a sketch follows the list):
 * when before is `null`, emit I
 * when after is `null`, emit D
 * when both are non-null, emit two records, +U and -U
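
A minimal sketch of this mapping (engine-neutral; the row-kind strings stand in 
for the engine's own types, e.g. Flink's RowKind):
{code:java}
import java.util.function.BiConsumer;

final class ChangelogEmitter {
  // "before"/"after" are the two images from a CDC read.
  static void emit(Object before, Object after, BiConsumer<String, Object> out) {
    if (before == null) {
      out.accept("+I", after);   // insert
    } else if (after == null) {
      out.accept("-D", before);  // delete
    } else {
      out.accept("-U", before);  // retract the old image
      out.accept("+U", after);   // emit the new image
    }
  }
}
{code}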

(B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
publishing the _hoodie_operation field. 
 # this means anyone querying this field using a snapshot query will break.
 # we will bring this back in 1.1 etc., based on user feedback, as a hidden field 
in the FlinkCatalog.

(C) To support backwards compatibility, we fall back to reading 
`_hoodie_operation` in 0.x tables. 

For CDC reads, we first use the CDC log if it is available for that file slice. 
If not, and the base file schema already has {{_hoodie_operation}}, we fall back 
to reading {{_hoodie_operation}} from the base file if mode=OP_KEY_ONLY. Throw 
an error for other modes. 



(D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that 
have `_hoodie_operation` published. 

 This is already completed for Spark, so others should be easy to do. 

 

(E) We need to complete a review of the CDC schema.

ts - should it be the completion time or the instant time?

 

 

  was:
For the sake of more consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
Debezium-style change log (currently supported for CoW for Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource 
Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge onto "CDC" as the path going forward, with the 
following changes to incorporate. 

 

 

 

 

 


> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
> Debezium-style change log (currently supported for CoW for Spark/Flink).
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource 
> Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
> Debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes incorporated to support existing users/usage of 
> changelog. The CDC format is more generalized in the database world. It offers 
> advantages like not requiring further downstream processing to, say, stitch 
> together +U and -U to update a downstream table. For example, a field that 
> changed may be a key in a downstream table, so we need both +U and -U to compute 
> the updates. 
>  
> (A) Introduce a new "changelog" output mode for CDC queries, which generates 
> the I/+U/-U/D format that changelog needs (this can be constructed easily by 
> processing the output of a CDC query as follows):
>  * when before is `null`, emit I
>  * when after is `null`, emit D
>  * when both are non-null, emit two records, +U and -U
> (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
> publishing the _hoodie_operation field

[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Reviewers: Danny Chen, Ethan Guo

> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
> Debezium-style change log (currently supported for CoW for Spark/Flink).
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource 
> Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
> Debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes incorporated to support existing users/usage of 
> changelog. The CDC format is more generalized in the database world. It offers 
> advantages like not requiring further downstream processing to, say, stitch 
> together +U and -U to update a downstream table. For example, a field that 
> changed may be a key in a downstream table, so we need both +U and -U to compute 
> the updates. 
>  
> (A) Introduce a new "changelog" output mode for CDC queries, which generates 
> the I/+U/-U/D format that changelog needs (this can be constructed easily by 
> processing the output of a CDC query as follows):
>  * when before is `null`, emit I
>  * when after is `null`, emit D
>  * when both are non-null, emit two records, +U and -U
> (B) New writes in 1.0 will *ONLY* produce the .cdc changelog format, and stop 
> publishing the _hoodie_operation field. 
>  # this means anyone querying this field using a snapshot query will break.
>  # we will bring this back in 1.1 etc., based on user feedback, as a hidden 
> field in the FlinkCatalog.
> (C) To support backwards compatibility, we fall back to reading 
> `_hoodie_operation` in 0.x tables. 
> For CDC reads, we first use the CDC log if it is available for that file 
> slice. If not, and the base file schema already has {{_hoodie_operation}}, we 
> fall back to reading {{_hoodie_operation}} from the base file if 
> mode=OP_KEY_ONLY. Throw an error for other modes. 
> (D) Snapshot queries from Spark, Presto, Trino, etc. all work with tables that 
> have `_hoodie_operation` published. 
>  This is already completed for Spark, so others should be easy to do. 
>  
> (E) We need to complete a review of the CDC schema.
> ts - should it be the completion time or the instant time?
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Description: 
For the sake of more consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
Debezium-style change log (currently supported for CoW for Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource 
Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|

This proposal is to converge onto "CDC" as the path going forward, with the 
following changes to incorporate. 

 

 

 

 

 

  was:
For the sake of more consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
Debezium-style change log (currently supported for CoW for Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource 
Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|


This proposal is to converge onto "CDC" as the path going forward, with the 
following changes to incorporate. 







 

 

 


> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
> Debezium-style change log (currently supported for CoW for Spark/Flink).
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource 
> Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
> Debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes to incorporate. 
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7538) Consolidate the CDC Formats (changelog format, RFC-51)

2024-05-01 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7538:
-
Description: 
For the sake of more consistency, we need to consolidate the changelog mode 
(currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
Debezium-style change log (currently supported for CoW for Spark/Flink).

 
|Format Name|CDC Source Required|Resource Cost (writer)|Resource 
Cost (reader)|Friendly to Streaming|
|CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
Debezium-style output is not what Flink needs, for example)|
|Changelog|Yes|low|low|Yes|


This proposal is to converge onto "CDC" as the path going forward, with the 
following changes to incorporate. 







 

 

 

> Consolidate the CDC Formats (changelog format, RFC-51)
> --
>
> Key: HUDI-7538
> URL: https://issues.apache.org/jira/browse/HUDI-7538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: storage-management
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> For the sake of more consistency, we need to consolidate the changelog mode 
> (currently supported for Flink MoR) and the RFC-51 based CDC feature, which is a 
> Debezium-style change log (currently supported for CoW for Spark/Flink).
>  
> |Format Name|CDC Source Required|Resource Cost (writer)|Resource 
> Cost (reader)|Friendly to Streaming|
> |CDC|*No*|low/high|low/high (based on the logging modes we choose)|No (the 
> Debezium-style output is not what Flink needs, for example)|
> |Changelog|Yes|low|low|Yes|
> This proposal is to converge onto "CDC" as the path going forward, with the 
> following changes to incorporate. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6712:
-
Status: Open  (was: Patch Available)

> Implement optimized keyed lookup on parquet files
> -
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinoth Chandar
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records based 
> on a single key lookup column. 
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List lookup(parquetFile, Set keys) (code) 
> Let's implement a reader that is optimized for this pattern, by scanning the 
> least amount of data (a sketch follows the requirements below). 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ... ) to 
> minimize the data read. 
> 5. Must make the minimum number of RPC calls to cloud storage.
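> A minimal sketch against plain parquet-mr, assuming the key column is 
> {{_hoodie_record_key}} (an assumption, not the eventual reader): pushing an 
> OR-of-equals predicate lets the reader skip row groups/pages via statistics, 
> dictionaries and bloom filters.
> {code:java}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Set;
> 
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.avro.AvroParquetReader;
> import org.apache.parquet.filter2.compat.FilterCompat;
> import org.apache.parquet.filter2.predicate.FilterApi;
> import org.apache.parquet.filter2.predicate.FilterPredicate;
> import org.apache.parquet.hadoop.ParquetReader;
> import org.apache.parquet.io.api.Binary;
> 
> class KeyedLookup {
>   static List<GenericRecord> lookup(Path parquetFile, Set<String> keys) throws IOException {
>     // Build key = k1 OR key = k2 OR ... so row groups/pages without any
>     // matching key can be skipped entirely.
>     FilterPredicate pred = keys.stream()
>         .map(k -> (FilterPredicate) FilterApi.eq(
>             FilterApi.binaryColumn("_hoodie_record_key"), Binary.fromString(k)))
>         .reduce(FilterApi::or)
>         .orElseThrow(IllegalArgumentException::new);
>     List<GenericRecord> out = new ArrayList<>();
>     try (ParquetReader<GenericRecord> reader = AvroParquetReader
>         .<GenericRecord>builder(parquetFile)
>         .withFilter(FilterCompat.get(pred))
>         .build()) {
>       GenericRecord rec;
>       while ((rec = reader.read()) != null) {
>         out.add(rec);   // multiple rows per key are returned naturally
>       }
>     }
>     return out;
>   }
> }
> {code}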



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6700) Archiving should be time based, not this min-max and not per instant. Lets treat it like a log (Phase 2)

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6700:
-
Status: Open  (was: In Progress)

> Archiving should be time based, not this min-max and not per instant. Lets 
> treat it like a log (Phase 2)
> 
>
> Key: HUDI-6700
> URL: https://issues.apache.org/jira/browse/HUDI-6700
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:26 PM:
---

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) and a new "retain" command block 
in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in different file 
groups. 
{code:java}
pointer data block {
   [
{fg1, logfile X, ..},
{fg2, logfile Y, ..},
{fg3, logfile Z, ..}
   ]
}
{code}
In this approach, instead of redistributing the records from file groups 
f1, f2, f3 to f4, f5, we will
 * log pointer data blocks to f4, f5, pointing back to the new log files W wrote 
to file groups f1, f2, f3 (X, Y, Z in the example above) 
 * log retain command blocks to f1 (with logfile X), f2 (with logfile Y), f3 (with 
logfile Z) to indicate to the cleaner service that these log files are 
supposed to be retained for later, so they can be skipped.
 * When f4, f5 are compacted, these log files will be reconciled into new file 
slices in f4, f5 
 * When the file slices in f4 and f5 are cleaned, the pointers are followed and 
the retained log files in f1, f2, f3 can also be cleaned. (note: need some 
reference counting in case f4 is cleaned while f5 is not)

Note that there is a 1:N relationship between a pointer data block and log files. 
For example, both f4's and f5's pointer data blocks will be pointing to all three 
files X, Y, Z. 

A snapshot read of the file groups f4, f5 needs to carefully filter records to 
avoid exposing duplicate records to the reader. 


 * merge the log files pointed to by pointer data blocks based on keys, i.e. 
only delete/update records in the base file for keys that match from the 
pointed log files.
 * inserts need to be handled with care and need to be distinguishable from an 
update, e.g. a record in the pointed log files for f4 can be either an insert or an 
update to a record that is now clustered into f5. 
 * the storage format also needs to store a bitmap indicating which records are 
inserts vs updates in the pointed-to log files. Once such a list is available, 
the readers of f4/f5 can split the inserts amongst themselves by using a hash 
mod of the insert key, i.e. f4 gets inserts 0, 2, 4, 6 ... in log files 
X, Y, Z, while f5 gets inserts 1, 3, 5, 7 (see the sketch below)
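
A minimal sketch of that deterministic split (hypothetical helper, assuming N 
output file groups):
{code:java}
final class InsertSplitter {
  // floorMod keeps the result non-negative even for negative hash codes, so
  // every insert key is claimed by exactly one of the N output file groups.
  static int ownerOf(String recordKey, int numOutputFileGroups) {
    return Math.floorMod(recordKey.hashCode(), numOutputFileGroups);
  }
}
// e.g. with outputs f4 (index 0) and f5 (index 1), the reader of f4 keeps
// only inserts where ownerOf(key, 2) == 0.
{code}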



h3. Pros: 
 * Truly non-blocking; C and W can go at their own pace and complete without much 
additional overhead, since the pointer blocks are pretty light to add. 
 * Works even for high-throughput streaming scenarios where there is not much 
time for the writer to be reconciling (e.g. Flink)

h3. Cons:
 * More merge costs on the read/query side (although this can be very tolerable 
for the same cases approach 1 works well for)  
 * More complexity and new storage changes. 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in a different file 
groups. 


{code:java}
pointer data block {
   [
{fg1, logfileX, ..},
{fg2, logfileY, ..},
{fg3, logfileZ, ..}
   ]
}
{code}

In this approach, instead of redistributing the records from 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer W writing to file groups f1, f2, f3, 
> concurrently while a clustering service C reclusters them into f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly maintaining the sort order achieved by clustering, in the face of 
> updates (e.g. updates change clustering field values, causing output clustering 
> file groups to be not fully sorted by those fields)

[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842466#comment-17842466
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 9:01 PM:
---

h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W, while 
minimizing the amount of work redone on the writer side, we need to introduce a 
notion of "pointer data blocks" (name TBD) in Hudi's log format. 

*Pointer data blocks* 
A pointer data block just keeps pointers to other blocks in a different file 
groups. 


{code:java}
pointer data block {
   [
{fg1, logfileX, ..},
{fg2, logfileY, ..},
{fg3, logfileZ, ..}
   ]
}
{code}

In this approach, instead of redistributing the records from 

 

 


was (Author: vc):
h2. [WIP] Approach 2 : Introduce pointer data blocks into storage format

If we truly wish to achieve independent operation of C and W, without 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer W writing to file groups f1, f2, f3, 
> concurrently while a clustering service C reclusters them into f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly maintaining the sort order achieved by clustering, in the face of 
> updates (e.g. updates change clustering field values, causing output clustering 
> file groups to be not fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1045:
-
Description: 
h4. We need to allow a writer W writing to file groups f1, f2, f3, concurrently 
while a clustering service C reclusters them into f4, f5. 

Goals
 * Writes can be either updates, deletes or inserts. 
 * Either clustering C or the writer W can finish first
 * Both W and C need to be able to complete their actions without much redoing 
of work. 
 * The number of output file groups for C can be higher or lower than input 
file groups. 
 * Need to work across and be oblivious to whether the writers are operating in 
OCC or NBCC modes
 * Needs to interplay well with cleaning and compaction services.



h4. Non-goals 
 * Strictly maintaining the sort order achieved by clustering, in the face of 
updates (e.g. updates change clustering field values, causing output clustering 
file groups to be not fully sorted by those fields)

  was:
We need to allow a writer W writing to file groups f1, f2, f3, concurrently 
while a clustering service C reclusters them into f4, f5. 
 * Writes can be either updates, deletes or inserts. 
 * Either clustering C or the writer W can finish first
 * Both W and C need to be able to complete their actions without much redoing 
of work. 
 * The number of output file groups for C can be higher or lower than input 
file groups. 
 * Need to work across and be oblivious to whether the writers are operating in 
OCC or NBCC modes
 * Needs to interplay well with cleaning and compaction services.


> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> h4. We need to allow a writer W writing to file groups f1, f2, f3, 
> concurrently while a clustering service C reclusters them into f4, f5. 
> Goals
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.
> h4. Non-goals 
>  * Strictly maintaining the sort order achieved by clustering, in the face of 
> updates (e.g. updates change clustering field values, causing output clustering 
> file groups to be not fully sorted by those fields)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---

h2. [WIP] Approach 1: Redistribute records from the conflicting file groups 

Within the finalize section (done within a table-level distributed lock), we 
could have either W or C perform the following (a conflict-detection sketch 
follows the two blocks below). 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
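
A minimal sketch of the shared first step, identifying the conflicting file 
groups (hypothetical helper operating on file group ids):
{code:java}
import java.util.HashSet;
import java.util.Set;

final class ConflictDetector {
  // Intersect the file groups W wrote to with the ones C re-clustered;
  // only records in the overlap need to be redistributed.
  static Set<String> conflictingFileGroups(Set<String> touchedByW, Set<String> rewrittenByC) {
    Set<String> overlap = new HashSet<>(touchedByW);
    overlap.retainAll(rewrittenByC);
    return overlap;
  }
}
{code}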
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries; W/C absorbs that cost. 

{*}Cons{*}:
 # Can be pretty wasteful for continuous writers or with high overlap between C 
and W, effectively forcing the entire write to be redone (same as a writer 
failing and retrying today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files with incoming records. 

 

 

 


was (Author: vc):
h2. [WIP] Approach 1: Redistribute records from the conflicting file groups 

Within the finalize section (done within a table-level distributed lock), we 
could have either W or C perform the following. 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 

 

 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
 

 
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where 
 # Absorbs any read amplification. 

h3.  

Cons:
 # sort order may be disturbed from the re-distribution of keys. 
 # Can be pretty wasteful, if 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer W writing to file groups f1, f2, f3, concurrently 
> while a clustering service C reclusters them into f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-1045) Support updates during clustering

2024-04-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842465#comment-17842465
 ] 

Vinoth Chandar edited comment on HUDI-1045 at 4/30/24 8:27 PM:
---

h2. [WIP] Approach 1: Redistribute records from the conflicting file groups 

Within the finalize section (done within a table-level distributed lock), we 
could have either W or C perform the following. 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries; W/C absorbs that cost. 

{*}Cons{*}:
 # Can be pretty wasteful for continuous writers or with high overlap between C 
and W, effectively forcing the entire write to be redone (same as a writer 
failing and retrying today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files with incoming records. 


was (Author: vc):
h2. [WIP] Approach 1: Redistribute records from the conflicting file groups 

Within the finalize section (done within a table-level distributed lock), we 
could have either W or C perform the following. 
{code:java}
W {
  - identify the file groups that have been clustered concurrently by C
  - Read out all records written by W, into these conflicting file groups
  - Redistribute records based on new records distribution based on C
  - finalize W
} {code}
 
{code:java}
C {
  - identify the file groups that have been written to concurrently by W.
  - Read out all records written by such W, into conflicting file groups
  - Redistribute records based on new records distribution, based on C
  - finalize C
}
{code}
h3. Pros: 
 # Simple to understand/debug, no storage format changes. 
 # Could work well for cases where the overlap between C and W is rather small.
 # No extra read amplification for queries; W/C absorbs that cost. 

{*}Cons{*}:
 # Can be pretty wasteful for continuous writers or with high overlap between C 
and W, effectively forcing the entire write to be redone (same as a writer 
failing and retrying today)
 # Particularly more expensive for CoW, where W has paid the cost of merging 
columnar base files with incoming records. 

 

 

 

> Support updates during clustering
> -
>
> Key: HUDI-1045
> URL: https://issues.apache.org/jira/browse/HUDI-1045
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, table-service
>Reporter: leesf
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to allow a writer W writing to file groups f1, f2, f3, concurrently 
> while a clustering service C reclusters them into f4, f5. 
>  * Writes can be either updates, deletes or inserts. 
>  * Either clustering C or the writer W can finish first
>  * Both W and C need to be able to complete their actions without much 
> redoing of work. 
>  * The number of output file groups for C can be higher or lower than input 
> file groups. 
>  * Need to work across and be oblivious to whether the writers are operating 
> in OCC or NBCC modes
>  * Needs to interplay well with cleaning and compaction services.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


  1   2   3   4   5   6   7   8   9   10   >