[jira] [Closed] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-06-01 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7007.
-
Resolution: Done

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the index created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks support for using 
> bloom filters with functional indexes in the reader path.
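The reader-path integration amounts to consulting per-file bloom filters built over the indexed expression before scanning, and skipping files whose filter definitely excludes the looked-up key. A minimal, self-contained sketch of that pruning idea follows; all class and method names here are illustrative stand-ins, not Hudi's actual `FunctionalIndexSupport` API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of reader-side pruning with a functional index backed by bloom filters.
public class FunctionalBloomPruneSketch {

  // Minimal bloom filter over strings: two hash probes into a fixed bit set.
  static final class Bloom {
    private static final int SIZE = 1024;
    private final BitSet bits = new BitSet(SIZE);

    void add(String key) {
      bits.set(h1(key));
      bits.set(h2(key));
    }

    // May return true for absent keys (false positive), never false for present ones.
    boolean mightContain(String key) {
      return bits.get(h1(key)) && bits.get(h2(key));
    }

    private static int h1(String k) { return Math.floorMod(k.hashCode(), SIZE); }
    private static int h2(String k) { return Math.floorMod(k.hashCode() * 31 + 17, SIZE); }
  }

  // Build one filter per data file over expr(value), e.g. expr = lower(col).
  static Map<String, Bloom> buildIndex(Map<String, List<String>> fileToValues,
                                       Function<String, String> expr) {
    Map<String, Bloom> index = new HashMap<>();
    fileToValues.forEach((file, values) -> {
      Bloom bloom = new Bloom();
      values.forEach(v -> bloom.add(expr.apply(v)));
      index.put(file, bloom);
    });
    return index;
  }

  // Keep only files whose filter might contain the key; skipped files are
  // guaranteed not to contain it (bloom filters have no false negatives).
  static List<String> prune(Map<String, Bloom> index, String key) {
    List<String> kept = new ArrayList<>();
    index.forEach((file, bloom) -> {
      if (bloom.mightContain(key)) {
        kept.add(file);
      }
    });
    Collections.sort(kept);
    return kept;
  }
}
```

Because false positives are possible, pruning can only ever keep extra files, never drop a matching one.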



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7825) Support Report pending clustering and compaction plan metric

2024-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7825:
-
Labels: pull-request-available  (was: )

> Support Report pending clustering and compaction plan metric 
> -
>
> Key: HUDI-7825
> URL: https://issues.apache.org/jira/browse/HUDI-7825
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jack Lei
>Priority: Major
>  Labels: pull-request-available
>
> 1. When only async clustering or async compaction scheduling is enabled, and 
> clustering.async.enabled or compaction.async.enabled is set to false, the 
> Flink job does not add clusterPlanOperator or CompactionPlanOperator.
> 2. However, the pending-plan metric is emitted in clusterPlanOperator or 
> CompactionPlanOperator.
> 3. So we could support emitting the pending-plan metric in 
> StreamWriteOperatorCoordinator instead.
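The proposal above can be sketched as registering the pending-plan gauges on the coordinator, which exists regardless of whether the plan operators were added. All types and names below (`MetricGroup`, `registerGauges`, the gauge names) are illustrative stand-ins, not the actual Flink or Hudi APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: emit pending clustering/compaction plan counts from a coordinator
// instead of from the (possibly absent) plan operators.
public class CoordinatorMetricsSketch {

  // Stand-in for a metrics registry that accepts lazily-evaluated gauges.
  interface MetricGroup {
    void gauge(String name, Supplier<Integer> value);
  }

  static final class SimpleMetricGroup implements MetricGroup {
    final Map<String, Supplier<Integer>> gauges = new HashMap<>();

    public void gauge(String name, Supplier<Integer> value) {
      gauges.put(name, value);
    }

    int read(String name) {
      return gauges.get(name).get();
    }
  }

  // Stand-in for the timeline state the coordinator can already observe.
  static final class Timeline {
    int pendingClusteringPlans;
    int pendingCompactionPlans;
  }

  // What StreamWriteOperatorCoordinator could do at startup: register gauges
  // that read the pending-plan counts on demand.
  static void registerGauges(MetricGroup metrics, Timeline timeline) {
    metrics.gauge("pendingClusteringPlans", () -> timeline.pendingClusteringPlans);
    metrics.gauge("pendingCompactionPlans", () -> timeline.pendingCompactionPlans);
  }
}
```

Registering on the coordinator keeps the metric available even when clustering.async.enabled or compaction.async.enabled is false.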





[jira] [Created] (HUDI-7825) Support Report pending clustering and compaction plan metric

2024-06-01 Thread jack Lei (Jira)
jack Lei created HUDI-7825:
--

 Summary: Support Report pending clustering and compaction plan 
metric 
 Key: HUDI-7825
 URL: https://issues.apache.org/jira/browse/HUDI-7825
 Project: Apache Hudi
  Issue Type: Bug
Reporter: jack Lei


1. When only async clustering or async compaction scheduling is enabled, and 
clustering.async.enabled or compaction.async.enabled is set to false, the Flink 
job does not add clusterPlanOperator or CompactionPlanOperator.

2. However, the pending-plan metric is emitted in clusterPlanOperator or 
CompactionPlanOperator.

3. So we could support emitting the pending-plan metric in StreamWriteOperatorCoordinator instead.





[jira] [Assigned] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-06-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7824:
-

Assignee: sivabalan narayanan

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
> then when the savepoint is later removed, the cleaner should account for 
> cleaning up the commit of interest.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.
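The intended behavior can be sketched as a fallback in the planning path: if any savepoint that existed at the last clean has since been removed, consider all partitions rather than only the incrementally touched ones. All names below are hypothetical, not Hudi's actual clean planner API.

```java
import java.util.Set;

// Sketch: incremental clean planning with a full-partition fallback when a
// savepoint removal is detected since the last clean.
public class IncrCleanPlanSketch {

  // A savepoint was removed if the current set no longer covers the old set.
  static boolean savepointRemovedSinceLastClean(Set<String> savepointsAtLastClean,
                                                Set<String> savepointsNow) {
    return !savepointsNow.containsAll(savepointsAtLastClean);
  }

  // Fast path: only the partitions touched since the last clean.
  // Fallback: all partitions, so files of the formerly savepointed commit
  // are re-examined and eventually cleaned.
  static Set<String> partitionsToClean(Set<String> incrementalPartitions,
                                       Set<String> allPartitions,
                                       Set<String> savepointsAtLastClean,
                                       Set<String> savepointsNow) {
    return savepointRemovedSinceLastClean(savepointsAtLastClean, savepointsNow)
        ? allPartitions
        : incrementalPartitions;
  }
}
```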





[jira] [Updated] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7824:
-
Labels: pull-request-available  (was: )

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
> then when the savepoint is later removed, the cleaner should account for 
> cleaning up the commit of interest.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.





[jira] [Created] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7824:
-

 Summary: Fix incremental partitions fetch logic when savepoint is 
removed for Incr cleaner
 Key: HUDI-7824
 URL: https://issues.apache.org/jira/browse/HUDI-7824
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


With the incremental cleaner, if a savepoint is blocking the clean-up of a 
commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
then when the savepoint is later removed, the cleaner should account for 
cleaning up the commit of interest.

Let's ensure the clean planner accounts for all partitions when such a 
savepoint removal is detected.





[jira] [Updated] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7823:
-
Labels: pull-request-available  (was: )

> Simplify dependency management on exclusions
> 
>
> Key: HUDI-7823
> URL: https://issues.apache.org/jira/browse/HUDI-7823
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7823:
---

 Summary: Simplify dependency management on exclusions
 Key: HUDI-7823
 URL: https://issues.apache.org/jira/browse/HUDI-7823
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7822:
-
Labels: pull-request-available  (was: )

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Commented] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851205#comment-17851205
 ] 

Ethan Guo commented on HUDI-7822:
-

https://github.com/apache/hudi/pull/10931

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7822:
---

 Summary: Resolve the conflicts between mixed hdfs and local path 
in Flink tests
 Key: HUDI-7822
 URL: https://issues.apache.org/jira/browse/HUDI-7822
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7822:

Fix Version/s: 1.0.0

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7821:
-
Labels: pull-request-available  (was: )

> Handle schema evolution in proto to avro conversion
> ---
>
> Key: HUDI-7821
> URL: https://issues.apache.org/jira/browse/HUDI-7821
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> Users can encounter errors when a batch of data was written with an older 
> schema and the new schema has fields that are not present in the old data.
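One common way to handle this, sketched below with plain maps rather than actual protobuf/Avro objects, is to conform each old record to the new schema by filling absent fields with that schema's defaults. All names here are illustrative, not the converter's real API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: schema-evolution-tolerant conversion. A record written with an older
// schema may lack fields the new schema defines; fill them with defaults
// instead of failing.
public class SchemaEvolveSketch {

  // newSchemaDefaults maps every field of the new schema to its default value.
  static Map<String, Object> conform(Map<String, Object> oldRecord,
                                     Map<String, Object> newSchemaDefaults) {
    Map<String, Object> out = new LinkedHashMap<>();
    newSchemaDefaults.forEach((field, dflt) ->
        // Keep the old value when present, otherwise the new schema's default.
        out.put(field, oldRecord.getOrDefault(field, dflt)));
    return out;
  }
}
```

This mirrors Avro-style schema resolution, where reader-schema fields missing from the writer schema take their declared defaults.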





[jira] [Created] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-7821:
---

 Summary: Handle schema evolution in proto to avro conversion
 Key: HUDI-7821
 URL: https://issues.apache.org/jira/browse/HUDI-7821
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Timothy Brown


Users can encounter errors when a batch of data was written with an older 
schema and the new schema has fields that are not present in the old data.





[jira] [Closed] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7811.
-
Resolution: Fixed

Fixed in the original PR itself - 
https://github.com/apache/hudi/pull/11043#discussion_r1621825753

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
> -
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> It will help avoid calling FSUtils.getRelativePartitionPath - 
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651
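The enhancement amounts to returning (partition path, file name) pairs from the pruning helper so callers no longer re-derive the partition via FSUtils.getRelativePartitionPath. A minimal sketch with hypothetical names (the real method lives on `SparkBaseIndexSupport`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: pruning helper that carries the partition path alongside each file
// name, sparing callers a second relative-path computation.
public class PrunedFilesSketch {

  static final class PartitionedFile {
    final String partitionPath;
    final String fileName;

    PartitionedFile(String partitionPath, String fileName) {
      this.partitionPath = partitionPath;
      this.fileName = fileName;
    }
  }

  // Flatten a partition-to-files mapping into explicit (partition, file) pairs.
  static List<PartitionedFile> prunedFiles(Map<String, List<String>> partitionToFiles) {
    List<PartitionedFile> out = new ArrayList<>();
    partitionToFiles.forEach((partition, files) ->
        files.forEach(f -> out.add(new PartitionedFile(partition, f))));
    return out;
  }
}
```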





[jira] [Assigned] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7811:
-

Assignee: Sagar Sumit

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
> -
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> It will help avoid calling FSUtils.getRelativePartitionPath - 
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651





[jira] [Created] (HUDI-7820) For bloom index reader path, prune based on min/max if colstats is enabled

2024-05-31 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7820:
-

 Summary: For bloom index reader path, prune based on min/max if 
colstats is enabled
 Key: HUDI-7820
 URL: https://issues.apache.org/jira/browse/HUDI-7820
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.1.0, 1.0.0


Bloom filters can result in false positives. We can try to prune files based on 
min/max if colstats is available for the field. 
https://github.com/apache/hudi/pull/11043#discussion_r1621639791
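The pruning order described above can be sketched as a range check first, bloom filter second: a key outside a file's [min, max] range definitely cannot be present, so the bloom filter (which can return false positives) is consulted only when the range check passes. The helper below is illustrative, not Hudi's reader code.

```java
// Sketch: combine column-stats min/max pruning with a bloom filter check.
public class MinMaxThenBloomSketch {

  // min/max come from column stats for the field in a given file;
  // bloomSaysMaybe is the bloom filter's answer for the key.
  static boolean mayContain(String key, String min, String max, boolean bloomSaysMaybe) {
    if (key.compareTo(min) < 0 || key.compareTo(max) > 0) {
      // Outside the file's value range: definitely not present, prune the file.
      return false;
    }
    // Inside the range: defer to the bloom filter (may still be a false positive).
    return bloomSaysMaybe;
  }
}
```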





[jira] [Updated] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7819:
-
Labels: pull-request-available  (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7819
> URL: https://issues.apache.org/jira/browse/HUDI-7819
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread bradley (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bradley closed HUDI-7810.
-
Resolution: Later

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fixed in PR: [https://github.com/apache/hudi/pull/11359]





[jira] [Created] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread bradley (Jira)
bradley created HUDI-7819:
-

 Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value 
bug
 Key: HUDI-7819
 URL: https://issues.apache.org/jira/browse/HUDI-7819
 Project: Apache Hudi
  Issue Type: Bug
Reporter: bradley








[jira] [Updated] (HUDI-7818) Flink Table planner not loading problem

2024-05-30 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7818:
-
Sprint: Sprint 2023-04-26

> Flink Table planner not loading problem
> ---
>
> Key: HUDI-7818
> URL: https://issues.apache.org/jira/browse/HUDI-7818
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7818) Flink Table planner not loading problem

2024-05-30 Thread Danny Chen (Jira)
Danny Chen created HUDI-7818:


 Summary: Flink Table planner not loading problem
 Key: HUDI-7818
 URL: https://issues.apache.org/jira/browse/HUDI-7818
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 1.0.0








[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7817:
-
Labels: pull-request-available  (was: )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is an older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core). 
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities, which 
> should be avoided.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of Jackson Core 
(com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoided.  (was: org.codehaus.jackson is a older version of Jackson 
Core (com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoid.)

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core).  
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
> should be avoided.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of Jackson Core 
(com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoid.  (was: org.codehaus.jackson is a older version of )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core).  
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
> should be avoid.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of 

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of 





[jira] [Assigned] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7817:
---

Assignee: Ethan Guo

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Fix Version/s: 1.0.0

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7817:
---

 Summary: Use Jackson Core instead of org.codehaus.jackson for JSON 
encoding
 Key: HUDI-7817
 URL: https://issues.apache.org/jira/browse/HUDI-7817
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7816) Pass the source profile to the snapshot query splitter

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7816:
-
Labels: pull-request-available  (was: )

> Pass the source profile to the snapshot query splitter
> --
>
> Key: HUDI-7816
> URL: https://issues.apache.org/jira/browse/HUDI-7816
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7816) Pass the source profile to the snapshot query splitter

2024-05-30 Thread Rajesh Mahindra (Jira)
Rajesh Mahindra created HUDI-7816:
-

 Summary: Pass the source profile to the snapshot query splitter
 Key: HUDI-7816
 URL: https://issues.apache.org/jira/browse/HUDI-7816
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Rajesh Mahindra








[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could, on rare occasions, lead to 
data consistency issues. We should come up with proper guards to ensure we do 
not make such unintended archivals.

 

The major gap we want to guard against is:

if someone disabled the cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on the latest clean commit metadata. But there are a few 
other scenarios that need to be accounted for.

 

a. Keeping replace commits aside, let's dive into specifics for regular commits 
and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival 
configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
versions created at or before t6. Say the cleaner did not run (for whatever 
reason) for the next 5 commits.

Archival will certainly be guarded until the earliest commit to retain based on 
the latest clean commit.

Corner case to consider:

A savepoint was added at, say, t3 and later removed, and the cleaner was still 
never re-enabled. Even though archival would have been stopped at t3 (while the 
savepoint was present), once the savepoint is removed, if archival is executed, 
it could archive commit t3. That means the file versions tracked at t3 have 
still not been cleaned by the cleaner.

Reasoning:

We are good here w.r.t. data consistency. Until the cleaner next runs, these 
older file versions might be exposed to the end user. But time-travel queries 
are not intended for already-cleaned-up commits, so this is not an issue. None 
of the snapshot, time-travel, or incremental queries will run into issues, as 
they are not supposed to poll for t3.

At any later point, if the cleaner is re-enabled, it will take care of cleaning 
up the file versions tracked at the t3 commit. It is just that, for the interim 
period, some older file versions might still be exposed to readers.

 

b. The trickier part is when replace commits are involved. Since the replace 
commit metadata in the active timeline is what ensures the replaced file groups 
are ignored for reads, the cleaner is expected to clean them up fully before 
that metadata is archived. But are there chances this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a 
savepoint and t4 is a replace commit which replaced file groups tracked in t3.

The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
the savepoint) but will clean up t4, t5 and t6. So the earliest commit to 
retain will be pointing to t6. Now say the savepoint for t3 is removed while 
the cleaner is disabled. In this state of the timeline, if archival is executed 
(since t3's savepoint is removed), archival might archive t3 and t4.rc. This 
could lead to data duplicates, as both the replaced file groups and the new 
file groups from t4.rc would be exposed as valid file groups.

 

In other words, to summarize the different scenarios:

i. The replaced file group is never cleaned up:
    ECTR (earliest commit to retain) is less than this.rc, and we are good.
ii. The replaced file group is cleaned up:
    ECTR is greater than this.rc, and it is safe to archive.
iii. The tricky one: ECTR moved ahead of this.rc, but due to a savepoint, full 
clean-up did not happen. After the savepoint is removed, when archival is 
executed, we should avoid archiving the replace commit of interest. This is the 
gap we do not account for as of now.
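The guard needed for scenario iii above can be sketched as: ECTR having moved past a replace commit is necessary but not sufficient to archive it; the replaced file groups must also have been fully cleaned. A hypothetical condition (string-ordered instant times, as in Hudi timelines; names are illustrative):

```java
// Sketch: archival guard for replace commits. A replacecommit may only be
// archived when the earliest commit to retain (ECTR) has moved past it AND
// its replaced file groups were actually cleaned, which a savepoint may have
// prevented even though ECTR advanced.
public class ArchivalGuardSketch {

  static boolean canArchive(String replaceCommitTime,
                            String earliestCommitToRetain,
                            boolean replacedGroupsFullyCleaned) {
    // ECTR ahead of the replacecommit is necessary but not sufficient.
    return replaceCommitTime.compareTo(earliestCommitToRetain) < 0
        && replacedGroupsFullyCleaned;
  }
}
```

In the t3/t4 example above, t4.rc sits behind ECTR = t6 but its replaced groups were skipped by the cleaner, so the guard keeps it in the active timeline.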

 

We have 3 options to solve this.

Option A:

Let the savepoint deletion flow take care of cleaning up the files it is 
tracking.

Cons:

Removing data files is not the savepoint's responsibility, so from a 
single-responsibility standpoint this may not be right. Also, this clean-up 
might need to do what a clean planner would do: build the file system view, 
determine whether a file is already supposed to be cleaned up, and only then 
delete the files that qualify. For example, a file group with only one file 
slice should not be cleaned up, and there are other scenarios like this.

 

Option B:

Since archival is the one that might cause data consistency issues, why not 
have archival do the clean-up?

We would need to account for concurrent cleans, failure and retry scenarios, 
etc. Also, we might need to build the file system view and then decide whether 
something needs to be cleaned up before archiving anything.

Cons:

Again, the single-responsibility rule would be broken. It would be neater if 
the cleaner took care of deleting data files and archival only took care of 
deleting/archiving timeline files.

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track another metadata entry named "EarliestCommitToArchive". Strictly 
speaking, ear

[jira] [Closed] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-05-30 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7407.
-
Resolution: Fixed

> Add optional clean support to standalone compaction and clustering jobs
> ---
>
> Key: HUDI-7407
> URL: https://issues.apache.org/jira/browse/HUDI-7407
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Lets add top level config to standalone compaction and clustering job to 
> optionally clean. 





[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7815:
-
Labels: pull-request-available  (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh 
> timeline
> 
>
> Key: HUDI-7815
> URL: https://issues.apache.org/jira/browse/HUDI-7815
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-30 Thread xy (Jira)
xy created HUDI-7815:


 Summary: Multiple writer with bulkinsert 
getAllPendingClusteringPlans should refresh timeline
 Key: HUDI-7815
 URL: https://issues.apache.org/jira/browse/HUDI-7815
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: xy
Assignee: xy








[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7807:

Sprint: Sprint 2023-04-26

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.

[jira] [Updated] (HUDI-7791) Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7791:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle
> ---
>
> Key: HUDI-7791
> URL: https://issues.apache.org/jira/browse/HUDI-7791
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7796) Gracefully cast file system instance in Avro writers

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7796:

Sprint: Sprint 2023-04-26

> Gracefully cast file system instance in Avro writers
> 
>
> Key: HUDI-7796
> URL: https://issues.apache.org/jira/browse/HUDI-7796
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When running tests in Trino with the Hudi metadata table (MDT) enabled, the following line in 
> HoodieAvroHFileWriter throws a class cast exception, because Trino uses 
> dependency injection to provide the Hadoop file system instance, which may 
> skip the Hudi wrapper file system logic.
> {code:java}
>     this.fs = (HoodieWrapperFileSystem) this.file.getFileSystem(conf); {code}
> {code:java}
> Caused by: java.lang.ClassCastException: class 
> io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper cannot be cast to class 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem 
> (io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper and 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem are in unnamed module of 
> loader 'app')
>     at 
> org.apache.hudi.io.hadoop.HoodieAvroHFileWriter.(HoodieAvroHFileWriter.java:91)
>     at 
> org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory.newHFileFileWriter(HoodieAvroFileWriterFactory.java:108)
>     at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriterByFormat(HoodieFileWriterFactory.java:70)
>     at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:53)
>     at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:108)
>     at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:77)
>     at 
> org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:45)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:101)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:44)
>  {code}
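The fix the title suggests, casting gracefully instead of unconditionally, can be sketched with stand-in classes (FileSystemLike, HoodieWrapperFs, and ForeignWrapperFs below are illustrative only, not Hudi's real types): check the runtime type first and re-wrap foreign instances rather than casting blindly.

```java
// Minimal stand-in sketch of the "gracefully cast" pattern. The class names
// are illustrative; Hudi's real types are FileSystem and HoodieWrapperFileSystem.
interface FileSystemLike {}

class HoodieWrapperFs implements FileSystemLike {}   // stands in for HoodieWrapperFileSystem
class ForeignWrapperFs implements FileSystemLike {}  // stands in for an injected wrapper (e.g. Trino's)

public class GracefulCast {
    // Return the instance as-is when it is already the expected wrapper;
    // otherwise re-wrap it instead of throwing ClassCastException.
    static HoodieWrapperFs toHoodieFs(FileSystemLike fs) {
        if (fs instanceof HoodieWrapperFs) {
            return (HoodieWrapperFs) fs;
        }
        return new HoodieWrapperFs(); // re-wrap the foreign instance
    }

    public static void main(String[] args) {
        // The unconditional cast would fail here; the guarded version does not.
        HoodieWrapperFs fs = toHoodieFs(new ForeignWrapperFs());
        System.out.println(fs.getClass().getSimpleName()); // HoodieWrapperFs
    }
}
```

This only models the shape of the change; the actual Hudi patch may wrap differently.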





[jira] [Updated] (HUDI-7801) Directly pass down HoodieStorage instance instead of recreation

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7801:

Sprint: Sprint 2023-04-26

> Directly pass down HoodieStorage instance instead of recreation
> ---
>
> Key: HUDI-7801
> URL: https://issues.apache.org/jira/browse/HUDI-7801
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There are places that use HoodieStorage#newInstance to recreate the HoodieStorage 
> instance, which may not be necessary.





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Sprint: Sprint 2023-04-26

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7769) Fix Hudi CDC read with legacy parquet file format on Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7769:

Sprint: Sprint 2023-04-26

> Fix Hudi CDC read with legacy parquet file format on Spark
> --
>
> Key: HUDI-7769
> URL: https://issues.apache.org/jira/browse/HUDI-7769
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Sprint: Sprint 2023-04-26

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without 
> "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
> query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
> Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at 
> org.apache.spark.sql.execution.datasource

[jira] [Updated] (HUDI-7790) Revert changes in DFSPathSelector and UtilHelpers.readConfig

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7790:

Sprint: Sprint 2023-04-26

> Revert changes in DFSPathSelector and UtilHelpers.readConfig
> 
>
> Key: HUDI-7790
> URL: https://issues.apache.org/jira/browse/HUDI-7790
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> This is to avoid behavior changes in DFSPathSelector and keep the 
> UtilHelpers.readConfig API the same as before.
>  





[jira] [Updated] (HUDI-7792) Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7792:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver
> -
>
> Key: HUDI-7792
> URL: https://issues.apache.org/jira/browse/HUDI-7792
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7784:

Sprint: Sprint 2023-04-26

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
> Key: HUDI-7784
> URL: https://issues.apache.org/jira/browse/HUDI-7784
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7785) Keep public APIs in utilities module the same as before HoodieStorage abstraction

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7785:

Sprint: Sprint 2023-04-26

> Keep public APIs in utilities module the same as before HoodieStorage 
> abstraction
> -
>
> Key: HUDI-7785
> URL: https://issues.apache.org/jira/browse/HUDI-7785
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> BaseErrorTableWriter, HoodieStreamer, StreamSync, etc., are public API 
> classes and contain public API methods, which should be kept the same as 
> before.





[jira] [Updated] (HUDI-7794) Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7794:

Sprint: Sprint 2023-04-26

> Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4
> -
>
> Key: HUDI-7794
> URL: https://issues.apache.org/jira/browse/HUDI-7794
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7798) Mark configs included in 0.15.0 release

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7798:

Sprint: Sprint 2023-04-26

> Mark configs included in 0.15.0 release
> ---
>
> Key: HUDI-7798
> URL: https://issues.apache.org/jira/browse/HUDI-7798
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need to mark the configs that go out in the 0.15.0 release with 
> `.sinceVersion("0.15.0")`.
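The tagging can be sketched with a minimal stand-in for the fluent ConfigProperty builder (the class below and the key name are illustrative, not Hudi's actual implementation):

```java
// Minimal stand-in for Hudi's ConfigProperty builder, showing the
// .sinceVersion("0.15.0") tag every config shipped in 0.15.0 should carry.
public class ConfigProperty<T> {
    private final String key;
    private final T defaultValue;
    private String sinceVersion;

    private ConfigProperty(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    static <T> ConfigProperty<T> key(String key, T defaultValue) {
        return new ConfigProperty<>(key, defaultValue);
    }

    ConfigProperty<T> sinceVersion(String version) {
        this.sinceVersion = version;
        return this; // fluent style, as in Hudi's builder
    }

    String getSinceVersion() { return sinceVersion; }

    public static void main(String[] args) {
        ConfigProperty<Boolean> flag = ConfigProperty
            .key("hoodie.example.flag", false) // made-up key, for illustration
            .sinceVersion("0.15.0");
        System.out.println(flag.getSinceVersion()); // 0.15.0
    }
}
```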





[jira] [Updated] (HUDI-7802) Fix bundle validation scripts

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7802:

Sprint: Sprint 2023-04-26

> Fix bundle validation scripts
> -
>
> Key: HUDI-7802
> URL: https://issues.apache.org/jira/browse/HUDI-7802
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Issues:
>  * Bundle validation with packaging/bundle-validation/ci_run.sh fails for 
> the release-0.15.0 branch due to a script issue.
>  * scripts/release/validate_staged_bundles.sh needs to include additional 
> bundles.
>  * Add release candidate validation on Scala 2.13 bundles.
>  * Disable release candidate validation by default.





[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Sprint: Sprint 2023-04-26

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>






[jira] [Updated] (HUDI-7786) Fix roaring bitmap dependency in hudi-integ-test-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7786:

Sprint: Sprint 2023-04-26

> Fix roaring bitmap dependency in hudi-integ-test-bundle
> ---
>
> Key: HUDI-7786
> URL: https://issues.apache.org/jira/browse/HUDI-7786
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7788) Fixing exception handling in AverageRecordSizeUtils

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7788:

Sprint: Sprint 2023-04-26

> Fixing exception handling in AverageRecordSizeUtils
> ---
>
> Key: HUDI-7788
> URL: https://issues.apache.org/jira/browse/HUDI-7788
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We should catch Throwable to avoid any issue during record size estimation.
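The described change amounts to widening the catch so that even Errors thrown during estimation fall back to a default size. A self-contained sketch (method and class names are illustrative, not the actual AverageRecordSizeUtils API):

```java
import java.util.function.LongSupplier;

public class SizeEstimate {
    // Fall back to a default estimate when anything, including Errors,
    // is thrown during record size estimation; estimation failures should
    // never fail the write path itself.
    static long averageRecordSize(LongSupplier estimator, long fallback) {
        try {
            return estimator.getAsLong();
        } catch (Throwable t) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(averageRecordSize(() -> 512L, 1024L)); // 512
        System.out.println(averageRecordSize(() -> { throw new OutOfMemoryError(); }, 1024L)); // 1024
    }
}
```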





[jira] [Updated] (HUDI-7777) Allow HoodieTableMetaClient to take HoodieStorage instance directly

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7777:

Sprint: Sprint 2023-04-26

> Allow HoodieTableMetaClient to take HoodieStorage instance directly
> 
>
> Key: HUDI-7777
> URL: https://issues.apache.org/jira/browse/HUDI-7777
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need the functionality for the meta client to take a HoodieStorage instance directly.





[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Fix Version/s: 1.0.0
   0.16.0

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>






[jira] [Assigned] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7814:
---

Assignee: Ethan Guo

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7814:
-
Labels: pull-request-available  (was: )

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7814:
---

 Summary: Exclude unused transitive dependencies that introduce 
vulnerabilities
 Key: HUDI-7814
 URL: https://issues.apache.org/jira/browse/HUDI-7814
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Commented] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer

2024-05-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850517#comment-17850517
 ] 

sivabalan narayanan commented on HUDI-7211:
---

For auto record key generation, you need to set the operation type to "INSERT". Could you 
give that a try?

> Relax need of ordering/precombine field for tables with autogenerated record 
> keys for DeltaStreamer
> ---
>
> Key: HUDI-7211
> URL: https://issues.apache.org/jira/browse/HUDI-7211
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> [https://github.com/apache/hudi/issues/10233]
>  
> ```
> NOW=$(date '+%Y%m%dt%H%M%S')
> ${SPARK_HOME}/bin/spark-submit \
> --jars 
> ${path_prefix}/jars/${SPARK_V}/hudi-spark${SPARK_VERSION}-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> ${path_prefix}/jars/${SPARK_V}/hudi-utilities-slim-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --target-base-path ${path_prefix}/testcases/stocks/data/target/${NOW} \
> --target-table stocks${NOW} \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --props ${path_prefix}/testcases/stocks/configs/hoodie.properties \
> --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.source.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.target.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --op UPSERT \
> --spark-master yarn \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=${path_prefix}/testcases/stocks/data/source_without_ts
>  \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
> --hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
> --hoodie-conf hoodie.metadata.enable=true
> ```





[jira] [Created] (HUDI-7813) Hive Style partitioning on a bootstrap table is not configurable

2024-05-29 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7813:
-

 Summary: Hive Style partitioning on a bootstrap table is not 
configurable
 Key: HUDI-7813
 URL: https://issues.apache.org/jira/browse/HUDI-7813
 Project: Apache Hudi
  Issue Type: Bug
  Components: bootstrap
Reporter: Jonathan Vexler


I modified DecodedBootstrapPartitionPathTranslator to be:
{code:java}
import java.util.Arrays;
import java.util.stream.Collectors;

import org.apache.hudi.common.util.PartitionPathEncodeUtils;

public class DecodedBootstrapPartitionPathTranslator extends BootstrapPartitionPathTranslator {

  public DecodedBootstrapPartitionPathTranslator() {
    super();
  }

  @Override
  public String getBootstrapTranslatedPath(String bootStrapPartitionPath) {
    String pathMaybeWithHive = PartitionPathEncodeUtils.unescapePathName(bootStrapPartitionPath);
    if (pathMaybeWithHive.contains("=")) {
      // Strip the "column=" prefix from each hive-style path segment.
      return Arrays.stream(pathMaybeWithHive.split("/")).map(split -> {
        if (split.contains("=")) {
          return split.split("=")[1];
        } else {
          return split;
        }
      }).collect(Collectors.joining("/"));
    }
    return pathMaybeWithHive;
  }
}{code}
Setting hive-style partitioning to true does not add the "column=" prefix back.
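For illustration, the stripping logic above can be exercised standalone (a self-contained copy using only the JDK; PathTranslate is a made-up name). It also shows why the prefix cannot be added back: the translation discards the column names.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Standalone copy of the translation logic above: each hive-style segment
// "column=value" is reduced to "value"; non-hive paths pass through unchanged.
public class PathTranslate {
    static String translate(String path) {
        if (!path.contains("=")) {
            return path;
        }
        return Arrays.stream(path.split("/"))
            .map(s -> s.contains("=") ? s.split("=")[1] : s)
            .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        System.out.println(translate("year=2024/month=05/day=29")); // 2024/05/29
        System.out.println(translate("2024/05/29"));                // unchanged
    }
}
```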





[jira] [Assigned] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7812:
-

Assignee: sivabalan narayanan

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time-travel query to 
> read input records. But the query side fails if there are any pending commits 
> (due to new ingestion) whose timestamp is earlier than the clustering instant 
> time. We need to relax this constraint. 
>  
> {code:java}
> Failed to execute CLUSTERING service
>     java.util.concurrent.CompletionException: 
> org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp 
> '20240406123837295' must be earlier than the first incomplete commit 
> timestamp '20240406123834233'.
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_392-internal]
>         at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
>     Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time 
> travel's timestamp '20240406123837295' must be earlier than the first 
> incomplete commit timestamp '20240406123834233'.
>         at 
> org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at scala.Option.foreach(Option.scala:407) 
> ~[scala-library-2.12.17.jar:?]
>         at 
> org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) 
> ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>  ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) 
> ~[scala-library-2.12.17.jar:?]
>         at scala.collection.Iterator$$anon$11

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer flow enabled, we trigger a time-travel query to read 
input records. But the query side fails if there are any pending commits (due 
to new ingestion) whose timestamp is earlier than the clustering instant time. We 
need to relax this constraint. 

 
{code:java}
Failed to execute CLUSTERING service
    java.util.concurrent.CompletionException: 
org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp 
'20240406123837295' must be earlier than the first incomplete commit timestamp 
'20240406123834233'.
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_392-internal]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_392-internal]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
    Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time 
travel's timestamp '20240406123837295' must be earlier than the first 
incomplete commit timestamp '20240406123834233'.
        at 
org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?]
        at 
org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) 
~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
 ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) 
~[scala-library-2.12.17.jar:?]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) 
~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) 
~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
 ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196) 
~[scala-library-2.12.17.jar:?]
        at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:943) 
~[scala-library-2.12.17.

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 

 

 

  was:
With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 


> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp < clustering instant time. We 
> need to relax this constraint. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7812:
-
Labels: pull-request-available  (was: )

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp < clustering instant time. We 
> need to relax this constraint. 
>  





[jira] [Created] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7812:
-

 Summary: Async Clustering w/ row writer fails due to timetravel 
query validation 
 Key: HUDI-7812
 URL: https://issues.apache.org/jira/browse/HUDI-7812
 Project: Apache Hudi
  Issue Type: Bug
  Components: clustering
Reporter: sivabalan narayanan


With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 



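The guard that fails above can be sketched roughly as follows. This is a minimal Python illustration of the validation logic described in HUDI-7812, with hypothetical names (Hudi's actual check lives in Java, in `TimelineUtils.validateTimestampAsOf`): a time travel query's "as of" timestamp must be strictly earlier than the first incomplete commit on the timeline, which is exactly what a clustering instant reading as of its own time violates when an earlier ingestion commit is still pending.

```python
# Simplified sketch (hypothetical names, not Hudi's actual TimelineUtils code)
# of the time-travel guard: the "as of" timestamp must be earlier than the
# first incomplete commit on the timeline.
def validate_timestamp_as_of(query_ts: str, pending_commits: list) -> None:
    # Hudi instant times are fixed-width yyyyMMddHHmmssSSS strings, so
    # lexicographic order matches chronological order.
    if pending_commits:
        first_incomplete = min(pending_commits)
        if query_ts >= first_incomplete:
            raise ValueError(
                f"Time travel's timestamp '{query_ts}' must be earlier than "
                f"the first incomplete commit timestamp '{first_incomplete}'."
            )

# Mirrors the failure above: the clustering instant reads as of its own time
# while an earlier ingestion commit (20240406123834233) is still pending.
try:
    validate_timestamp_as_of("20240406123837295", ["20240406123834233"])
except ValueError as exc:
    print(f"rejected: {exc}")
```

Relaxing the constraint, as the ticket proposes, would mean allowing the clustering read path to tolerate pending ingestion commits with earlier timestamps instead of rejecting the query outright.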


[jira] [Created] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-29 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7811:
-

 Summary: Enhance SparkBaseIndexSupport.getPrunedFileNames to 
return partition path
 Key: HUDI-7811
 URL: https://issues.apache.org/jira/browse/HUDI-7811
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.0.0


It will help avoid calling FSUtils.getRelativePartitionPath - 
https://github.com/apache/hudi/pull/11043#discussion_r1611744651





[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread bradley (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bradley updated HUDI-7810:
--
Description: Fixed in PR: [https://github.com/apache/hudi/pull/11359]  
(was: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug)

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fixed in PR: [https://github.com/apache/hudi/pull/11359]





[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7810:
-
Labels: pull-request-available  (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug





[jira] [Created] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread bradley (Jira)
bradley created HUDI-7810:
-

 Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value 
bug
 Key: HUDI-7810
 URL: https://issues.apache.org/jira/browse/HUDI-7810
 Project: Apache Hudi
  Issue Type: Bug
Reporter: bradley


Fix OptionsResolver#allowCommitOnEmptyBatch default value bug





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7808:
-
Labels: pull-request-available  (was: )

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7809:
-
Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without 
> "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
> query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without 
> the Kryo registrar, the Hudi read throws an NPE due to 
> HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at 
> org.apa

[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With Hudi 0.14.1, without 
"spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsR

[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With Hudi 0.14.1, without 
"spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
Kryo registrar, the 
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRD

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency 
issues on rare occasions. We should come up with proper guards to ensure we do 
not perform such unintended archival. 

 

The major gap we want to guard against is:

If someone disabled the cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on the latest clean commit metadata. But there are a few 
other scenarios that need to be accounted for. 

 

a. Keeping aside replace commits, let's dive into the specifics for regular 
commits and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival 
configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
versions created at or before t6. Say the cleaner did not run (for whatever 
reason) for the next 5 commits. 

    Archival will certainly be guarded until the earliest commit to retain 
based on the latest clean commit. 

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still 
never re-enabled. Archival would have stopped at t3 while the savepoint was 
present, but once the savepoint is removed, if archival is executed, it could 
archive commit t3. That means the file versions tracked at t3 have still not 
been cleaned by the cleaner. 

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these 
older file versions might be exposed to the end user. But a time travel query 
is not intended for already-cleaned-up commits, so this is not an issue. None 
of snapshot, time travel, or incremental queries will run into issues, as they 
are not supposed to poll for t3. 

If the cleaner is re-enabled at any later point, it will take care of cleaning 
up the file versions tracked at commit t3. Just that, for the interim period, 
some older file versions might still be exposed to readers. 

 

b. The trickier part is when replace commits are involved. Since the replace 
commit metadata in the active timeline is what ensures the replaced file groups 
are ignored for reads, before archiving it, the cleaner is expected to clean 
them up fully. But are there chances that this could go wrong? 

Corner case to consider: let's add onto the above scenario, where t3 has a 
savepoint, and t4 is a replace commit which replaced file groups tracked in t3. 

The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to 
retain will point to t6. Now say the savepoint for t3 is removed, but the 
cleaner is disabled. In this state of the timeline, if archival is executed 
(since t3's savepoint is removed), it might archive t3 and t4.rc. This could 
lead to data duplicates, as both the replaced file groups and the new file 
groups from t4.rc would be exposed as valid file groups. 

 

In other words, to summarize the different scenarios: 

i. The replaced file group is never cleaned up. 
    - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
ii. The replaced file group is cleaned up. 
    - ECTR is > this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full 
clean-up did not happen. After the savepoint is removed and archival is 
executed, we should avoid archiving the rc of interest. This is the gap we do 
not account for as of now.

 

We have 3 options to solve this.

Option A: 

Let the savepoint-deletion flow take care of cleaning up the files the 
savepoint is tracking. 

Cons:

A savepoint's responsibility does not include removing data files, so from a 
single-responsibility standpoint this may not be right. Also, this clean-up 
might need to do what a clean planner actually does: build the file system 
view, determine whether the files were already supposed to be cleaned up, and 
only then delete the ones that qualify. For example, a file group with only one 
file slice should not be cleaned up, and there are more scenarios like this. 

 

Option B:

Since archival is what might cause the data consistency issues, why not have 
archival do the clean-up? 

We need to account for concurrent cleans, failure-and-retry scenarios, etc. 
Also, we might need to build the file system view and then decide whether 
something needs to be cleaned up before archiving it. 

Cons:

Again, the single-responsibility rule might be broken. It would be neat if the 
cleaner took care of deleting data files and archival only took care of 
deleting/archiving timeline files. 

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track another metadata field named "EarliestCommitToArchive". Strictly 
speaking, ear
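Option C can be sketched as a second watermark that archival must also respect. The following is a hypothetical Python illustration (names and structure are mine, not Hudi's actual API): the cleaner publishes an "earliest commit to archive" watermark alongside the usual earliest commit to retain, and archival only touches instants strictly before both, so a replace commit whose replaced files were skipped (e.g. due to a since-removed savepoint) stays on the active timeline.

```python
# Hypothetical sketch of Option C: besides the usual "earliest commit to
# retain" (ECTR), the cleaner publishes an "earliest commit to archive"
# watermark, and archival only touches instants strictly before both.
def commits_safe_to_archive(timeline,
                            earliest_commit_to_retain,
                            earliest_commit_to_archive):
    # Instant strings compare lexicographically in timeline order here.
    bound = min(earliest_commit_to_retain, earliest_commit_to_archive)
    return [t for t in timeline if t < bound]

# Scenario from the description: a savepoint on t3 made the cleaner skip its
# files, so it holds the archive watermark at t3 even though ECTR moved to t6.
# t3 and the replace commit t4 then stay on the active timeline.
timeline = ["t1", "t2", "t3", "t4", "t5", "t6"]
print(commits_safe_to_archive(timeline, "t6", "t3"))  # ['t1', 't2']
```

Once the cleaner actually cleans up the files tracked at t3 and t4, it would advance the archive watermark and archival could proceed past them.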

[jira] [Updated] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7655:
--
Fix Version/s: 1.0.0

> Support configuration for clean to fail execution if there is at least one 
> file is marked as a failed delete
> 
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: clean, pull-request-available
> Fix For: 1.0.0
>
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed 
> as deleted (or non-existing) will be marked as a "failed delete". Although 
> these failed deletes will be added to `.clean` metadata, if incremental clean 
> is used then these files might never be picked up again by a future clean 
> plan, unless a "full-scan" clean ends up being scheduled. In addition to 
> leaving more files unnecessarily taking up storage space for longer, this 
> can lead to the following dataset consistency issue for COW datasets:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some completed instants later an incremental clean is scheduled. It moves 
> the "earliest commit to retain" to a time after instant time RC2, so it 
> targets f1 for deletion. But during execution of the plan, it fails to delete 
> f1.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition directly 
> from the DFS file system calls (without directly using MDT FILES partition) 
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in 
> active timeline. This is a data consistency issue, and will only be resolved 
> if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to 
> fail execution of a clean plan unless all files are confirmed as deleted (or 
> not existing in DFS already), "blocking" the clean. The next clean attempt 
> will re-execute this existing plan, since clean plans cannot be "rolled 
> back". 
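The proposed configuration can be sketched as a strict-mode check at the end of clean execution. This is a minimal illustration of the idea only: the class, method, and flag names below are hypothetical, not Hudi's actual clean-action API.

```java
// Sketch of HUDI-7655's proposal: if a strict flag is set and any file targeted
// by the clean plan could not be confirmed as deleted, fail the clean so the
// same plan is re-executed on the next attempt (clean plans cannot be rolled
// back). All names here are illustrative.
import java.util.List;

public class CleanStrictMode {

    /** Returns true when the clean must be failed instead of committed. */
    public static boolean shouldFailClean(long failedDeletes, boolean failOnFailedDeletes) {
        return failOnFailedDeletes && failedDeletes > 0;
    }

    public static void executeClean(List<String> targets, List<String> failed, boolean strict) {
        if (shouldFailClean(failed.size(), strict)) {
            throw new IllegalStateException("Clean left " + failed.size() + " of "
                + targets.size() + " files undeleted; failing so the same plan is retried");
        }
    }

    public static void main(String[] args) {
        // Non-strict (current behavior): failed deletes are only recorded in .clean metadata.
        executeClean(List.of("f1", "f2"), List.of("f1"), false);
        // Strict (proposed): the same situation blocks the clean.
        try {
            executeClean(List.of("f1", "f2"), List.of("f1"), true);
            throw new AssertionError("expected the clean to fail");
        } catch (IllegalStateException expected) {
            System.out.println("clean failed as configured: " + expected.getMessage());
        }
    }
}
```

With such a flag enabled, step 4 of the scenario above would surface as a clean failure instead of silently dropping f1 from future plans.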



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With 0.14
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90

[jira] [Assigned] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7809:
---

Assignee: Ethan Guo

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7809:
---

 Summary: Use Spark SerializableConfiguration to avoid NPE in Kryo 
serde
 Key: HUDI-7809
 URL: https://issues.apache.org/jira/browse/HUDI-7809
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Fix Version/s: 0.15.0
   1.0.0

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Labels: hoodie-storage  (was: )

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5505) Compaction NUM_COMMITS policy should only judge completed deltacommit

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5505.

Fix Version/s: 1.0.0
Reviewers: Danny Chen
   Resolution: Fixed

Fixed via master branch: 42243862f0271fda16e70afdbfde61b47792ff70

> Compaction NUM_COMMITS policy should only judge completed deltacommit
> -
>
> Key: HUDI-5505
> URL: https://issues.apache.org/jira/browse/HUDI-5505
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, table-service
>Reporter: HunterXHunter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2023-01-05-13-10-57-918.png
>
>
> `compaction.delta_commits =1`
>  
> {code:java}
> 20230105115229301.deltacommit
> 20230105115229301.deltacommit.inflight
> 20230105115229301.deltacommit.requested
> 20230105115253118.commit
> 20230105115253118.compaction.inflight
> 20230105115253118.compaction.requested
> 20230105115330994.deltacommit.inflight
> 20230105115330994.deltacommit.requested{code}
> The return result of `ScheduleCompactionActionExecutor.needCompact` is 
> `true`, which is not expected.
>  
> And in OCC or lazy-clean mode, this will cause compaction to trigger 
> early.
> `compaction.delta_commits =3`
>  
> {code:java}
> 20230105125650541.deltacommit.inflight
> 20230105125650541.deltacommit.requested
> 20230105125715081.deltacommit
> 20230105125715081.deltacommit.inflight
> 20230105125715081.deltacommit.requested
> 20230105130018070.deltacommit.inflight
> 20230105130018070.deltacommit.requested {code}
>  
> And compaction will be triggered, which is not expected.
> !image-2023-01-05-13-10-57-918.png|width=699,height=158!
>  
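The fix described above amounts to counting only instants whose timeline file is a completed `.deltacommit` (no `.inflight`/`.requested` suffix) and newer than the last compaction. A hedged sketch of that check, using file-name parsing for illustration only (Hudi's real timeline API is different):

```java
// Sketch of the NUM_COMMITS trigger from HUDI-5505: count only *completed*
// deltacommits after the last completed compaction instant. Timeline instant
// timestamps sort lexicographically, so a plain string compare orders them.
import java.util.List;

public class CompactionTrigger {

    /** Completed deltacommits strictly after the given compaction instant time. */
    public static long completedDeltaCommitsSince(List<String> timelineFiles,
                                                  String lastCompactionInstant) {
        return timelineFiles.stream()
            .filter(f -> f.endsWith(".deltacommit"))          // completed only
            .filter(f -> f.compareTo(lastCompactionInstant) > 0)
            .count();
    }

    public static boolean needCompact(List<String> timelineFiles,
                                      String lastCompactionInstant,
                                      int deltaCommitsThreshold) {
        return completedDeltaCommitsSince(timelineFiles, lastCompactionInstant)
            >= deltaCommitsThreshold;
    }

    public static void main(String[] args) {
        // Timeline from the ticket's first example: the only completed
        // deltacommit predates the compaction at 20230105115253118, and the
        // later deltacommit is still pending.
        List<String> timeline = List.of(
            "20230105115229301.deltacommit",
            "20230105115229301.deltacommit.inflight",
            "20230105115229301.deltacommit.requested",
            "20230105115330994.deltacommit.inflight",
            "20230105115330994.deltacommit.requested");
        // With compaction.delta_commits = 1, no new compaction should be scheduled.
        System.out.println(needCompact(timeline, "20230105115253118", 1)); // false
    }
}
```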



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang reassigned HUDI-7806:
-

Assignee: Davis Zhang

> Skip fail on data-loss for first commit on Kafka Source
> ---
>
> Key: HUDI-7806
> URL: https://issues.apache.org/jira/browse/HUDI-7806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Davis Zhang
>Assignee: Davis Zhang
>Priority: Major
>
> When the ingestion attempts to start from the beginning of the topic, we 
> should not fail on data loss since topic retention can cause failures when 
> some data is removed before our ingestion is able to fully read the offsets.
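The requested behavior can be sketched as an offset-resolution rule: treat a retention-induced gap as data loss only when a previous commit actually checkpointed an offset. This is a minimal illustration under stated assumptions; the names are hypothetical and do not match Hudi's KafkaOffsetGen.

```java
// Sketch of HUDI-7806: on the first commit (no checkpoint yet), starting from
// the earliest *retained* offset is not data loss, even if retention already
// removed the true beginning of the topic. Data loss is flagged only when a
// checkpointed offset has been aged out.
public class KafkaOffsetPolicy {

    /**
     * @param checkpointOffset  offset recorded by the last commit, or -1 if none yet
     * @param earliestAvailable earliest offset Kafka still retains
     * @param failOnDataLoss    user configuration
     * @return offset to resume reading from
     */
    public static long resolveStartOffset(long checkpointOffset, long earliestAvailable,
                                          boolean failOnDataLoss) {
        boolean firstCommit = checkpointOffset < 0;
        boolean dataLost = !firstCommit && checkpointOffset < earliestAvailable;
        if (dataLost && failOnDataLoss) {
            throw new IllegalStateException("Offsets " + checkpointOffset + ".."
                + earliestAvailable + " were removed by topic retention");
        }
        // First commit, or fail-on-data-loss disabled: start from what is retained.
        return Math.max(checkpointOffset, earliestAvailable);
    }

    public static void main(String[] args) {
        // First ingestion from the beginning of the topic; retention has moved
        // the earliest offset to 500. Do not fail -- start at 500.
        System.out.println(resolveStartOffset(-1, 500, true)); // 500
    }
}
```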



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7808:
---

 Summary: Security upgrade io.acryl:datahub-client from 0.8.31 to 
0.8.45
 Key: HUDI-7808
 URL: https://issues.apache.org/jira/browse/HUDI-7808
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Fix Version/s: 1.0.0

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7808:
---

Assignee: Ethan Guo

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7807:
-
Labels: pull-request-available  (was: )

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNod

[jira] [Created] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7807:
-

 Summary: spark-sql updates for a pk less table fails w/ 
partitioned table 
 Key: HUDI-7807
 URL: https://issues.apache.org/jira/browse/HUDI-7807
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: sivabalan narayanan


quick start fails when trying to UPDATE with spark-sql for a pk less table. 

 
{code:java}
         > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare 
= 25.0 WHERE rider = 'rider-D']
org.apache.hudi.exception.HoodieException: Unable to instantiate class 
org.apache.hudi.keygen.SimpleKeyGenerator
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
at 
org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
at 
org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
at 
org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:

[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7807:
--
Fix Version/s: 0.15.0
   1.0.0

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.exe

[jira] [Assigned] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7807:
-

Assignee: sivabalan narayanan

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyE

[jira] [Created] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)
Davis Zhang created HUDI-7806:
-

 Summary: Skip fail on data-loss for first commit on Kafka Source
 Key: HUDI-7806
 URL: https://issues.apache.org/jira/browse/HUDI-7806
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Davis Zhang


When ingestion starts from the beginning of the topic, we should not fail on 
data loss: topic retention can delete records before ingestion has fully read 
the earliest offsets, which would otherwise trigger a spurious failure on the 
first commit.
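The proposed behavior can be reduced to a small decision rule. The sketch below is illustrative only; `DataLossPolicy` and `shouldFailOnDataLoss` are hypothetical names, not Hudi's actual Kafka source API. The idea: when there is no prior checkpoint (first commit), a retention-advanced earliest offset is not data loss, because the job intended to read from the beginning anyway; once a checkpoint exists, retention moving past it is real data loss.

```java
// Hypothetical sketch of the proposed fail-on-data-loss policy for a Kafka
// source; names are illustrative and not part of Hudi's actual API.
final class DataLossPolicy {
    /**
     * @param hasPriorCheckpoint whether a previous commit recorded Kafka offsets
     * @param checkpointOffset   offset recorded for a partition (ignored when no checkpoint)
     * @param earliestOffset     earliest offset currently retained by the broker
     * @return true if the source should abort with a data-loss error
     */
    static boolean shouldFailOnDataLoss(boolean hasPriorCheckpoint,
                                        long checkpointOffset,
                                        long earliestOffset) {
        if (!hasPriorCheckpoint) {
            // First commit: reading from the beginning was the intent, so a
            // retention-advanced earliest offset is not data loss for this job.
            return false;
        }
        // Subsequent commits: retention moved past our checkpoint -> real loss.
        return checkpointOffset < earliestOffset;
    }

    public static void main(String[] args) {
        System.out.println(shouldFailOnDataLoss(false, 0L, 500L));  // first commit: false
        System.out.println(shouldFailOnDataLoss(true, 100L, 500L)); // real loss: true
        System.out.println(shouldFailOnDataLoss(true, 600L, 500L)); // checkpoint still valid: false
    }
}
```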



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7805) FileSystemBasedLockProvider should auto-delete the lock file on lock conflict to avoid failing the next write

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7805:
-
Labels: pull-request-available  (was: )

> FileSystemBasedLockProvider should auto-delete the lock file on lock 
> conflict to avoid failing the next write
> --
>
> Key: HUDI-7805
> URL: https://issues.apache.org/jira/browse/HUDI-7805
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock 
> object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock
>   at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100)
>   at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58)
>   at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258)
>   at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301)
>   at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
>   at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396)
>   at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
>   at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSess

[jira] [Created] (HUDI-7805) FileSystemBasedLockProvider should auto-delete the lock file on lock conflict to avoid failing the next write

2024-05-28 Thread xy (Jira)
xy created HUDI-7805:


 Summary: FileSystemBasedLockProvider should auto-delete the lock file 
on lock conflict to avoid failing the next write
 Key: HUDI-7805
 URL: https://issues.apache.org/jira/browse/HUDI-7805
 Project: Apache Hudi
  Issue Type: Improvement
  Components: multi-writer
Reporter: xy
Assignee: xy


org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock 
object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock
  at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100)
  at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58)
  at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258)
  at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301)
  at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
  at com.vivo.bigdata.etl.process.EtlProcessMain$.main(EtlProcessMain.scala:367)
  at com.vivo.bigdata.etl.process.EtlProcessMain.main(EtlProcessMain.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
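One plausible shape for the requested behavior is a stale-lock reclaim: if acquiring the lock fails because a lock file already exists and that file is older than an expiry threshold, treat it as leftover from a crashed writer, delete it, and retry. The sketch below uses plain `java.nio.file` and is not Hudi's actual `FileSystemBasedLockProvider`; the class name, expiry policy, and threshold are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of an auto-cleaning filesystem lock (not Hudi's actual
// FileSystemBasedLockProvider): reclaim the lock file only when it looks stale.
final class ExpiringFileLock {
    private final Path lockFile;
    private final long expiryMillis;

    ExpiringFileLock(Path lockFile, long expiryMillis) {
        this.lockFile = lockFile;
        this.expiryMillis = expiryMillis;
    }

    boolean tryLock() throws IOException {
        if (createLockFile()) {
            return true;
        }
        // Lock file already exists; delete and retry only if it is stale.
        long ageMillis = System.currentTimeMillis()
                - Files.getLastModifiedTime(lockFile).toMillis();
        if (ageMillis > expiryMillis) {
            Files.deleteIfExists(lockFile); // stale lock from a failed writer
            return createLockFile();        // retry once after cleanup
        }
        return false; // lock is fresh, assume it is held by a live writer
    }

    void unlock() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    private boolean createLockFile() throws IOException {
        try {
            Files.createFile(lockFile); // atomic create-if-absent
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("hudi-lock-demo");
        ExpiringFileLock lock =
                new ExpiringFileLock(dir.resolve("lock"), TimeUnit.MINUTES.toMillis(5));
        System.out.println(lock.tryLock()); // true: no existing lock file
        System.out.println(lock.tryLock()); // false: fresh lock still held
        lock.unlock();
    }
}
```

The expiry check matters: deleting the lock file unconditionally on conflict would let two live writers proceed concurrently, so any auto-cleanup has to distinguish a crashed writer's leftover file from a lock that is genuinely held.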

[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7804:
-
Sprint: Sprint 2023-04-26

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Assigned] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-7804:


Assignee: Danny Chen

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7804:
-
Labels: pull-request-available  (was: )

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread xi chaomin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xi chaomin updated HUDI-7804:
-
Description: https://github.com/apache/hudi/issues/11288

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Priority: Major
>
> https://github.com/apache/hudi/issues/11288





[jira] [Created] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread xi chaomin (Jira)
xi chaomin created HUDI-7804:


 Summary: Improve flink bucket index partitioner
 Key: HUDI-7804
 URL: https://issues.apache.org/jira/browse/HUDI-7804
 Project: Apache Hudi
  Issue Type: Bug
Reporter: xi chaomin








[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Status: Patch Available  (was: In Progress)

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Status: In Progress  (was: Open)

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7507:
-
Labels: pull-request-available  (was: )

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing" with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than latest completed commit but were created after).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline that are greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> Specifically, the following steps should be followed whenever any instant 
> (commit, table service, etc) is scheduled
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether it was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)
>  # Release table lock
> Unlike (A), thi
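The core of Approach B can be sketched as a guarded schedule step: under the table lock, reject the requested instant C if any instant on the timeline is greater than C. The class and method names below are hypothetical (not Hudi's timeline API), and the timeline is modeled as an in-memory sorted set for illustration.

```java
import java.util.TreeSet;

// Illustrative sketch of Approach B; TimelineGuard/scheduleInstant are
// hypothetical names, and the "timeline" is just an in-memory sorted set.
final class TimelineGuard {
    private final TreeSet<String> instantTimes = new TreeSet<>();
    private final Object tableLock = new Object();

    /** Creates the requested instant, or throws if a later instant already exists. */
    void scheduleInstant(String instantTime) {
        synchronized (tableLock) { // step 1: acquire the table lock
            String latest = instantTimes.isEmpty() ? null : instantTimes.last();
            if (latest != null && latest.compareTo(instantTime) > 0) {
                // step 2: a greater instant is on the timeline -> retryable conflict
                throw new IllegalStateException(
                        "Conflict: instant " + latest + " is newer than " + instantTime);
            }
            instantTimes.add(instantTime); // step 3: create the .requested "file"
        } // step 4: release the table lock
    }

    public static void main(String[] args) {
        TimelineGuard timeline = new TimelineGuard();
        timeline.scheduleInstant("20240528100000"); // ok: timeline is empty
        timeline.scheduleInstant("20240528100002"); // ok: newer than latest
        try {
            timeline.scheduleInstant("20240528100001"); // older timestamp -> rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This keeps plan computation outside the lock (unlike Approach A) at the cost of making the out-of-order writer abort and retry with a fresh timestamp.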

[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Fix Version/s: 1.0.0

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

