[jira] [Closed] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-7007.
Resolution: Done

> Integrate functional index using bloom filter on reader side
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
> Currently, one can create a functional index on a column using bloom filters.
> However, only the one created using column stats is supported on the reader
> side (check `FunctionalIndexSupport`). This ticket tracks the support for
> using bloom filters on a functional index in the reader path.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
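The reader-side idea above can be sketched independently of Hudi's actual `FunctionalIndexSupport` class. All names below (`BloomPrune`, the bit-set filter, the file names) are illustrative stand-ins, not Hudi APIs: a per-file bloom filter can say "definitely absent" (safe to prune the file) or "maybe present" (the file must be scanned), and never produces false negatives.

```java
import java.util.*;

// Minimal bloom-filter sketch (not Hudi's BloomFilter): k = 3 hash probes
// into a fixed-size bit set. mightContain() may return false positives but
// never false negatives, which is what makes file pruning safe on reads.
public class BloomPrune {
    static final int BITS = 1 << 16;
    static final int PROBES = 3;

    static int probe(String key, int seed) {
        // Derive k probe positions from the key's hash; floorMod keeps it non-negative.
        return Math.floorMod(key.hashCode() * 31 + seed * 0x9E3779B9, BITS);
    }

    static BitSet build(Collection<String> keys) {
        BitSet bits = new BitSet(BITS);
        for (String k : keys) {
            for (int seed = 0; seed < PROBES; seed++) {
                bits.set(probe(k, seed));
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int seed = 0; seed < PROBES; seed++) {
            if (!bits.get(probe(key, seed))) {
                return false; // definitely absent: this file can be pruned
            }
        }
        return true; // maybe present: this file must be scanned
    }

    // Keep only files whose filter says the looked-up value may be present.
    static List<String> pruneFiles(Map<String, BitSet> fileToFilter, String value) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : fileToFilter.entrySet()) {
            if (mightContain(e.getValue(), value)) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }
}
```

In the real reader path the filter bytes would be loaded from the functional index in the metadata table rather than built in memory; the pruning step itself is the same shape.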
[jira] [Updated] (HUDI-7825) Support Report pending clustering and compaction plan metric
[ https://issues.apache.org/jira/browse/HUDI-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7825:
Labels: pull-request-available (was: )

> Support Report pending clustering and compaction plan metric
>
> Key: HUDI-7825
> URL: https://issues.apache.org/jira/browse/HUDI-7825
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: jack Lei
> Priority: Major
> Labels: pull-request-available
>
> 1. When only async clustering or async compaction scheduling is enabled, and
> clustering.async.enabled or compaction.async.enabled is set to false, the
> Flink job will not add clusterPlanOperator or CompactionPlanOperator.
> 2. But the pending plan metric is emitted in clusterPlanOperator or
> CompactionPlanOperator.
> 3. So we could support emitting the pending plan metric in
> StreamWriteOperatorCoordinator instead.
[jira] [Created] (HUDI-7825) Support Report pending clustering and compaction plan metric
jack Lei created HUDI-7825:

Summary: Support Report pending clustering and compaction plan metric
Key: HUDI-7825
URL: https://issues.apache.org/jira/browse/HUDI-7825
Project: Apache Hudi
Issue Type: Bug
Reporter: jack Lei

1. When only async clustering or async compaction scheduling is enabled, and clustering.async.enabled or compaction.async.enabled is set to false, the Flink job will not add clusterPlanOperator or CompactionPlanOperator.
2. But the pending plan metric is emitted in clusterPlanOperator or CompactionPlanOperator.
3. So we could support emitting the pending plan metric in StreamWriteOperatorCoordinator instead.
[jira] [Assigned] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
[ https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7824:
Assignee: sivabalan narayanan

> Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking cleanup of a commit
> and the cleaner has moved ahead w.r.t. the earliest commit to retain, then
> when the savepoint is removed later, the cleaner should account for cleaning
> up the commit of interest.
>
> Let's ensure the clean planner accounts for all partitions when such a
> savepoint removal is detected.
[jira] [Updated] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
[ https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7824:
Labels: pull-request-available (was: )

> Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking cleanup of a commit
> and the cleaner has moved ahead w.r.t. the earliest commit to retain, then
> when the savepoint is removed later, the cleaner should account for cleaning
> up the commit of interest.
>
> Let's ensure the clean planner accounts for all partitions when such a
> savepoint removal is detected.
[jira] [Created] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
sivabalan narayanan created HUDI-7824:

Summary: Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner
Key: HUDI-7824
URL: https://issues.apache.org/jira/browse/HUDI-7824
Project: Apache Hudi
Issue Type: Bug
Components: cleaning
Reporter: sivabalan narayanan

With the incremental cleaner, if a savepoint is blocking cleanup of a commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, then when the savepoint is removed later, the cleaner should account for cleaning up the commit of interest.

Let's ensure the clean planner accounts for all partitions when such a savepoint removal is detected.
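The guard this ticket asks for can be sketched roughly as follows. The class and method names are hypothetical, not Hudi's actual `CleanPlanner` API: incremental cleaning normally plans only the partitions touched since the last clean, but a savepoint that existed at the last clean and has since been removed means previously protected file versions may be sitting in any partition, so the plan should widen to a full partition listing.

```java
import java.util.*;

// Hypothetical sketch of the incremental-clean guard (not Hudi's API):
// compare the savepoints recorded at the last clean with the savepoints
// present now; if any savepoint was removed in between, fall back from the
// incremental partition set to all partitions.
public class IncrCleanGuard {
    static List<String> partitionsToClean(Set<String> savepointsAtLastClean,
                                          Set<String> savepointsNow,
                                          List<String> incrementalPartitions,
                                          List<String> allPartitions) {
        Set<String> removed = new HashSet<>(savepointsAtLastClean);
        removed.removeAll(savepointsNow);
        // A removed savepoint means commits it was protecting are now
        // cleanable, and their files may live outside the incremental set.
        return removed.isEmpty() ? incrementalPartitions : allPartitions;
    }
}
```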
[jira] [Updated] (HUDI-7823) Simplify dependency management on exclusions
[ https://issues.apache.org/jira/browse/HUDI-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7823:
Labels: pull-request-available (was: )

> Simplify dependency management on exclusions
>
> Key: HUDI-7823
> URL: https://issues.apache.org/jira/browse/HUDI-7823
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (HUDI-7823) Simplify dependency management on exclusions
Ethan Guo created HUDI-7823:

Summary: Simplify dependency management on exclusions
Key: HUDI-7823
URL: https://issues.apache.org/jira/browse/HUDI-7823
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ethan Guo
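One common Maven pattern for the kind of simplification this ticket describes is to declare each exclusion once in the parent's `dependencyManagement`, so child modules inherit it instead of repeating the exclusion block. The coordinates below are purely illustrative, not taken from Hudi's actual poms:

```xml
<!-- Illustrative sketch: centralize a version and its exclusions in the
     parent pom so child modules can depend on the artifact without
     re-declaring the exclusion. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
      <exclusions>
        <exclusion>
          <!-- Wildcard exclusion drops every artifact under this group. -->
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Child modules then declare only `groupId`/`artifactId`; the managed version and exclusions apply automatically.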
[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests
[ https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7822:
Labels: pull-request-available (was: )

> Resolve the conflicts between mixed hdfs and local path in Flink tests
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Commented] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests
[ https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851205#comment-17851205 ]

Ethan Guo commented on HUDI-7822:

https://github.com/apache/hudi/pull/10931

> Resolve the conflicts between mixed hdfs and local path in Flink tests
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
[jira] [Created] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests
Ethan Guo created HUDI-7822:

Summary: Resolve the conflicts between mixed hdfs and local path in Flink tests
Key: HUDI-7822
URL: https://issues.apache.org/jira/browse/HUDI-7822
Project: Apache Hudi
Issue Type: Bug
Reporter: Ethan Guo
[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests
[ https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7822:
Fix Version/s: 1.0.0

> Resolve the conflicts between mixed hdfs and local path in Flink tests
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
[jira] [Updated] (HUDI-7821) Handle schema evolution in proto to avro conversion
[ https://issues.apache.org/jira/browse/HUDI-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7821:
Labels: pull-request-available (was: )

> Handle schema evolution in proto to avro conversion
>
> Key: HUDI-7821
> URL: https://issues.apache.org/jira/browse/HUDI-7821
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Timothy Brown
> Priority: Major
> Labels: pull-request-available
>
> Users can encounter errors when a batch of data was written with an older
> schema and a new schema has fields that are not present in the old data.
[jira] [Created] (HUDI-7821) Handle schema evolution in proto to avro conversion
Timothy Brown created HUDI-7821:

Summary: Handle schema evolution in proto to avro conversion
Key: HUDI-7821
URL: https://issues.apache.org/jira/browse/HUDI-7821
Project: Apache Hudi
Issue Type: Bug
Reporter: Timothy Brown

Users can encounter errors when a batch of data was written with an older schema and a new schema has fields that are not present in the old data.
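The failure mode described above, and the usual fix, can be sketched with plain maps standing in for real proto and Avro records (the `SchemaEvolve` class and its shapes are illustrative, not the ticket's actual patch): when a record written with the old schema lacks a field that exists in the new schema, the conversion should fall back to that field's schema default instead of erroring.

```java
import java.util.*;

// Illustrative sketch of schema-evolution-tolerant conversion: the new
// schema is represented as an ordered map of field name -> default value.
// Fields present in the old record keep their written value; fields added
// by the new schema are filled from the defaults rather than failing.
public class SchemaEvolve {
    static Map<String, Object> convert(Map<String, Object> oldRecord,
                                       Map<String, Object> newSchemaDefaults) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : newSchemaDefaults.entrySet()) {
            // Written value when present, otherwise the schema default.
            out.put(field.getKey(),
                    oldRecord.getOrDefault(field.getKey(), field.getValue()));
        }
        return out;
    }
}
```

With real Avro this corresponds to resolving the writer schema against the reader schema so that added fields take their declared defaults.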
[jira] [Closed] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
[ https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-7811.
Resolution: Fixed

Fixed in the original PR itself - https://github.com/apache/hudi/pull/11043#discussion_r1621825753

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Fix For: 1.0.0
>
> It will help avoid calling FSUtils.getRelativePartitionPath -
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651
[jira] [Assigned] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
[ https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-7811:
Assignee: Sagar Sumit

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Fix For: 1.0.0
>
> It will help avoid calling FSUtils.getRelativePartitionPath -
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651
[jira] [Created] (HUDI-7820) For bloom index reader path, prune based on min/max if colstats is enabled
Sagar Sumit created HUDI-7820:

Summary: For bloom index reader path, prune based on min/max if colstats is enabled
Key: HUDI-7820
URL: https://issues.apache.org/jira/browse/HUDI-7820
Project: Apache Hudi
Issue Type: Improvement
Reporter: Sagar Sumit
Fix For: 1.1.0, 1.0.0

Bloom filters can result in false positives. We can try to prune files based on min/max if colstats is available for the field.

https://github.com/apache/hudi/pull/11043#discussion_r1621639791
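The proposed ordering can be sketched as follows. `FileStats` and the in-memory key set standing in for a bloom probe are hypothetical, not Hudi's column-stats or bloom-filter classes: a min/max range check never wrongly keeps a file outside the key's range, so applying it first leaves the (false-positive-prone) bloom probe to run only on files whose range actually contains the key.

```java
import java.util.*;

// Illustrative sketch: prune by column-stats min/max first, then consult
// the bloom filter only for files whose [min, max] range contains the key.
public class RangeThenBloom {
    static class FileStats {
        final String file;
        final long min;
        final long max;
        final Set<Long> bloomKeys; // stand-in for a real bloom filter probe

        FileStats(String file, long min, long max, Set<Long> bloomKeys) {
            this.file = file;
            this.min = min;
            this.max = max;
            this.bloomKeys = bloomKeys;
        }
    }

    static List<String> candidateFiles(List<FileStats> files, long key) {
        List<String> out = new ArrayList<>();
        for (FileStats f : files) {
            if (key < f.min || key > f.max) {
                continue; // outside [min, max]: definitely not in this file
            }
            // A real bloom probe may still return false positives here,
            // but only for files that survived the exact range check.
            if (f.bloomKeys.contains(key)) {
                out.add(f.file);
            }
        }
        return out;
    }
}
```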
[jira] [Updated] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7819:
Labels: pull-request-available (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
>
> Key: HUDI-7819
> URL: https://issues.apache.org/jira/browse/HUDI-7819
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: bradley
> Priority: Major
> Labels: pull-request-available
[jira] [Closed] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

bradley closed HUDI-7810.
Resolution: Later

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: bradley
> Priority: Major
> Labels: pull-request-available
>
> Fixed in PR: [https://github.com/apache/hudi/pull/11359]
[jira] [Created] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
bradley created HUDI-7819:

Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
Key: HUDI-7819
URL: https://issues.apache.org/jira/browse/HUDI-7819
Project: Apache Hudi
Issue Type: Bug
Reporter: bradley
[jira] [Updated] (HUDI-7818) Flink Table planner not loading problem
[ https://issues.apache.org/jira/browse/HUDI-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7818:
Sprint: Sprint 2023-04-26

> Flink Table planner not loading problem
>
> Key: HUDI-7818
> URL: https://issues.apache.org/jira/browse/HUDI-7818
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Fix For: 1.0.0
[jira] [Created] (HUDI-7818) Flink Table planner not loading problem
Danny Chen created HUDI-7818:

Summary: Flink Table planner not loading problem
Key: HUDI-7818
URL: https://issues.apache.org/jira/browse/HUDI-7818
Project: Apache Hudi
Issue Type: Improvement
Components: writer-core
Reporter: Danny Chen
Assignee: Danny Chen
Fix For: 1.0.0
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7817:
Labels: pull-request-available (was: )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> org.codehaus.jackson is an older version of Jackson Core
> (com.fasterxml.jackson.core:jackson-core).
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which
> should be avoided.
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7817:
Description: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoided. (was: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoid.)

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
>
> org.codehaus.jackson is an older version of Jackson Core
> (com.fasterxml.jackson.core:jackson-core).
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which
> should be avoided.
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7817:
Description: org.codehaus.jackson is a older version of Jackson Core (com.fasterxml.jackson.core:jackson-core). org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which should be avoid. (was: org.codehaus.jackson is a older version of )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
>
> org.codehaus.jackson is an older version of Jackson Core
> (com.fasterxml.jackson.core:jackson-core).
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which
> should be avoided.
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7817:
Description: org.codehaus.jackson is a older version of

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
>
> org.codehaus.jackson is a older version of
[jira] [Assigned] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-7817:
Assignee: Ethan Guo

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
[ https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7817:
Fix Version/s: 1.0.0

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
[jira] [Created] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding
Ethan Guo created HUDI-7817:

Summary: Use Jackson Core instead of org.codehaus.jackson for JSON encoding
Key: HUDI-7817
URL: https://issues.apache.org/jira/browse/HUDI-7817
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ethan Guo
[jira] [Updated] (HUDI-7816) Pass the source profile to the snapshot query splitter
[ https://issues.apache.org/jira/browse/HUDI-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7816:
Labels: pull-request-available (was: )

> Pass the source profile to the snapshot query splitter
>
> Key: HUDI-7816
> URL: https://issues.apache.org/jira/browse/HUDI-7816
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (HUDI-7816) Pass the source profile to the snapshot query splitter
Rajesh Mahindra created HUDI-7816:

Summary: Pass the source profile to the snapshot query splitter
Key: HUDI-7816
URL: https://issues.apache.org/jira/browse/HUDI-7816
Project: Apache Hudi
Issue Type: Improvement
Reporter: Rajesh Mahindra
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-7779:
Description:

Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not make such unintended archival. The major gap we want to guard against: if someone disabled the cleaner, archival should account for data consistency issues and bail out.

We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.

a. Keeping aside replace commits, let's dive into specifics for regular commits and delta commits.

Say the user configured clean commits to 4 and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded until the earliest commit to retain based on the latest clean commits.

Corner case to consider: a savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have been stopped at t3 (while the savepoint was present), once the savepoint is removed, if archival is executed, it could archive commit t3. That means the file versions tracked at t3 have still not been cleaned by the cleaner.

Reasoning: we are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned up commits, so this is not an issue. None of snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at commit t3. Just that, for the interim period, some older file versions might still be exposed to readers.

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean them up fully. But are there chances this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3. The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5, and t6. So the earliest commit to retain will be pointing to t6. Now say the savepoint for t3 is removed, but the cleaner is disabled. In this state of the timeline, if archival is executed (since t3.savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

In other words, to summarize the different scenarios:
i. The replaced file group is never cleaned up: ECTR (earliest commit to retain) is less than this.rc, and we are good.
ii. The replaced file group is cleaned up: ECTR is > this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full cleanup did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we do not account for as of now.

We have 3 options to solve this.

Option A: Let the savepoint deletion flow take care of cleaning up the files it is tracking.
Cons: a savepoint's responsibility is not removing data files, so from a single-responsibility standpoint this may not be right. Also, this cleanup might need to do what a clean planner actually does, i.e. build the file system view, understand whether something is already supposed to be cleaned up, and only then clean up the files that qualify. For example, a file group with only one file slice should not be cleaned up, and there are more scenarios like that.

Option B: Since archival is what might cause the data consistency issues, why not have archival do the cleanup? We would need to account for concurrent cleans, failure and retry scenarios, etc. We might also need to build the file system view and then decide whether something needs to be cleaned up before archiving it.
Cons: again, the single-responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files.

Option C: Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another piece of metadata named "EarliestCommitToArchive". Strictly speaking, ear
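Option C above can be sketched roughly as follows. The timeline model and method names are hypothetical, not Hudi's `HoodieTimeline` API: the cleaner publishes an "earliest commit to archive" alongside the earliest commit to retain, and archival never crosses either bound, so a replace commit whose replaced file groups were skipped because of a savepoint cannot be archived until the cleaner has actually processed it.

```java
import java.util.*;

// Illustrative sketch of Option C: archival only archives commits strictly
// before BOTH the earliest commit to retain (ECTR) and the cleaner-published
// earliest commit to archive. Instants are compared lexicographically, as
// with timestamp-ordered commit times.
public class ArchivalGuard {
    static List<String> commitsToArchive(List<String> timeline,
                                         String earliestToRetain,
                                         String earliestToArchive) {
        String bound = earliestToArchive.compareTo(earliestToRetain) < 0
                ? earliestToArchive
                : earliestToRetain;
        List<String> out = new ArrayList<>();
        for (String commit : timeline) {
            if (commit.compareTo(bound) < 0) {
                out.add(commit); // strictly before both bounds: safe to archive
            }
        }
        return out;
    }
}
```

In the t3-savepoint scenario above, even with ECTR at t6, a cleaner-published earliest-commit-to-archive stuck at t3 keeps t3 and t4.rc on the active timeline until cleaning catches up.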
[jira] [Closed] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs
[ https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-7407.
Resolution: Fixed

> Add optional clean support to standalone compaction and clustering jobs
>
> Key: HUDI-7407
> URL: https://issues.apache.org/jira/browse/HUDI-7407
> Project: Apache Hudi
> Issue Type: Improvement
> Components: table-service
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Let's add a top-level config to the standalone compaction and clustering jobs
> to optionally clean.
[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
[ https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7815:
Labels: pull-request-available (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
>
> Key: HUDI-7815
> URL: https://issues.apache.org/jira/browse/HUDI-7815
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: xy
> Assignee: xy
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
xy created HUDI-7815:

Summary: Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline
Key: HUDI-7815
URL: https://issues.apache.org/jira/browse/HUDI-7815
Project: Apache Hudi
Issue Type: Improvement
Components: spark-sql
Reporter: xy
Assignee: xy
[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-7807:
Sprint: Sprint 2023-04-26

> spark-sql updates for a pk less table fails w/ partitioned table
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Quick start fails when trying to UPDATE with spark-sql for a pk-less table.
>
> {code:java}
> UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.
[jira] [Updated] (HUDI-7791) Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle
[ https://issues.apache.org/jira/browse/HUDI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7791: Sprint: Sprint 2023-04-26 > Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle > --- > > Key: HUDI-7791 > URL: https://issues.apache.org/jira/browse/HUDI-7791 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7796) Gracefully cast file system instance in Avro writers
[ https://issues.apache.org/jira/browse/HUDI-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7796: Sprint: Sprint 2023-04-26 > Gracefully cast file system instance in Avro writers > > > Key: HUDI-7796 > URL: https://issues.apache.org/jira/browse/HUDI-7796 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > When running tests in Trino with Hudi MDT enabled, the following line in > HoodieAvroHFileWriter throws class cast exception, since Trino uses > dependency injection to provide the Hadoop file system instance, which may > skip the Hudi wrapper file system logic. > {code:java} > this.fs = (HoodieWrapperFileSystem) this.file.getFileSystem(conf); {code} > {code:java} > Caused by: java.lang.ClassCastException: class > io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper cannot be cast to class > org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem > (io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper and > org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem are in unnamed module of > loader 'app') > at > org.apache.hudi.io.hadoop.HoodieAvroHFileWriter.(HoodieAvroHFileWriter.java:91) > at > org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory.newHFileFileWriter(HoodieAvroFileWriterFactory.java:108) > at > org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriterByFormat(HoodieFileWriterFactory.java:70) > at > org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:53) > at > org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:108) > at > org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:77) > at > org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:45) > at > org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:101) > at > 
org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:44) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
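The "graceful cast" the HUDI-7796 summary asks for amounts to guarding the hard cast with an instanceof check, so an injected file system that bypasses Hudi's wrapping does not blow up with ClassCastException. A minimal sketch, using hypothetical stand-in classes (not the real Hadoop/Hudi types):

```java
// Hypothetical, simplified stand-ins for the file system types named in the
// ticket; this is an illustration of the pattern, not the actual Hudi code.
class FileSystem {}

class HoodieWrapperFileSystem extends FileSystem {}

class AvroWriterSketch {
    final FileSystem fs;
    final boolean wrapped;

    AvroWriterSketch(FileSystem rawFs) {
        // Guard the cast instead of assuming the wrapper type: dependency-injected
        // file systems (e.g. Trino's cache wrapper) may bypass Hudi's wrapping.
        if (rawFs instanceof HoodieWrapperFileSystem) {
            this.fs = rawFs;
            this.wrapped = true;
        } else {
            this.fs = rawFs; // keep the raw instance instead of throwing ClassCastException
            this.wrapped = false;
        }
    }
}
```

With this guard, `new AvroWriterSketch(someInjectedFs)` degrades to the raw instance rather than failing at construction time.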
[jira] [Updated] (HUDI-7801) Directly pass down HoodieStorage instance instead of recreation
[ https://issues.apache.org/jira/browse/HUDI-7801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7801: Sprint: Sprint 2023-04-26 > Directly pass down HoodieStorage instance instead of recreation > --- > > Key: HUDI-7801 > URL: https://issues.apache.org/jira/browse/HUDI-7801 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > There are places that use HoodieStorage#newInstance to recreate HoodieStorage > instance which may not be necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7808: Sprint: Sprint 2023-04-26 > Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 > -- > > Key: HUDI-7808 > URL: https://issues.apache.org/jira/browse/HUDI-7808 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7769) Fix Hudi CDC read with legacy parquet file format on Spark
[ https://issues.apache.org/jira/browse/HUDI-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7769: Sprint: Sprint 2023-04-26 > Fix Hudi CDC read with legacy parquet file format on Spark > -- > > Key: HUDI-7769 > URL: https://issues.apache.org/jira/browse/HUDI-7769 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Sprint: Sprint 2023-04-26 > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > With Hudi 0.14.1, without > "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi > query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the > Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration. > {code:java} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:806) > at org.apache.spark.sql.Dataset.show(Dataset.scala:765) > at org.apache.spark.sql.Dataset.show(Dataset.scala:774) > ... 
47 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apache.spark.sql.execution.datasource
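The fix HUDI-7809 points at, Spark's SerializableConfiguration, is a wrapper that serializes the wrapped Hadoop configuration explicitly, so a deserialized copy never comes back with a null field. A simplified sketch of that pattern, using a hypothetical stand-in for Hadoop's (non-Serializable) Configuration class:

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: carries state but is not
// java.io.Serializable, like the real class.
class FakeHadoopConf {
    final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

// Sketch of the SerializableConfiguration pattern: a Serializable wrapper whose
// custom readObject rebuilds the wrapped value, so the deserialized copy never
// carries a null configuration (the NPE mode described in the ticket).
class SerializableConfSketch implements Serializable {
    transient FakeHadoopConf value;

    SerializableConfSketch(FakeHadoopConf value) { this.value = value; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeObject(new HashMap<>(value.props)); // persist the wrapped state explicitly
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = new FakeHadoopConf(); // rebuild instead of leaving the transient field null
        value.props.putAll((Map<String, String>) in.readObject());
    }
}
```

The design point: the transient field plus custom write/read hooks make the round trip explicit, instead of relying on a Kryo registrar being configured on every deployment.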
[jira] [Updated] (HUDI-7790) Revert changes in DFSPathSelector and UtilHelpers.readConfig
[ https://issues.apache.org/jira/browse/HUDI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7790: Sprint: Sprint 2023-04-26 > Revert changes in DFSPathSelector and UtilHelpers.readConfig > > > Key: HUDI-7790 > URL: https://issues.apache.org/jira/browse/HUDI-7790 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > This is to avoid behavior changes in DFSPathSelector and keep the > UtilHelpers.readConfig API the same as before. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7792) Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver
[ https://issues.apache.org/jira/browse/HUDI-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7792: Sprint: Sprint 2023-04-26 > Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver > - > > Key: HUDI-7792 > URL: https://issues.apache.org/jira/browse/HUDI-7792 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark
[ https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7784: Sprint: Sprint 2023-04-26 > Fix serde of HoodieHadoopConfiguration in Spark > --- > > Key: HUDI-7784 > URL: https://issues.apache.org/jira/browse/HUDI-7784 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7785) Keep public APIs in utilities module the same as before HoodieStorage abstraction
[ https://issues.apache.org/jira/browse/HUDI-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7785: Sprint: Sprint 2023-04-26 > Keep public APIs in utilities module the same as before HoodieStorage > abstraction > - > > Key: HUDI-7785 > URL: https://issues.apache.org/jira/browse/HUDI-7785 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > BaseErrorTableWriter, HoodieStreamer, StreamSync, etc., are public API > classes and contain public API methods, which should be kept the same as > before. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7794) Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4
[ https://issues.apache.org/jira/browse/HUDI-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7794: Sprint: Sprint 2023-04-26 > Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4 > - > > Key: HUDI-7794 > URL: https://issues.apache.org/jira/browse/HUDI-7794 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7798) Mark configs included in 0.15.0 release
[ https://issues.apache.org/jira/browse/HUDI-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7798: Sprint: Sprint 2023-04-26 > Mark configs included in 0.15.0 release > --- > > Key: HUDI-7798 > URL: https://issues.apache.org/jira/browse/HUDI-7798 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > We need to mark the configs that go out in 0.15.0 release with > `.sinceVersion("0.15.0")`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
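The `.sinceVersion("0.15.0")` call mentioned in HUDI-7798 attaches release metadata to a config definition via a fluent builder. A minimal sketch of that builder pattern, with a simplified class and a hypothetical config key (not Hudi's actual ConfigProperty implementation):

```java
// Simplified sketch of a config-definition builder carrying since-version
// metadata; the key used below is hypothetical.
class ConfigPropertySketch {
    final String key;
    final String defaultValue;
    String sinceVersion; // release in which the config first shipped

    ConfigPropertySketch(String key, String defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    ConfigPropertySketch sinceVersion(String version) {
        this.sinceVersion = version;
        return this;
    }
}
```

Usage: `new ConfigPropertySketch("hoodie.example.new.config", "false").sinceVersion("0.15.0")`, which is the shape of annotation the ticket asks to apply to every config going out in 0.15.0.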
[jira] [Updated] (HUDI-7802) Fix bundle validation scripts
[ https://issues.apache.org/jira/browse/HUDI-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7802: Sprint: Sprint 2023-04-26 > Fix bundle validation scripts > - > > Key: HUDI-7802 > URL: https://issues.apache.org/jira/browse/HUDI-7802 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Issues: > * Bundle validation with packaging/bundle-validation/ci_run.sh fails for > release-0.15.0 branch due to script issue > * scripts/release/validate_staged_bundles.sh needs to include additional > bundles. > * Add release candidate validation on scala 2.13 bundles. > * Disable release candidate validation by default. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities
[ https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7814: Sprint: Sprint 2023-04-26 > Exclude unused transitive dependencies that introduce vulnerabilities > - > > Key: HUDI-7814 > URL: https://issues.apache.org/jira/browse/HUDI-7814 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.16.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7786) Fix roaring bitmap dependency in hudi-integ-test-bundle
[ https://issues.apache.org/jira/browse/HUDI-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7786: Sprint: Sprint 2023-04-26 > Fix roaring bitmap dependency in hudi-integ-test-bundle > --- > > Key: HUDI-7786 > URL: https://issues.apache.org/jira/browse/HUDI-7786 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7788) Fixing exception handling in AverageRecordSizeUtils
[ https://issues.apache.org/jira/browse/HUDI-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7788: Sprint: Sprint 2023-04-26 > Fixing exception handling in AverageRecordSizeUtils > --- > > Key: HUDI-7788 > URL: https://issues.apache.org/jira/browse/HUDI-7788 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > We should catch Throwable to avoid any issue during record size estimation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
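The defensive pattern HUDI-7788 describes, catching Throwable so record-size estimation can never fail the write path, can be sketched as below. The method name and signature are hypothetical simplifications, not the actual AverageRecordSizeUtils code:

```java
// Sketch: any failure during size estimation (including Errors) falls back to
// a caller-supplied default instead of propagating.
class RecordSizeEstimatorSketch {
    static long averageRecordSize(Iterable<byte[]> records, long fallback) {
        try {
            long total = 0;
            long count = 0;
            for (byte[] r : records) {
                total += r.length;
                count++;
            }
            return count == 0 ? fallback : total / count;
        } catch (Throwable t) {
            // Estimation is best-effort; never let it break ingestion.
            return fallback;
        }
    }
}
```

Catching Throwable (rather than Exception) is the point of the ticket: even an Error raised while sampling records degrades to the fallback estimate.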
[jira] [Updated] (HUDI-7777) Allow HoodieTableMetaClient to take HoodieStorage instance directly
[ https://issues.apache.org/jira/browse/HUDI-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7777: Sprint: Sprint 2023-04-26 > Allow HoodieTableMetaClient to take HoodieStorage instance directly > > > Key: HUDI-7777 > URL: https://issues.apache.org/jira/browse/HUDI-7777 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > We need the functionality for the meta client to take a HoodieStorage instance directly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities
[ https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7814: Fix Version/s: 1.0.0 0.16.0 > Exclude unused transitive dependencies that introduce vulnerabilities > - > > Key: HUDI-7814 > URL: https://issues.apache.org/jira/browse/HUDI-7814 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.16.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities
[ https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7814: --- Assignee: Ethan Guo > Exclude unused transitive dependencies that introduce vulnerabilities > - > > Key: HUDI-7814 > URL: https://issues.apache.org/jira/browse/HUDI-7814 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities
[ https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7814: - Labels: pull-request-available (was: ) > Exclude unused transitive dependencies that introduce vulnerabilities > - > > Key: HUDI-7814 > URL: https://issues.apache.org/jira/browse/HUDI-7814 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities
Ethan Guo created HUDI-7814: --- Summary: Exclude unused transitive dependencies that introduce vulnerabilities Key: HUDI-7814 URL: https://issues.apache.org/jira/browse/HUDI-7814 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850517#comment-17850517 ] sivabalan narayanan commented on HUDI-7211: --- For auto record key gen, you need to set operation type to "INSERT". Can you give that a try. > Relax need of ordering/precombine field for tables with autogenerated record > keys for DeltaStreamer > --- > > Key: HUDI-7211 > URL: https://issues.apache.org/jira/browse/HUDI-7211 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Aditya Goenka >Priority: Critical > Fix For: 1.1.0 > > > [https://github.com/apache/hudi/issues/10233] > > ``` > NOW=$(date '+%Y%m%dt%H%M%S') > ${SPARK_HOME}/bin/spark-submit \ > --jars > ${path_prefix}/jars/${SPARK_V}/hudi-spark${SPARK_VERSION}-bundle_2.12-${HUDI_VERSION}.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > ${path_prefix}/jars/${SPARK_V}/hudi-utilities-slim-bundle_2.12-${HUDI_VERSION}.jar > \ > --target-base-path ${path_prefix}/testcases/stocks/data/target/${NOW} \ > --target-table stocks${NOW} \ > --table-type COPY_ON_WRITE \ > --base-file-format PARQUET \ > --props ${path_prefix}/testcases/stocks/configs/hoodie.properties \ > --source-class org.apache.hudi.utilities.sources.JsonDFSSource \ > --schemaprovider-class > org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ > --hoodie-conf > hoodie.deltastreamer.schemaprovider.source.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc > \ > --hoodie-conf > hoodie.deltastreamer.schemaprovider.target.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc > \ > --op UPSERT \ > --spark-master yarn \ > --hoodie-conf > hoodie.deltastreamer.source.dfs.root=${path_prefix}/testcases/stocks/data/source_without_ts > \ > --hoodie-conf hoodie.datasource.write.partitionpath.field=date \ > --hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE \ > --hoodie-conf 
hoodie.datasource.write.hive_style_partitioning=false \ > --hoodie-conf hoodie.metadata.enable=true > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7813) Hive Style partitioning on a bootstrap table is not configurable
Jonathan Vexler created HUDI-7813: - Summary: Hive Style partitioning on a bootstrap table is not configurable Key: HUDI-7813 URL: https://issues.apache.org/jira/browse/HUDI-7813 Project: Apache Hudi Issue Type: Bug Components: bootstrap Reporter: Jonathan Vexler I modified DecodedBootstrapPartitionPathTranslator to be: {code:java} public class DecodedBootstrapPartitionPathTranslator extends BootstrapPartitionPathTranslator { public DecodedBootstrapPartitionPathTranslator() { super(); } @Override public String getBootstrapTranslatedPath(String bootStrapPartitionPath) { String pathMaybeWithHive = PartitionPathEncodeUtils.unescapePathName(bootStrapPartitionPath); if (pathMaybeWithHive.contains("=")) { return Arrays.stream(pathMaybeWithHive.split("/")).map(split -> { if (split.contains("=")) { return split.split("=")[1]; } else { return split; } }).collect(Collectors.joining("/")); } return pathMaybeWithHive; } } {code} And setting hive style partitioning to true does not add it back -- This message was sent by Atlassian Jira (v8.20.10#820010)
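The hive-style stripping logic from the snippet above can be distilled into a runnable form. This sketch omits the `PartitionPathEncodeUtils.unescapePathName` step (a Hudi utility) and keeps only the "key=value" to "value" translation the ticket discusses:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Runnable distillation of the translator snippet from the ticket: drops the
// hive-style "column=" prefix from each partition path segment, if present.
class PartitionPathTranslatorSketch {
    static String stripHiveStyle(String partitionPath) {
        if (!partitionPath.contains("=")) {
            return partitionPath; // already non-hive-style; leave untouched
        }
        return Arrays.stream(partitionPath.split("/"))
            .map(part -> part.contains("=") ? part.split("=")[1] : part)
            .collect(Collectors.joining("/"));
    }
}
```

For example, `stripHiveStyle("year=2024/month=05")` yields `"2024/05"`; the ticket's complaint is that once the prefix is stripped this way, setting hive style partitioning to true does not restore it.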
[jira] [Assigned] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation
[ https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7812: - Assignee: sivabalan narayanan > Async Clustering w/ row writer fails due to timetravel query validation > > > Key: HUDI-7812 > URL: https://issues.apache.org/jira/browse/HUDI-7812 > Project: Apache Hudi > Issue Type: Bug > Components: clustering >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > With clustering row writer enabled flow, we trigger a time travel query to > read input records. But the query side fails if there are any pending commits > (due to new ingestion ) whose timestamp < clustering instant time. we need to > relax this constraint. > > {code:java} > Failed to execute CLUSTERING service > java.util.concurrent.CompletionException: > org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp > '20240406123837295' must be earlier than the first incomplete commit > timestamp '20240406123834233'. > at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) > ~[?:1.8.0_392-internal] > at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) > ~[?:1.8.0_392-internal] > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) > ~[?:1.8.0_392-internal] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > ~[?:1.8.0_392-internal] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > ~[?:1.8.0_392-internal] > at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal] > Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time > travel's timestamp '20240406123837295' must be earlier than the first > incomplete commit timestamp '20240406123834233'. 
> at > org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at scala.Option.foreach(Option.scala:407) > ~[scala-library-2.12.17.jar:?] > at > org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) > ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323) > ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357) > ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413) > ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] > at > org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356) > ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] > at > 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323) > ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) > ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > ~[scala-library-2.12.17.jar:?] > at scala.collection.Iterator$$anon$11
[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation
[ https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7812: -- Description: With clustering row writer enabled flow, we trigger a time travel query to read input records. But the query side fails if there are any pending commits (due to new ingestion ) whose timestamp < clustering instant time. we need to relax this constraint. {code:java} Failed to execute CLUSTERING service java.util.concurrent.CompletionException: org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp '20240406123837295' must be earlier than the first incomplete commit timestamp '20240406123834233'. at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_392-internal] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_392-internal] at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) ~[?:1.8.0_392-internal] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392-internal] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392-internal] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal] Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp '20240406123837295' must be earlier than the first incomplete commit timestamp '20240406123834233'. 
at org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL] at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323) 
~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[scala-library-2.12.17.jar:?] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[scala-library-2.12.17.jar:?] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) ~[scala-library-2.12.17.jar:?] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL] at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) ~[spark-catalyst_2.12-3.2.3.jar:3.2.3] at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194) ~[scala-library-2.12.17.jar:?] at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.17.
[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation
[ https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7812: -- Description: With the clustering row-writer-enabled flow, we trigger a time travel query to read input records. But the query side fails if there are any pending commits (due to new ingestion) whose timestamp < the clustering instant time. We need to relax this constraint. was: With the clustering row-writer-enabled flow, we trigger a time travel query to read input records. But the query side fails if there are any pending commits (due to new ingestion) whose timestamp < the clustering instant time. We need to relax this constraint. > Async Clustering w/ row writer fails due to timetravel query validation > > > Key: HUDI-7812 > URL: https://issues.apache.org/jira/browse/HUDI-7812 > Project: Apache Hudi > Issue Type: Bug > Components: clustering >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > With the clustering row-writer-enabled flow, we trigger a time travel query to > read input records. But the query side fails if there are any pending commits > (due to new ingestion) whose timestamp < the clustering instant time. We need to > relax this constraint. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation
[ https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7812: - Labels: pull-request-available (was: ) > Async Clustering w/ row writer fails due to timetravel query validation > > > Key: HUDI-7812 > URL: https://issues.apache.org/jira/browse/HUDI-7812 > Project: Apache Hudi > Issue Type: Bug > Components: clustering >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > > With the clustering row-writer-enabled flow, we trigger a time travel query to > read input records. But the query side fails if there are any pending commits > (due to new ingestion) whose timestamp < the clustering instant time. We need to > relax this constraint. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation
sivabalan narayanan created HUDI-7812: - Summary: Async Clustering w/ row writer fails due to timetravel query validation Key: HUDI-7812 URL: https://issues.apache.org/jira/browse/HUDI-7812 Project: Apache Hudi Issue Type: Bug Components: clustering Reporter: sivabalan narayanan With the clustering row-writer-enabled flow, we trigger a time travel query to read input records. But the query side fails if there are any pending commits (due to new ingestion) whose timestamp < the clustering instant time. We need to relax this constraint. -- This message was sent by Atlassian Jira (v8.20.10#820010)
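The constraint described above can be sketched as a small check over pending instants. This is an illustrative sketch, not Hudi's actual API: class and method names are made up, and it only assumes that Hudi instant times are timestamp strings whose lexicographic order matches chronological order.

```java
import java.util.List;

// Illustrative sketch (not Hudi's actual API) of the time-travel guard this
// ticket wants to relax: the read is rejected whenever any pending instant is
// earlier than the requested instant, which trips up clustering's own
// time travel read when a newer ingestion commit is still in flight.
public class TimeTravelGuard {

    /** True when some pending instant is strictly earlier than queryInstant. */
    static boolean hasEarlierPendingInstant(List<String> pendingInstants, String queryInstant) {
        // Instant times are timestamp strings, so plain string comparison
        // matches chronological order.
        return pendingInstants.stream().anyMatch(t -> t.compareTo(queryInstant) < 0);
    }
}
```

Under this reading, the relaxation would skip (or scope down) this check for clustering's internal read of its own input file slices.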
[jira] [Created] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
Sagar Sumit created HUDI-7811: - Summary: Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path Key: HUDI-7811 URL: https://issues.apache.org/jira/browse/HUDI-7811 Project: Apache Hudi Issue Type: Improvement Reporter: Sagar Sumit Fix For: 1.0.0 It will help avoid calling FSUtils.getRelativePartitionPath - https://github.com/apache/hudi/pull/11043#discussion_r1611744651 -- This message was sent by Atlassian Jira (v8.20.10#820010)
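One way to read the proposed enhancement, sketched under the assumption that pruning already knows which partition each file came from; the method name and types here are illustrative, not Hudi's actual signatures.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: return (partitionPath, fileName) pairs from pruning,
// so callers no longer need to re-derive the partition via
// FSUtils.getRelativePartitionPath. Not Hudi's actual API.
public class PrunedFileNames {

    static List<Map.Entry<String, String>> getPrunedFileNamesWithPartition(
            Map<String, List<String>> prunedFilesByPartition) {
        return prunedFilesByPartition.entrySet().stream()
                .flatMap(e -> e.getValue().stream().map(f -> Map.entry(e.getKey(), f)))
                .collect(Collectors.toList());
    }
}
```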
[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bradley updated HUDI-7810: -- Description: Fixed in PR: [https://github.com/apache/hudi/pull/11359] (was: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug) > Fix OptionsResolver#allowCommitOnEmptyBatch default value bug > - > > Key: HUDI-7810 > URL: https://issues.apache.org/jira/browse/HUDI-7810 > Project: Apache Hudi > Issue Type: Bug >Reporter: bradley >Priority: Major > Labels: pull-request-available > > Fixed in PR: [https://github.com/apache/hudi/pull/11359] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
[ https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7810: - Labels: pull-request-available (was: ) > Fix OptionsResolver#allowCommitOnEmptyBatch default value bug > - > > Key: HUDI-7810 > URL: https://issues.apache.org/jira/browse/HUDI-7810 > Project: Apache Hudi > Issue Type: Bug >Reporter: bradley >Priority: Major > Labels: pull-request-available > > Fix OptionsResolver#allowCommitOnEmptyBatch default value bug -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
bradley created HUDI-7810: - Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug Key: HUDI-7810 URL: https://issues.apache.org/jira/browse/HUDI-7810 Project: Apache Hudi Issue Type: Bug Reporter: bradley Fix OptionsResolver#allowCommitOnEmptyBatch default value bug -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7808: - Labels: pull-request-available (was: ) > Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 > -- > > Key: HUDI-7808 > URL: https://issues.apache.org/jira/browse/HUDI-7808 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7809: - Labels: hoodie-storage pull-request-available (was: hoodie-storage) > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > With Hudi 0.14.1, without > "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi > query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the > Kryo registratrar, the Hudi read throws NPE due to HadoopStorageConfiguration. > {code:java} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:806) > at org.apache.spark.sql.Dataset.show(Dataset.scala:765) > at org.apache.spark.sql.Dataset.show(Dataset.scala:774) > ... 
47 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) > at > org.apa
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Description: With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registratrar, the Hudi read throws NPE due to HadoopStorageConfiguration. {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsR
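As implied by the description above, registering the Hudi Kryo registrar avoids the NPE. A minimal configuration sketch: the registrator value is taken directly from the report, and pairing it with the Kryo serializer follows Hudi's Spark quick start; requires spark-core on the classpath.

```java
import org.apache.spark.SparkConf;

// Configuration fragment, not a full application: sets the Kryo serializer
// and the Hudi registrator so HadoopStorageConfiguration deserializes
// correctly on executors.
public class HudiKryoConf {
    public static SparkConf build() {
        return new SparkConf()
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar");
    }
}
```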
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Description: With Hudi 0.14.1, without "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", Hudi query in Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the Kryo registratrar, the {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRD
[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits
[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7779: -- Description: Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not make such unintended archival. The major gap we want to guard against: if someone disabled the cleaner, archival should account for data consistency issues and bail out. We have a base guarding condition, where archival stops at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.

a. Keeping replace commits aside, let's dive into specifics for regular commits and delta commits. Say the user configured the cleaner to retain 4 commits and the archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival is certainly guarded until the earliest commit to retain based on the latest clean commit.

Corner case to consider: a savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have stopped at t3 (while the savepoint was present), once the savepoint is removed, if archival is executed it could archive commit t3. Which means the file versions tracked at t3 are still not cleaned up by the cleaner.

Reasoning: we are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end user. But time travel queries are not intended for already cleaned-up commits, so this is not an issue. None of snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3. At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at the t3 commit. Just that, for the interim period, some older file versions might still be exposed to readers.

b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it the cleaner is expected to clean them up fully. But are there chances this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a savepoint and t4 is a replace commit which replaced file groups tracked in t3. The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5, and t6. So the earliest commit to retain will point to t6. Say the savepoint for t3 is then removed, but the cleaner stays disabled. In this state of the timeline, if archival is executed (since t3's savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.

In other words, to summarize the different scenarios:
i. The replaced file group is never cleaned up: ECTR (earliest commit to retain) is less than this.rc, and we are good.
ii. The replaced file group is cleaned up: ECTR is > this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead of this.rc, but due to a savepoint, full cleanup did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we do not account for as of now.

We have 3 options to solve this.

Option A: Let the savepoint deletion flow take care of cleaning up the files it is tracking.
Cons: the savepoint's responsibility is not to remove data files, so from a single-responsibility standpoint this may not be right. Also, this cleanup might need to do what a clean planner actually does, i.e. build the file system view, understand whether a file is already supposed to be cleaned up, and only then clean up the files that qualify. For example, if a file group has only one file slice, it should not be cleaned up, and scenarios like this.

Option B: Since archival is the one that might cause data consistency issues, let archival do the cleanup. We need to account for concurrent cleans, failure and retry scenarios, etc. Also, we might need to build the file system view and then decide whether something needs to be cleaned up before archiving it.
Cons: again, the single-responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files.

Option C: Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another metadata entry named "EarliestCommitToArchive". Strictly speaking, ear
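The guard in Option C above can be sketched as follows, under the assumption that the cleaner publishes such an "EarliestCommitToArchive" watermark alongside the earliest commit to retain; the class and method names are illustrative, not Hudi's actual API.

```java
// Illustrative sketch of Option C: archival consults a cleaner-maintained
// earliestCommitToArchive watermark and only archives instants strictly
// before it, so a replacecommit whose replaced file groups were skipped by
// the cleaner (e.g. behind a since-removed savepoint) is never archived
// prematurely.
public class ArchivalGuard {

    /** Instant times are timestamp strings; lexicographic order is chronological. */
    static boolean canArchive(String instantTime, String earliestCommitToArchive) {
        return instantTime.compareTo(earliestCommitToArchive) < 0;
    }
}
```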
[jira] [Updated] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete
[ https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7655: -- Fix Version/s: 1.0.0 > Support configuration for clean to fail execution if there is at least one > file is marked as a failed delete > > > Key: HUDI-7655 > URL: https://issues.apache.org/jira/browse/HUDI-7655 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Krishen Bhan >Assignee: sivabalan narayanan >Priority: Minor > Labels: clean, pull-request-available > Fix For: 1.0.0 > > > When a HUDI clean plan is executed, any targeted file that was not confirmed > as deleted (or non-existing) will be marked as a "failed delete". Although > these failed deletes will be added to `.clean` metadata, if incremental clean > is used then these files might never be picked up again by a future clean > plan, unless a "full-scan" clean ends up being scheduled. In addition to > leaving more files unnecessarily taking up storage space for longer, this > can lead to the following dataset consistency issue for COW datasets: > # Insert at C1 creates file group f1 in partition > # Replacecommit at RC2 creates file group f2 in partition, and replaces f1 > # Any reader of the partition that calls the HUDI API (with or without using MDT) > will recognize that f1 should be ignored, as it has been replaced. This is > since the RC2 instant file is in the active timeline > # Some completed instants later, an incremental clean is scheduled. It moves > the "earliest commit to retain" to a time after instant RC2, so it > targets f1 for deletion. But during execution of the plan, it fails to delete > f1. > # An archive job eventually is triggered, and archives C1 and RC2. 
Note that > f1 is still in partition > At this point, any job/query that reads the aforementioned partition directly > from the DFS file system calls (without directly using MDT FILES partition) > will consider both f1 and f2 as valid file groups, since RC2 is no longer in > active timeline. This is a data consistency issue, and will only be resolved > if a "full-scan" clean is triggered and deletes f1. > This specific scenario can be avoided if the user can configure HUDI clean to > fail execution of a clean plan unless all files are confirmed as deleted (or > not existing in DFS already), "blocking" the clean. The next clean attempt > will re-execute this existing plan, since clean plans cannot be "rolled > back". -- This message was sent by Atlassian Jira (v8.20.10#820010)
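The proposed configuration could gate the clean commit roughly as below. This is a hedged sketch with illustrative names, not Hudi's actual classes; it only captures the rule stated above: do not commit the clean if any delete is unconfirmed, so the same plan is re-executed on the next attempt.

```java
import java.util.List;

// Sketch of the proposed "strict clean" behavior: if any file targeted by
// the plan could not be confirmed as deleted, fail instead of committing
// the .clean metadata, so the existing plan is retried (clean plans cannot
// be rolled back). Names are illustrative.
public class StrictClean {

    static void finishClean(List<String> failedDeletes, boolean failOnFailedDeletes) {
        if (failOnFailedDeletes && !failedDeletes.isEmpty()) {
            throw new IllegalStateException(
                    "Clean left " + failedDeletes.size() + " file(s) undeleted; "
                            + "not committing so the plan is retried: " + failedDeletes);
        }
        // Otherwise, proceed to write the .clean commit metadata (elided).
    }
}
```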
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Description: With 0.14 {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at 
org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90
[jira] [Assigned] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7809: --- Assignee: Ethan Guo > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
Ethan Guo created HUDI-7809: --- Summary: Use Spark SerializableConfiguration to avoid NPE in Kryo serde Key: HUDI-7809 URL: https://issues.apache.org/jira/browse/HUDI-7809 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Fix Version/s: 0.15.0 1.0.0 > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde
[ https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7809: Labels: hoodie-storage (was: ) > Use Spark SerializableConfiguration to avoid NPE in Kryo serde > -- > > Key: HUDI-7809 > URL: https://issues.apache.org/jira/browse/HUDI-7809 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
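The NPE this ticket targets arises because Hadoop's `Configuration` does not implement `java.io.Serializable`, and Kryo's default field-by-field serde bypasses any custom serialization hooks, which can leave a deserialized copy with null internals. Spark's `org.apache.spark.util.SerializableConfiguration` avoids this by holding the config in a transient field and writing its contents out explicitly through Java serialization. A minimal, self-contained sketch of that wrapper pattern follows; the class name is hypothetical and a plain `Map` stands in for the Hadoop `Configuration`:

```java
import java.io.*;
import java.util.*;

// Illustrative sketch (not Spark's actual class) of the SerializableConfiguration
// pattern: the wrapped object lives in a transient field and is written out
// explicitly, so deserialization always rebuilds a valid instance instead of
// leaving a null field behind.
class SerializableConfigWrapper implements Serializable {
    // transient: the "configuration" is never serialized field-by-field
    private transient Map<String, String> conf;

    SerializableConfigWrapper(Map<String, String> conf) {
        this.conf = conf;
    }

    Map<String, String> get() {
        return conf;
    }

    // Explicitly write the entries so readObject can reconstruct the map.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(conf.size());
        for (Map.Entry<String, String> e : conf.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int n = in.readInt();
        conf = new HashMap<>();
        for (int i = 0; i < n; i++) {
            conf.put(in.readUTF(), in.readUTF());
        }
    }

    // Helper: round-trip a wrapper through Java serialization.
    static SerializableConfigWrapper roundTrip(SerializableConfigWrapper w) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(w);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (SerializableConfigWrapper) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the transient field is rebuilt inside `readObject`, the deserialized wrapper never exposes a null config, which is the failure mode the ticket describes.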
[jira] [Closed] (HUDI-5505) Compaction NUM_COMMITS policy should only judge completed deltacommit
[ https://issues.apache.org/jira/browse/HUDI-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5505. Fix Version/s: 1.0.0 Reviewers: Danny Chen Resolution: Fixed Fixed via master branch: 42243862f0271fda16e70afdbfde61b47792ff70 > Compaction NUM_COMMITS policy should only judge completed deltacommit > - > > Key: HUDI-5505 > URL: https://issues.apache.org/jira/browse/HUDI-5505 > Project: Apache Hudi > Issue Type: Bug > Components: compaction, table-service >Reporter: HunterXHunter >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: image-2023-01-05-13-10-57-918.png > > > `compaction.delta_commits = 1` > > {code:java} > 20230105115229301.deltacommit > 20230105115229301.deltacommit.inflight > 20230105115229301.deltacommit.requested > 20230105115253118.commit > 20230105115253118.compaction.inflight > 20230105115253118.compaction.requested > 20230105115330994.deltacommit.inflight > 20230105115330994.deltacommit.requested{code} > `ScheduleCompactionActionExecutor.needCompact` returns `true`, which is not expected. > > In OCC or lazy-clean mode, this causes compaction to be triggered too early. > `compaction.delta_commits = 3` > > {code:java} > 20230105125650541.deltacommit.inflight > 20230105125650541.deltacommit.requested > 20230105125715081.deltacommit > 20230105125715081.deltacommit.inflight > 20230105125715081.deltacommit.requested > 20230105130018070.deltacommit.inflight > 20230105130018070.deltacommit.requested {code} > > Compaction is still triggered here as well, which is not expected. > !image-2023-01-05-13-10-57-918.png|width=699,height=158! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
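The timelines quoted in the ticket reduce to a small rule: under the NUM_COMMITS trigger policy, only deltacommits in COMPLETED state should count toward the `compaction.delta_commits` threshold, never `.requested` or `.inflight` ones. A hedged sketch of that fix direction, with illustrative names rather than Hudi's actual `ScheduleCompactionActionExecutor` API:

```java
import java.util.*;

// Hedged sketch of the HUDI-5505 fix direction: count only COMPLETED
// deltacommits when deciding whether to schedule compaction. Names are
// illustrative, not Hudi's actual timeline API.
class CompactionTrigger {
    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static final class Instant {
        final String timestamp; final String action; final State state;
        Instant(String timestamp, String action, State state) {
            this.timestamp = timestamp; this.action = action; this.state = state;
        }
    }

    /** True only if the number of COMPLETED deltacommits reaches the threshold. */
    static boolean needCompact(List<Instant> timeline, int deltaCommitsThreshold) {
        long completedDeltaCommits = timeline.stream()
            .filter(i -> i.action.equals("deltacommit"))
            .filter(i -> i.state == State.COMPLETED) // the fix: pending instants no longer count
            .count();
        return completedDeltaCommits >= deltaCommitsThreshold;
    }
}
```

With the ticket's second timeline (three deltacommit timestamps, only one of them completed) and `compaction.delta_commits = 3`, this check correctly returns false instead of scheduling compaction early.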
[jira] [Assigned] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source
[ https://issues.apache.org/jira/browse/HUDI-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davis Zhang reassigned HUDI-7806: - Assignee: Davis Zhang > Skip fail on data-loss for first commit on Kafka Source > --- > > Key: HUDI-7806 > URL: https://issues.apache.org/jira/browse/HUDI-7806 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Davis Zhang >Assignee: Davis Zhang >Priority: Major > > When the ingestion attempts to start from the beginning of the topic, we > should not fail on data loss since topic retention can cause failures when > some data is removed before our ingestion is able to fully read the offsets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
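The behavior described above can be phrased as an offset-resolution rule: a gap between the requested start offset and the earliest offset still retained by the topic is fatal only when resuming from a real checkpoint; on the very first commit, the reader should just advance to the earliest available offset. A hedged sketch with illustrative names, not Hudi's actual Kafka source API:

```java
// Hedged sketch of the HUDI-7806 behavior change: when topic retention has
// aged out the requested offset, fail on data loss only if we were resuming
// from a prior checkpoint; otherwise start from what is left.
class OffsetResolver {
    static long resolveStartOffset(long requestedOffset,
                                   long earliestAvailable,
                                   boolean hasPriorCheckpoint,
                                   boolean failOnDataLoss) {
        if (requestedOffset >= earliestAvailable) {
            return requestedOffset; // nothing was lost
        }
        if (hasPriorCheckpoint && failOnDataLoss) {
            throw new IllegalStateException(
                "Data loss: checkpointed offset " + requestedOffset
                + " is older than earliest available offset " + earliestAvailable);
        }
        // First commit (or fail-on-data-loss disabled): skip to the retained data.
        return earliestAvailable;
    }
}
```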
[jira] [Created] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
Ethan Guo created HUDI-7808: --- Summary: Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 Key: HUDI-7808 URL: https://issues.apache.org/jira/browse/HUDI-7808 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7808: Fix Version/s: 1.0.0 > Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 > -- > > Key: HUDI-7808 > URL: https://issues.apache.org/jira/browse/HUDI-7808 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
[ https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7808: --- Assignee: Ethan Guo > Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45 > -- > > Key: HUDI-7808 > URL: https://issues.apache.org/jira/browse/HUDI-7808 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7807: - Labels: pull-request-available (was: ) > spark-sql updates for a pk less table fails w/ partitioned table > - > > Key: HUDI-7807 > URL: https://issues.apache.org/jira/browse/HUDI-7807 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > quick start fails when trying to UPDATE with spark-sql for a pk less table. > > {code:java} > > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'; > 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET > fare = 25.0 WHERE rider = 'rider-D'] > org.apache.hudi.exception.HoodieException: Unable to instantiate class > org.apache.hudi.keygen.SimpleKeyGenerator > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75) > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123) > at > org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91) > at > org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47) > at > org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) > at > 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.trees.TreeNod
[jira] [Created] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
sivabalan narayanan created HUDI-7807: - Summary: spark-sql updates for a pk less table fails w/ partitioned table Key: HUDI-7807 URL: https://issues.apache.org/jira/browse/HUDI-7807 Project: Apache Hudi Issue Type: Bug Components: spark-sql Reporter: sivabalan narayanan quick start fails when trying to UPDATE with spark-sql for a pk less table. {code:java} > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'; 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'] org.apache.hudi.exception.HoodieException: Unable to instantiate class org.apache.hudi.keygen.SimpleKeyGenerator at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123) at org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91) at org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47) at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106) 
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91) at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:
[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7807: -- Fix Version/s: 0.15.0 1.0.0 > spark-sql updates for a pk less table fails w/ partitioned table > - > > Key: HUDI-7807 > URL: https://issues.apache.org/jira/browse/HUDI-7807 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > quick start fails when trying to UPDATE with spark-sql for a pk less table. > > {code:java} > > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'; > 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET > fare = 25.0 WHERE rider = 'rider-D'] > org.apache.hudi.exception.HoodieException: Unable to instantiate class > org.apache.hudi.keygen.SimpleKeyGenerator > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75) > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123) > at > org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91) > at > org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47) > at > org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) > at > 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.exe
[jira] [Assigned] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table
[ https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-7807: - Assignee: sivabalan narayanan > spark-sql updates for a pk less table fails w/ partitioned table > - > > Key: HUDI-7807 > URL: https://issues.apache.org/jira/browse/HUDI-7807 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > > quick start fails when trying to UPDATE with spark-sql for a pk less table. > > {code:java} > > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D'; > 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET > fare = 25.0 WHERE rider = 'rider-D'] > org.apache.hudi.exception.HoodieException: Unable to instantiate class > org.apache.hudi.keygen.SimpleKeyGenerator > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75) > at > org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123) > at > org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91) > at > org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47) > at > org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232) > at > org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) > at > 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.execution.QueryExecution.eagerlyE
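The root failure in the trace above is `SimpleKeyGenerator` being instantiated for a table that has no record key configured, since that generator assumes a record-key field exists. A hedged sketch of the selection logic a key-generator factory could apply; the class-name strings are illustrative of the idea, not a statement of what Hudi's `HoodieSparkKeyGeneratorFactory` actually does:

```java
// Hedged sketch for HUDI-7807: choose a key generator based on whether a
// record key and a partition field are configured, instead of defaulting to
// SimpleKeyGenerator (which fails for primary-key-less tables). The returned
// names are illustrative placeholders.
class KeyGenChooser {
    static String chooseKeyGenerator(String recordKeyField, String partitionField) {
        if (recordKeyField == null || recordKeyField.isEmpty()) {
            // pk-less table: record keys must be auto-generated
            return "AutoRecordKeyGenerator";
        }
        return (partitionField == null || partitionField.isEmpty())
            ? "NonpartitionedKeyGenerator"
            : "SimpleKeyGenerator";
    }
}
```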
[jira] [Created] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source
Davis Zhang created HUDI-7806: - Summary: Skip fail on data-loss for first commit on Kafka Source Key: HUDI-7806 URL: https://issues.apache.org/jira/browse/HUDI-7806 Project: Apache Hudi Issue Type: Improvement Reporter: Davis Zhang When the ingestion attempts to start from the beginning of the topic, we should not fail on data loss since topic retention can cause failures when some data is removed before our ingestion is able to fully read the offsets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7805) FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed
[ https://issues.apache.org/jira/browse/HUDI-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7805: - Labels: pull-request-available (was: ) > FileSystemBasedLockProvider need delete lock file auto when occur lock > conflict to avoid next write failed > -- > > Key: HUDI-7805 > URL: https://issues.apache.org/jira/browse/HUDI-7805 > Project: Apache Hudi > Issue Type: Improvement > Components: multi-writer >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > > org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock > object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock > at > org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100) > at > org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58) > at > org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258) > at > org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301) > at > org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139) > at > org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396) > at > org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108) > at > org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) > at > org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106) > at > 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91) > at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:219) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99) > at org.apache.spark.sql.SparkSession.withActive(SparkSess
[jira] [Created] (HUDI-7805) FileSystemBasedLockProvider need delete lock file auto when occur lock conflict to avoid next write failed
xy created HUDI-7805:
---------------------
Summary: FileSystemBasedLockProvider should delete the lock file automatically when a lock conflict occurs, so that the next write does not fail
Key: HUDI-7805
URL: https://issues.apache.org/jira/browse/HUDI-7805
Project: Apache Hudi
Issue Type: Improvement
Components: multi-writer
Reporter: xy
Assignee: xy

org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock
  at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100)
  at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58)
  at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258)
  at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301)
  at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
  at com.vivo.bigdata.etl.process.EtlProcessMain$.main(EtlProcessMain.scala:367)
  at com.vivo.bigdata.etl.process.EtlProcessMain.main(EtlProcessMain.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
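The behavior the ticket asks for could be sketched as a TTL-based reclaim: when lock-file creation fails because the file already exists, check the file's age and delete it if it looks abandoned. This is only an illustrative sketch, not Hudi's actual FileSystemBasedLockProvider; the class name, `tryAcquire`, and the `expiryMs` threshold are all hypothetical, and it uses the local filesystem where the real provider targets HDFS/object storage.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a filesystem lock that reclaims a stale lock file.
public class StaleLockSketch {

  // Try to take the lock by atomically creating the lock file. If the file
  // already exists but is older than expiryMs, treat it as left behind by a
  // failed writer, delete it, and retry once.
  public static boolean tryAcquire(Path lockFile, long expiryMs) throws IOException {
    try {
      Files.createFile(lockFile); // atomic create-if-absent == acquire
      return true;
    } catch (FileAlreadyExistsException held) {
      long ageMs = System.currentTimeMillis()
          - Files.getLastModifiedTime(lockFile).toMillis();
      if (ageMs > expiryMs) {
        Files.deleteIfExists(lockFile); // reclaim the stale lock
        try {
          Files.createFile(lockFile);
          return true;
        } catch (FileAlreadyExistsException raced) {
          return false; // another writer re-acquired first
        }
      }
      return false; // lock is held and still fresh
    }
  }
}
```

The expiry threshold is the delicate part of such a design: set too low, a slow but live writer loses its lock; set too high, a crashed writer blocks everyone for that long.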
[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7804:
-----------------------------
    Sprint: Sprint 2023-04-26

> Improve flink bucket index partitioner
> --------------------------------------
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: xi chaomin
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7804) Improve flink bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen reassigned HUDI-7804:
--------------------------------
    Assignee: Danny Chen

> Improve flink bucket index partitioner
> --------------------------------------
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: xi chaomin
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288
[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7804:
---------------------------------
    Labels: pull-request-available  (was: )

> Improve flink bucket index partitioner
> --------------------------------------
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: xi chaomin
> Priority: Major
> Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288
[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner
[ https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xi chaomin updated HUDI-7804:
-----------------------------
    Description: https://github.com/apache/hudi/issues/11288

> Improve flink bucket index partitioner
> --------------------------------------
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: xi chaomin
> Priority: Major
>
> https://github.com/apache/hudi/issues/11288
[jira] [Created] (HUDI-7804) Improve flink bucket index partitioner
xi chaomin created HUDI-7804:
-----------------------------
Summary: Improve flink bucket index partitioner
Key: HUDI-7804
URL: https://issues.apache.org/jira/browse/HUDI-7804
Project: Apache Hudi
Issue Type: Bug
Reporter: xi chaomin
[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader
[ https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7795:
-----------------------------
    Status: Patch Available  (was: In Progress)

> Fix loading of input splits from look up table reader
> -----------------------------------------------------
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader
[ https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7795:
-----------------------------
    Status: In Progress  (was: Open)

> Fix loading of input splits from look up table reader
> -----------------------------------------------------
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7507:
---------------------------------
    Labels: pull-request-available  (was: )

> ongoing concurrent writers with smaller timestamp can cause issues with
> table services
> -----------------------------------------------------------------------
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
> Issue Type: Improvement
> Components: table-service
> Reporter: Krishen Bhan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
> Attachments: Flowchart (1).png, Flowchart.png
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested
> instant, HUDI writers do not generate a timestamp and create a .requested
> plan in the same transaction, so the following can happen:
> # Job 1 starts and chooses timestamp (x); Job 2 starts and chooses timestamp (x-1)
> # Job 1 schedules and creates a requested file with instant timestamp (x)
> # Job 2 schedules and creates a requested file with instant timestamp (x-1)
> # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can
> cause issues:
> ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction,
> and Job 1 runs before Job 2, Job 1 can create a compaction plan covering all
> instant times (up to (x)) that does not include instant time (x-1). Job 2
> later creates instant time (x-1), but the timeline is then in a corrupted
> state, since the compaction plan was supposed to include (x-1).
> ** There is a similar issue with clean. If Job 2 is a long-running commit
> (stuck/delayed for a while before creating its .requested plan) and Job 1 is
> a clean, then Job 1 can perform a clean that advances
> earliest-commit-to-retain without waiting for Job 2's inflight instant at
> (x-1) to complete. This causes Job 2 to be "skipped" by clean.
> ** If the completed commit files include some sort of "checkpointing", with
> another "downstream job" performing incremental reads on this dataset (such
> as Hoodie Streamer/DeltaSync), there may be incorrect behavior, such as the
> incremental reader skipping some completed commits (those with a smaller
> instant timestamp than the latest completed commit but created after it).
> [Edit] I added a diagram to visualize the issue, specifically the second
> scenario with clean
> !Flowchart (1).png!
>
> *Proposed approach:*
> One way to resolve this is to combine generating the instant time and
> creating the requested file into the same HUDI table transaction,
> executing the following steps whenever any instant (commit, table service,
> etc.) is scheduled:
> Approach A
> # Acquire the table lock
> # Look at the latest instant C on the active timeline (completed or not),
> and generate a timestamp after C
> # Create the plan and requested file using this new timestamp (greater
> than C)
> # Release the table lock
> Unfortunately (A) has the following drawbacks:
> * Every operation must now hold the table lock while computing its plan,
> even if that computation is expensive and takes a while
> * Users of HUDI cannot easily set their own instant time for an operation;
> this restriction would break any public APIs that allow it and would
> require deprecating those APIs.
>
> An alternate approach is to have every operation abort creating a
> .requested file unless it has the latest timestamp. Specifically, for any
> instant type, whenever an operation is about to create a .requested plan
> on the timeline, it should take the table lock and assert that there are
> no other instants on the timeline greater than it that could cause a
> conflict. If that assertion fails, throw a retryable conflict-resolution
> exception.
> Specifically, the following steps should be followed whenever any instant
> (commit, table service, etc.) is scheduled:
> Approach B
> # Acquire the table lock. Assume that the desired instant time C and the
> requested file plan metadata have already been created, regardless of
> whether that happened before this step or right after acquiring the table
> lock.
> # If there are any instants on the timeline greater than C (regardless of
> their operation type or state), release the table lock and throw an
> exception
> # Create the requested plan on the timeline (as usual)
> # Release the table lock
> Unlike (A), thi
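The four steps of Approach B can be sketched as below. This is an illustrative toy, not Hudi's timeline API: the class, method, and the in-memory `TreeSet` standing in for the timeline are all hypothetical, and a `ReentrantLock` stands in for the table lock. It relies on Hudi-style instant timestamps sorting chronologically when compared lexicographically.

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of Approach B: under the table lock, refuse to create
// a .requested plan when any instant newer than the candidate already exists.
public class ApproachBSketch {
  private final ReentrantLock tableLock = new ReentrantLock();
  // Stands in for the active timeline; lexicographic order == chronological
  // order for fixed-width instant timestamps.
  private final TreeSet<String> instants = new TreeSet<>();

  public void createRequested(String instantTime) {
    tableLock.lock(); // step 1: acquire the table lock
    try {
      // Step 2: abort if any instant greater than C is already on the timeline.
      if (instants.higher(instantTime) != null) {
        throw new IllegalStateException(
            "Retryable conflict: an instant newer than " + instantTime + " exists");
      }
      // Step 3: stands in for writing the .requested file to the timeline.
      instants.add(instantTime);
    } finally {
      tableLock.unlock(); // step 4: release the table lock
    }
  }
}
```

A caller that catches the exception would regenerate a fresh (now-latest) timestamp and retry, which is what makes the conflict retryable rather than fatal.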
[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader
[ https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7795:
-----------------------------
    Fix Version/s: 1.0.0

> Fix loading of input splits from look up table reader
> -----------------------------------------------------
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0