[jira] [Closed] (HUDI-7007) Integrate functional index using bloom filter on reader side

2024-06-01 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7007.
-
Resolution: Done

> Integrate functional index using bloom filter on reader side
> 
>
> Key: HUDI-7007
> URL: https://issues.apache.org/jira/browse/HUDI-7007
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Currently, one can create a functional index on a column using bloom filters. 
> However, only the index created using column stats is supported on the reader 
> side (check `FunctionalIndexSupport`). This ticket tracks support for using 
> bloom filters with functional indexes in the reader path.
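The reader-path integration amounts to consulting per-file bloom filters built over the indexed expression before scanning, and skipping files whose filter definitely excludes the looked-up key. A minimal, self-contained sketch of that pruning idea follows; all class and method names here are illustrative stand-ins, not Hudi's actual `FunctionalIndexSupport` API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of reader-side pruning with a functional index backed by bloom filters.
public class FunctionalBloomPruneSketch {

  // Minimal bloom filter over strings: two hash probes into a fixed bit set.
  static final class Bloom {
    private static final int SIZE = 1024;
    private final BitSet bits = new BitSet(SIZE);

    void add(String key) {
      bits.set(h1(key));
      bits.set(h2(key));
    }

    // May return true for absent keys (false positive), never false for present ones.
    boolean mightContain(String key) {
      return bits.get(h1(key)) && bits.get(h2(key));
    }

    private static int h1(String k) { return Math.floorMod(k.hashCode(), SIZE); }
    private static int h2(String k) { return Math.floorMod(k.hashCode() * 31 + 17, SIZE); }
  }

  // Build one filter per data file over expr(value), e.g. expr = lower(col).
  static Map<String, Bloom> buildIndex(Map<String, List<String>> fileToValues,
                                       Function<String, String> expr) {
    Map<String, Bloom> index = new HashMap<>();
    fileToValues.forEach((file, values) -> {
      Bloom bloom = new Bloom();
      values.forEach(v -> bloom.add(expr.apply(v)));
      index.put(file, bloom);
    });
    return index;
  }

  // Keep only files whose filter might contain the key; skipped files are
  // guaranteed not to contain it (bloom filters have no false negatives).
  static List<String> prune(Map<String, Bloom> index, String key) {
    List<String> kept = new ArrayList<>();
    index.forEach((file, bloom) -> {
      if (bloom.mightContain(key)) {
        kept.add(file);
      }
    });
    Collections.sort(kept);
    return kept;
  }
}
```

Because false positives are possible, pruning can only ever keep extra files, never drop a matching one.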



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7825) Support Report pending clustering and compaction plan metric

2024-06-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7825:
-
Labels: pull-request-available  (was: )

> Support Report pending clustering and compaction plan metric 
> -
>
> Key: HUDI-7825
> URL: https://issues.apache.org/jira/browse/HUDI-7825
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jack Lei
>Priority: Major
>  Labels: pull-request-available
>
> 1. When only async clustering or async compaction scheduling is enabled, and 
> clustering.async.enabled or compaction.async.enabled is set to false, the 
> Flink job does not add clusterPlanOperator or CompactionPlanOperator.
> 2. However, the pending-plan metric is emitted in clusterPlanOperator or 
> CompactionPlanOperator.
> 3. So we could support emitting the pending-plan metric in 
> StreamWriteOperatorCoordinator instead.
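The proposal above can be sketched as registering the pending-plan gauges on the coordinator, which exists regardless of whether the plan operators were added. All types and names below (`MetricGroup`, `registerGauges`, the gauge names) are illustrative stand-ins, not the actual Flink or Hudi APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: emit pending clustering/compaction plan counts from a coordinator
// instead of from the (possibly absent) plan operators.
public class CoordinatorMetricsSketch {

  // Stand-in for a metrics registry that accepts lazily-evaluated gauges.
  interface MetricGroup {
    void gauge(String name, Supplier<Integer> value);
  }

  static final class SimpleMetricGroup implements MetricGroup {
    final Map<String, Supplier<Integer>> gauges = new HashMap<>();

    public void gauge(String name, Supplier<Integer> value) {
      gauges.put(name, value);
    }

    int read(String name) {
      return gauges.get(name).get();
    }
  }

  // Stand-in for the timeline state the coordinator can already observe.
  static final class Timeline {
    int pendingClusteringPlans;
    int pendingCompactionPlans;
  }

  // What StreamWriteOperatorCoordinator could do at startup: register gauges
  // that read the pending-plan counts on demand.
  static void registerGauges(MetricGroup metrics, Timeline timeline) {
    metrics.gauge("pendingClusteringPlans", () -> timeline.pendingClusteringPlans);
    metrics.gauge("pendingCompactionPlans", () -> timeline.pendingCompactionPlans);
  }
}
```

Registering on the coordinator keeps the metric available even when clustering.async.enabled or compaction.async.enabled is false.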





[jira] [Created] (HUDI-7825) Support Report pending clustering and compaction plan metric

2024-06-01 Thread jack Lei (Jira)
jack Lei created HUDI-7825:
--

 Summary: Support Report pending clustering and compaction plan 
metric 
 Key: HUDI-7825
 URL: https://issues.apache.org/jira/browse/HUDI-7825
 Project: Apache Hudi
  Issue Type: Bug
Reporter: jack Lei


1. When only async clustering or async compaction scheduling is enabled, and 
clustering.async.enabled or compaction.async.enabled is set to false, the Flink 
job does not add clusterPlanOperator or CompactionPlanOperator.

2. However, the pending-plan metric is emitted in clusterPlanOperator or 
CompactionPlanOperator.

3. So we could support emitting the pending-plan metric in StreamWriteOperatorCoordinator instead.





[jira] [Assigned] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-06-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7824:
-

Assignee: sivabalan narayanan

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
> then when the savepoint is later removed, the cleaner should account for 
> cleaning up the commit of interest.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.
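The intended behavior can be sketched as a fallback in the planning path: if any savepoint that existed at the last clean has since been removed, consider all partitions rather than only the incrementally touched ones. All names below are hypothetical, not Hudi's actual clean planner API.

```java
import java.util.Set;

// Sketch: incremental clean planning with a full-partition fallback when a
// savepoint removal is detected since the last clean.
public class IncrCleanPlanSketch {

  // A savepoint was removed if the current set no longer covers the old set.
  static boolean savepointRemovedSinceLastClean(Set<String> savepointsAtLastClean,
                                                Set<String> savepointsNow) {
    return !savepointsNow.containsAll(savepointsAtLastClean);
  }

  // Fast path: only the partitions touched since the last clean.
  // Fallback: all partitions, so files of the formerly savepointed commit
  // are re-examined and eventually cleaned.
  static Set<String> partitionsToClean(Set<String> incrementalPartitions,
                                       Set<String> allPartitions,
                                       Set<String> savepointsAtLastClean,
                                       Set<String> savepointsNow) {
    return savepointRemovedSinceLastClean(savepointsAtLastClean, savepointsNow)
        ? allPartitions
        : incrementalPartitions;
  }
}
```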





[jira] [Updated] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7824:
-
Labels: pull-request-available  (was: )

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the clean-up of a 
> commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
> then when the savepoint is later removed, the cleaner should account for 
> cleaning up the commit of interest.
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.





[jira] [Created] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7824:
-

 Summary: Fix incremental partitions fetch logic when savepoint is 
removed for Incr cleaner
 Key: HUDI-7824
 URL: https://issues.apache.org/jira/browse/HUDI-7824
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


With the incremental cleaner, if a savepoint is blocking the clean-up of a 
commit and the cleaner has moved ahead w.r.t. the earliest commit to retain, 
then when the savepoint is later removed, the cleaner should account for 
cleaning up the commit of interest.

Let's ensure the clean planner accounts for all partitions when such a 
savepoint removal is detected.





[jira] [Updated] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7823:
-
Labels: pull-request-available  (was: )

> Simplify dependency management on exclusions
> 
>
> Key: HUDI-7823
> URL: https://issues.apache.org/jira/browse/HUDI-7823
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7823:
---

 Summary: Simplify dependency management on exclusions
 Key: HUDI-7823
 URL: https://issues.apache.org/jira/browse/HUDI-7823
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7822:
-
Labels: pull-request-available  (was: )

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Commented] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851205#comment-17851205
 ] 

Ethan Guo commented on HUDI-7822:
-

https://github.com/apache/hudi/pull/10931

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7822:
---

 Summary: Resolve the conflicts between mixed hdfs and local path 
in Flink tests
 Key: HUDI-7822
 URL: https://issues.apache.org/jira/browse/HUDI-7822
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7822:

Fix Version/s: 1.0.0

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7821:
-
Labels: pull-request-available  (was: )

> Handle schema evolution in proto to avro conversion
> ---
>
> Key: HUDI-7821
> URL: https://issues.apache.org/jira/browse/HUDI-7821
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> Users can encounter errors when a batch of data was written with an older 
> schema and the new schema has fields that are not present in the old data.
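One common way to handle this, sketched below with plain maps rather than actual protobuf/Avro objects, is to conform each old record to the new schema by filling absent fields with that schema's defaults. All names here are illustrative, not the converter's real API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: schema-evolution-tolerant conversion. A record written with an older
// schema may lack fields the new schema defines; fill them with defaults
// instead of failing.
public class SchemaEvolveSketch {

  // newSchemaDefaults maps every field of the new schema to its default value.
  static Map<String, Object> conform(Map<String, Object> oldRecord,
                                     Map<String, Object> newSchemaDefaults) {
    Map<String, Object> out = new LinkedHashMap<>();
    newSchemaDefaults.forEach((field, dflt) ->
        // Keep the old value when present, otherwise the new schema's default.
        out.put(field, oldRecord.getOrDefault(field, dflt)));
    return out;
  }
}
```

This mirrors Avro-style schema resolution, where reader-schema fields missing from the writer schema take their declared defaults.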





[jira] [Created] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-7821:
---

 Summary: Handle schema evolution in proto to avro conversion
 Key: HUDI-7821
 URL: https://issues.apache.org/jira/browse/HUDI-7821
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Timothy Brown


Users can encounter errors when a batch of data was written with an older 
schema and the new schema has fields that are not present in the old data.





[jira] [Closed] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7811.
-
Resolution: Fixed

Fixed in the original PR itself - 
https://github.com/apache/hudi/pull/11043#discussion_r1621825753

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
> -
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> It will help avoid calling FSUtils.getRelativePartitionPath - 
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651
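The enhancement amounts to returning (partition path, file name) pairs from the pruning helper so callers no longer re-derive the partition via FSUtils.getRelativePartitionPath. A minimal sketch with hypothetical names (the real method lives on `SparkBaseIndexSupport`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: pruning helper that carries the partition path alongside each file
// name, sparing callers a second relative-path computation.
public class PrunedFilesSketch {

  static final class PartitionedFile {
    final String partitionPath;
    final String fileName;

    PartitionedFile(String partitionPath, String fileName) {
      this.partitionPath = partitionPath;
      this.fileName = fileName;
    }
  }

  // Flatten a partition-to-files mapping into explicit (partition, file) pairs.
  static List<PartitionedFile> prunedFiles(Map<String, List<String>> partitionToFiles) {
    List<PartitionedFile> out = new ArrayList<>();
    partitionToFiles.forEach((partition, files) ->
        files.forEach(f -> out.add(new PartitionedFile(partition, f))));
    return out;
  }
}
```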





[jira] [Assigned] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7811:
-

Assignee: Sagar Sumit

> Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path
> -
>
> Key: HUDI-7811
> URL: https://issues.apache.org/jira/browse/HUDI-7811
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> It will help avoid calling FSUtils.getRelativePartitionPath - 
> https://github.com/apache/hudi/pull/11043#discussion_r1611744651





[jira] [Created] (HUDI-7820) For bloom index reader path, prune based on min/max if colstats is enabled

2024-05-31 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7820:
-

 Summary: For bloom index reader path, prune based on min/max if 
colstats is enabled
 Key: HUDI-7820
 URL: https://issues.apache.org/jira/browse/HUDI-7820
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.1.0, 1.0.0


Bloom filters can result in false positives. We can try to prune files based on 
min/max if colstats is available for the field. 
https://github.com/apache/hudi/pull/11043#discussion_r1621639791
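The pruning order described above can be sketched as a range check first, bloom filter second: a key outside a file's [min, max] range definitely cannot be present, so the bloom filter (which can return false positives) is consulted only when the range check passes. The helper below is illustrative, not Hudi's reader code.

```java
// Sketch: combine column-stats min/max pruning with a bloom filter check.
public class MinMaxThenBloomSketch {

  // min/max come from column stats for the field in a given file;
  // bloomSaysMaybe is the bloom filter's answer for the key.
  static boolean mayContain(String key, String min, String max, boolean bloomSaysMaybe) {
    if (key.compareTo(min) < 0 || key.compareTo(max) > 0) {
      // Outside the file's value range: definitely not present, prune the file.
      return false;
    }
    // Inside the range: defer to the bloom filter (may still be a false positive).
    return bloomSaysMaybe;
  }
}
```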





[jira] [Updated] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7819:
-
Labels: pull-request-available  (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7819
> URL: https://issues.apache.org/jira/browse/HUDI-7819
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread bradley (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bradley closed HUDI-7810.
-
Resolution: Later

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fixed in PR: [https://github.com/apache/hudi/pull/11359]





[jira] [Created] (HUDI-7819) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-30 Thread bradley (Jira)
bradley created HUDI-7819:
-

 Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value 
bug
 Key: HUDI-7819
 URL: https://issues.apache.org/jira/browse/HUDI-7819
 Project: Apache Hudi
  Issue Type: Bug
Reporter: bradley








[jira] [Updated] (HUDI-7818) Flink Table planner not loading problem

2024-05-30 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7818:
-
Sprint: Sprint 2023-04-26

> Flink Table planner not loading problem
> ---
>
> Key: HUDI-7818
> URL: https://issues.apache.org/jira/browse/HUDI-7818
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7818) Flink Table planner not loading problem

2024-05-30 Thread Danny Chen (Jira)
Danny Chen created HUDI-7818:


 Summary: Flink Table planner not loading problem
 Key: HUDI-7818
 URL: https://issues.apache.org/jira/browse/HUDI-7818
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 1.0.0








[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7817:
-
Labels: pull-request-available  (was: )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is an older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core). 
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities, which 
> should be avoided.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of Jackson Core 
(com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoided.  (was: org.codehaus.jackson is a older version of Jackson 
Core (com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoid.)

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core).  
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
> should be avoided.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of Jackson Core 
(com.fasterxml.jackson.core:jackson-core).  
org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
should be avoid.  (was: org.codehaus.jackson is a older version of )

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of Jackson Core 
> (com.fasterxml.jackson.core:jackson-core).  
> org.codehaus.jackson:jackson-mapper-asl has critical vulnerabilities which 
> should be avoid.





[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Description: org.codehaus.jackson is a older version of 

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> org.codehaus.jackson is a older version of 





[jira] [Assigned] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7817:
---

Assignee: Ethan Guo

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7817:

Fix Version/s: 1.0.0

> Use Jackson Core instead of org.codehaus.jackson for JSON encoding
> --
>
> Key: HUDI-7817
> URL: https://issues.apache.org/jira/browse/HUDI-7817
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7817) Use Jackson Core instead of org.codehaus.jackson for JSON encoding

2024-05-30 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7817:
---

 Summary: Use Jackson Core instead of org.codehaus.jackson for JSON 
encoding
 Key: HUDI-7817
 URL: https://issues.apache.org/jira/browse/HUDI-7817
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7816) Pass the source profile to the snapshot query splitter

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7816:
-
Labels: pull-request-available  (was: )

> Pass the source profile to the snapshot query splitter
> --
>
> Key: HUDI-7816
> URL: https://issues.apache.org/jira/browse/HUDI-7816
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7816) Pass the source profile to the snapshot query splitter

2024-05-30 Thread Rajesh Mahindra (Jira)
Rajesh Mahindra created HUDI-7816:
-

 Summary: Pass the source profile to the snapshot query splitter
 Key: HUDI-7816
 URL: https://issues.apache.org/jira/browse/HUDI-7816
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Rajesh Mahindra








[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could, on rare occasions, lead to 
data consistency issues. We should come up with proper guards to ensure we do 
not make such unintended archivals.

 

The major gap we want to guard against is:

if someone disabled the cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on the latest clean commit metadata. But there are a few 
other scenarios that need to be accounted for.

 

a. Keeping replace commits aside, let's dive into specifics for regular commits 
and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival 
configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
versions created at or before t6. Say the cleaner did not run (for whatever 
reason) for the next 5 commits.

Archival will certainly be guarded until the earliest commit to retain based on 
the latest clean commit.

Corner case to consider:

A savepoint was added at, say, t3 and later removed, and the cleaner was still 
never re-enabled. Even though archival would have been stopped at t3 (while the 
savepoint was present), once the savepoint is removed, if archival is executed, 
it could archive commit t3. That means the file versions tracked at t3 have 
still not been cleaned by the cleaner.

Reasoning:

We are good here w.r.t. data consistency. Until the cleaner next runs, these 
older file versions might be exposed to the end user. But time-travel queries 
are not intended for already-cleaned-up commits, so this is not an issue. None 
of the snapshot, time-travel, or incremental queries will run into issues, as 
they are not supposed to poll for t3.

At any later point, if the cleaner is re-enabled, it will take care of cleaning 
up the file versions tracked at the t3 commit. It is just that, for the interim 
period, some older file versions might still be exposed to readers.

 

b. The trickier part is when replace commits are involved. Since the replace 
commit metadata in the active timeline is what ensures the replaced file groups 
are ignored for reads, the cleaner is expected to clean them up fully before 
that metadata is archived. But are there chances this could go wrong?

Corner case to consider: let's add onto the above scenario, where t3 has a 
savepoint and t4 is a replace commit which replaced file groups tracked in t3.

The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
the savepoint) but will clean up t4, t5 and t6. So the earliest commit to 
retain will be pointing to t6. Now say the savepoint for t3 is removed while 
the cleaner is disabled. In this state of the timeline, if archival is executed 
(since t3's savepoint is removed), archival might archive t3 and t4.rc. This 
could lead to data duplicates, as both the replaced file groups and the new 
file groups from t4.rc would be exposed as valid file groups.

 

In other words, to summarize the different scenarios:

i. The replaced file group is never cleaned up:
    ECTR (earliest commit to retain) is less than this.rc, and we are good.
ii. The replaced file group is cleaned up:
    ECTR is greater than this.rc, and it is safe to archive.
iii. The tricky one: ECTR moved ahead of this.rc, but due to a savepoint, full 
clean-up did not happen. After the savepoint is removed, when archival is 
executed, we should avoid archiving the replace commit of interest. This is the 
gap we do not account for as of now.
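The guard needed for scenario iii above can be sketched as: ECTR having moved past a replace commit is necessary but not sufficient to archive it; the replaced file groups must also have been fully cleaned. A hypothetical condition (string-ordered instant times, as in Hudi timelines; names are illustrative):

```java
// Sketch: archival guard for replace commits. A replacecommit may only be
// archived when the earliest commit to retain (ECTR) has moved past it AND
// its replaced file groups were actually cleaned, which a savepoint may have
// prevented even though ECTR advanced.
public class ArchivalGuardSketch {

  static boolean canArchive(String replaceCommitTime,
                            String earliestCommitToRetain,
                            boolean replacedGroupsFullyCleaned) {
    // ECTR ahead of the replacecommit is necessary but not sufficient.
    return replaceCommitTime.compareTo(earliestCommitToRetain) < 0
        && replacedGroupsFullyCleaned;
  }
}
```

In the t3/t4 example above, t4.rc sits behind ECTR = t6 but its replaced groups were skipped by the cleaner, so the guard keeps it in the active timeline.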

 

We have 3 options to solve this.

Option A:

Let the savepoint deletion flow take care of cleaning up the files it is 
tracking.

Cons:

Removing data files is not the savepoint's responsibility, so from a 
single-responsibility standpoint this may not be right. Also, this clean-up 
might need to do what a clean planner would do: build the file system view, 
determine whether a file is already supposed to be cleaned up, and only then 
delete the files that qualify. For example, a file group with only one file 
slice should not be cleaned up, and there are other scenarios like this.

 

Option B:

Since archival is the one that might cause data consistency issues, why not 
have archival do the clean-up?

We would need to account for concurrent cleans, failure and retry scenarios, 
etc. Also, we might need to build the file system view and then decide whether 
something needs to be cleaned up before archiving anything.

Cons:

Again, the single-responsibility rule would be broken. It would be neater if 
the cleaner took care of deleting data files and archival only took care of 
deleting/archiving timeline files.

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track another metadata entry named "EarliestCommitToArchive". Strictly 
speaking, ear

[jira] [Closed] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-05-30 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7407.
-
Resolution: Fixed

> Add optional clean support to standalone compaction and clustering jobs
> ---
>
> Key: HUDI-7407
> URL: https://issues.apache.org/jira/browse/HUDI-7407
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Lets add top level config to standalone compaction and clustering job to 
> optionally clean. 





[jira] [Updated] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7815:
-
Labels: pull-request-available  (was: )

> Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh 
> timeline
> 
>
> Key: HUDI-7815
> URL: https://issues.apache.org/jira/browse/HUDI-7815
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7815) Multiple writer with bulkinsert getAllPendingClusteringPlans should refresh timeline

2024-05-30 Thread xy (Jira)
xy created HUDI-7815:


 Summary: Multiple writer with bulkinsert 
getAllPendingClusteringPlans should refresh timeline
 Key: HUDI-7815
 URL: https://issues.apache.org/jira/browse/HUDI-7815
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: xy
Assignee: xy








[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7807:

Sprint: Sprint 2023-04-26

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.

[jira] [Updated] (HUDI-7791) Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7791:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /packaging/hudi-metaserver-server-bundle
> ---
>
> Key: HUDI-7791
> URL: https://issues.apache.org/jira/browse/HUDI-7791
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7796) Gracefully cast file system instance in Avro writers

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7796:

Sprint: Sprint 2023-04-26

> Gracefully cast file system instance in Avro writers
> 
>
> Key: HUDI-7796
> URL: https://issues.apache.org/jira/browse/HUDI-7796
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> When running tests in Trino with the Hudi metadata table (MDT) enabled, the following line in 
> HoodieAvroHFileWriter throws a class cast exception, because Trino uses 
> dependency injection to provide the Hadoop file system instance, which may 
> skip the Hudi wrapper file system logic.
> {code:java}
>     this.fs = (HoodieWrapperFileSystem) this.file.getFileSystem(conf); {code}
> {code:java}
> Caused by: java.lang.ClassCastException: class 
> io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper cannot be cast to class 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem 
> (io.trino.hdfs.TrinoFileSystemCache$FileSystemWrapper and 
> org.apache.hudi.hadoop.fs.HoodieWrapperFileSystem are in unnamed module of 
> loader 'app')
>     at 
> org.apache.hudi.io.hadoop.HoodieAvroHFileWriter.(HoodieAvroHFileWriter.java:91)
>     at 
> org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory.newHFileFileWriter(HoodieAvroFileWriterFactory.java:108)
>     at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriterByFormat(HoodieFileWriterFactory.java:70)
>     at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:53)
>     at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:108)
>     at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:77)
>     at 
> org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:45)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:101)
>     at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:44)
>  {code}
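The fix the title suggests, casting gracefully instead of unconditionally, can be sketched with stand-in classes (FileSystemLike, HoodieWrapperFs, and ForeignWrapperFs below are illustrative only, not Hudi's real types): check the runtime type first and re-wrap foreign instances rather than casting blindly.

```java
// Minimal stand-in sketch of the "gracefully cast" pattern. The class names
// are illustrative; Hudi's real types are FileSystem and HoodieWrapperFileSystem.
interface FileSystemLike {}

class HoodieWrapperFs implements FileSystemLike {}   // stands in for HoodieWrapperFileSystem
class ForeignWrapperFs implements FileSystemLike {}  // stands in for an injected wrapper (e.g. Trino's)

public class GracefulCast {
    // Return the instance as-is when it is already the expected wrapper;
    // otherwise re-wrap it instead of throwing ClassCastException.
    static HoodieWrapperFs toHoodieFs(FileSystemLike fs) {
        if (fs instanceof HoodieWrapperFs) {
            return (HoodieWrapperFs) fs;
        }
        return new HoodieWrapperFs(); // re-wrap the foreign instance
    }

    public static void main(String[] args) {
        // The unconditional cast would fail here; the guarded version does not.
        HoodieWrapperFs fs = toHoodieFs(new ForeignWrapperFs());
        System.out.println(fs.getClass().getSimpleName()); // HoodieWrapperFs
    }
}
```

This only models the shape of the change; the actual Hudi patch may wrap differently.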





[jira] [Updated] (HUDI-7801) Directly pass down HoodieStorage instance instead of recreation

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7801:

Sprint: Sprint 2023-04-26

> Directly pass down HoodieStorage instance instead of recreation
> ---
>
> Key: HUDI-7801
> URL: https://issues.apache.org/jira/browse/HUDI-7801
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> There are places that use HoodieStorage#newInstance to recreate the HoodieStorage 
> instance, which may not be necessary.





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Sprint: Sprint 2023-04-26

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7769) Fix Hudi CDC read with legacy parquet file format on Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7769:

Sprint: Sprint 2023-04-26

> Fix Hudi CDC read with legacy parquet file format on Spark
> --
>
> Key: HUDI-7769
> URL: https://issues.apache.org/jira/browse/HUDI-7769
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Sprint: Sprint 2023-04-26

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without 
> "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
> query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
> Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at 
> org.apache.spark.sql.execution.datasource

[jira] [Updated] (HUDI-7790) Revert changes in DFSPathSelector and UtilHelpers.readConfig

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7790:

Sprint: Sprint 2023-04-26

> Revert changes in DFSPathSelector and UtilHelpers.readConfig
> 
>
> Key: HUDI-7790
> URL: https://issues.apache.org/jira/browse/HUDI-7790
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> This is to avoid behavior changes in DFSPathSelector and keep the 
> UtilHelpers.readConfig API the same as before.
>  





[jira] [Updated] (HUDI-7792) Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7792:

Sprint: Sprint 2023-04-26

> Bump h2 from 1.4.200 to 2.2.220 in /hudi-platform-service/hudi-metaserver
> -
>
> Key: HUDI-7792
> URL: https://issues.apache.org/jira/browse/HUDI-7792
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7784) Fix serde of HoodieHadoopConfiguration in Spark

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7784:

Sprint: Sprint 2023-04-26

> Fix serde of HoodieHadoopConfiguration in Spark
> ---
>
> Key: HUDI-7784
> URL: https://issues.apache.org/jira/browse/HUDI-7784
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7785) Keep public APIs in utilities module the same as before HoodieStorage abstraction

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7785:

Sprint: Sprint 2023-04-26

> Keep public APIs in utilities module the same as before HoodieStorage 
> abstraction
> -
>
> Key: HUDI-7785
> URL: https://issues.apache.org/jira/browse/HUDI-7785
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> BaseErrorTableWriter, HoodieStreamer, StreamSync, etc., are public API 
> classes and contain public API methods, which should be kept the same as 
> before.





[jira] [Updated] (HUDI-7794) Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7794:

Sprint: Sprint 2023-04-26

> Bump org.apache.hive:hive-service from 2.3.1 to 2.3.4
> -
>
> Key: HUDI-7794
> URL: https://issues.apache.org/jira/browse/HUDI-7794
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7798) Mark configs included in 0.15.0 release

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7798:

Sprint: Sprint 2023-04-26

> Mark configs included in 0.15.0 release
> ---
>
> Key: HUDI-7798
> URL: https://issues.apache.org/jira/browse/HUDI-7798
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need to mark the configs that go out in the 0.15.0 release with 
> `.sinceVersion("0.15.0")`.
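The tagging can be sketched with a minimal stand-in for the fluent ConfigProperty builder (the class below and the key name are illustrative, not Hudi's actual implementation):

```java
// Minimal stand-in for Hudi's ConfigProperty builder, showing the
// .sinceVersion("0.15.0") tag every config shipped in 0.15.0 should carry.
public class ConfigProperty<T> {
    private final String key;
    private final T defaultValue;
    private String sinceVersion;

    private ConfigProperty(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    static <T> ConfigProperty<T> key(String key, T defaultValue) {
        return new ConfigProperty<>(key, defaultValue);
    }

    ConfigProperty<T> sinceVersion(String version) {
        this.sinceVersion = version;
        return this; // fluent style, as in Hudi's builder
    }

    String getSinceVersion() { return sinceVersion; }

    public static void main(String[] args) {
        ConfigProperty<Boolean> flag = ConfigProperty
            .key("hoodie.example.flag", false) // made-up key, for illustration
            .sinceVersion("0.15.0");
        System.out.println(flag.getSinceVersion()); // 0.15.0
    }
}
```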





[jira] [Updated] (HUDI-7802) Fix bundle validation scripts

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7802:

Sprint: Sprint 2023-04-26

> Fix bundle validation scripts
> -
>
> Key: HUDI-7802
> URL: https://issues.apache.org/jira/browse/HUDI-7802
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Issues:
>  * Bundle validation with packaging/bundle-validation/ci_run.sh fails for 
> the release-0.15.0 branch due to a script issue.
>  * scripts/release/validate_staged_bundles.sh needs to include additional 
> bundles.
>  * Add release candidate validation on Scala 2.13 bundles.
>  * Disable release candidate validation by default.





[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Sprint: Sprint 2023-04-26

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>






[jira] [Updated] (HUDI-7786) Fix roaring bitmap dependency in hudi-integ-test-bundle

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7786:

Sprint: Sprint 2023-04-26

> Fix roaring bitmap dependency in hudi-integ-test-bundle
> ---
>
> Key: HUDI-7786
> URL: https://issues.apache.org/jira/browse/HUDI-7786
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7788) Fixing exception handling in AverageRecordSizeUtils

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7788:

Sprint: Sprint 2023-04-26

> Fixing exception handling in AverageRecordSizeUtils
> ---
>
> Key: HUDI-7788
> URL: https://issues.apache.org/jira/browse/HUDI-7788
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We should catch Throwable to avoid any issue during record size estimation.
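The described change amounts to widening the catch so that even Errors thrown during estimation fall back to a default size. A self-contained sketch (method and class names are illustrative, not the actual AverageRecordSizeUtils API):

```java
import java.util.function.LongSupplier;

public class SizeEstimate {
    // Fall back to a default estimate when anything, including Errors,
    // is thrown during record size estimation; estimation failures should
    // never fail the write path itself.
    static long averageRecordSize(LongSupplier estimator, long fallback) {
        try {
            return estimator.getAsLong();
        } catch (Throwable t) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(averageRecordSize(() -> 512L, 1024L)); // 512
        System.out.println(averageRecordSize(() -> { throw new OutOfMemoryError(); }, 1024L)); // 1024
    }
}
```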





[jira] [Updated] (HUDI-7777) Allow HoodieTableMetaClient to take HoodieStorage instance directly

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7777:

Sprint: Sprint 2023-04-26

> Allow HoodieTableMetaClient to take HoodieStorage instance directly
> 
>
> Key: HUDI-7777
> URL: https://issues.apache.org/jira/browse/HUDI-7777
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> We need the functionality for the meta client to take a HoodieStorage instance directly.





[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7814:

Fix Version/s: 1.0.0
   0.16.0

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.16.0
>
>






[jira] [Assigned] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7814:
---

Assignee: Ethan Guo

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7814:
-
Labels: pull-request-available  (was: )

> Exclude unused transitive dependencies that introduce vulnerabilities
> -
>
> Key: HUDI-7814
> URL: https://issues.apache.org/jira/browse/HUDI-7814
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7814) Exclude unused transitive dependencies that introduce vulnerabilities

2024-05-29 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7814:
---

 Summary: Exclude unused transitive dependencies that introduce 
vulnerabilities
 Key: HUDI-7814
 URL: https://issues.apache.org/jira/browse/HUDI-7814
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[jira] [Commented] (HUDI-7211) Relax need of ordering/precombine field for tables with autogenerated record keys for DeltaStreamer

2024-05-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850517#comment-17850517
 ] 

sivabalan narayanan commented on HUDI-7211:
---

For auto record key generation, you need to set the operation type to "INSERT". Could you 
give that a try?

> Relax need of ordering/precombine field for tables with autogenerated record 
> keys for DeltaStreamer
> ---
>
> Key: HUDI-7211
> URL: https://issues.apache.org/jira/browse/HUDI-7211
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> [https://github.com/apache/hudi/issues/10233]
>  
> ```
> NOW=$(date '+%Y%m%dt%H%M%S')
> ${SPARK_HOME}/bin/spark-submit \
> --jars 
> ${path_prefix}/jars/${SPARK_V}/hudi-spark${SPARK_VERSION}-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> ${path_prefix}/jars/${SPARK_V}/hudi-utilities-slim-bundle_2.12-${HUDI_VERSION}.jar
>  \
> --target-base-path ${path_prefix}/testcases/stocks/data/target/${NOW} \
> --target-table stocks${NOW} \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --props ${path_prefix}/testcases/stocks/configs/hoodie.properties \
> --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.source.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --hoodie-conf 
> hoodie.deltastreamer.schemaprovider.target.schema.file=${path_prefix}/testcases/stocks/data/schema_without_ts.avsc
>  \
> --op UPSERT \
> --spark-master yarn \
> --hoodie-conf 
> hoodie.deltastreamer.source.dfs.root=${path_prefix}/testcases/stocks/data/source_without_ts
>  \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
> --hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
> --hoodie-conf hoodie.metadata.enable=true
> ```





[jira] [Created] (HUDI-7813) Hive Style partitioning on a bootstrap table is not configurable

2024-05-29 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7813:
-

 Summary: Hive Style partitioning on a bootstrap table is not 
configurable
 Key: HUDI-7813
 URL: https://issues.apache.org/jira/browse/HUDI-7813
 Project: Apache Hudi
  Issue Type: Bug
  Components: bootstrap
Reporter: Jonathan Vexler


I modified DecodedBootstrapPartitionPathTranslator to be:
{code:java}
import java.util.Arrays;
import java.util.stream.Collectors;

import org.apache.hudi.common.util.PartitionPathEncodeUtils;

public class DecodedBootstrapPartitionPathTranslator extends BootstrapPartitionPathTranslator {

  public DecodedBootstrapPartitionPathTranslator() {
    super();
  }

  @Override
  public String getBootstrapTranslatedPath(String bootStrapPartitionPath) {
    String pathMaybeWithHive = PartitionPathEncodeUtils.unescapePathName(bootStrapPartitionPath);
    if (pathMaybeWithHive.contains("=")) {
      // Strip the "column=" prefix from each hive-style path segment.
      return Arrays.stream(pathMaybeWithHive.split("/")).map(split -> {
        if (split.contains("=")) {
          return split.split("=")[1];
        } else {
          return split;
        }
      }).collect(Collectors.joining("/"));
    }
    return pathMaybeWithHive;
  }
}{code}
Setting hive-style partitioning to true does not add the "column=" prefix back.
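For illustration, the stripping logic above can be exercised standalone (a self-contained copy using only the JDK; PathTranslate is a made-up name). It also shows why the prefix cannot be added back: the translation discards the column names.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Standalone copy of the translation logic above: each hive-style segment
// "column=value" is reduced to "value"; non-hive paths pass through unchanged.
public class PathTranslate {
    static String translate(String path) {
        if (!path.contains("=")) {
            return path;
        }
        return Arrays.stream(path.split("/"))
            .map(s -> s.contains("=") ? s.split("=")[1] : s)
            .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        System.out.println(translate("year=2024/month=05/day=29")); // 2024/05/29
        System.out.println(translate("2024/05/29"));                // unchanged
    }
}
```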





[jira] [Assigned] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7812:
-

Assignee: sivabalan narayanan

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time-travel query to 
> read input records. But the query side fails if there are any pending commits 
> (due to new ingestion) whose timestamp is earlier than the clustering instant 
> time. We need to relax this constraint. 
>  
> {code:java}
> Failed to execute CLUSTERING service
>     java.util.concurrent.CompletionException: 
> org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp 
> '20240406123837295' must be earlier than the first incomplete commit 
> timestamp '20240406123834233'.
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_392-internal]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_392-internal]
>         at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
>     Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time 
> travel's timestamp '20240406123837295' must be earlier than the first 
> incomplete commit timestamp '20240406123834233'.
>         at 
> org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at scala.Option.foreach(Option.scala:407) 
> ~[scala-library-2.12.17.jar:?]
>         at 
> org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68)
>  ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) 
> ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323)
>  ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
>         at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
>  ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
>         at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) 
> ~[scala-library-2.12.17.jar:?]
>         at scala.collection.Iterator$$anon$11

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer flow enabled, we trigger a time-travel query to read 
input records. But the query side fails if there are any pending commits (due 
to new ingestion) whose timestamp is earlier than the clustering instant time. We 
need to relax this constraint. 

 
{code:java}
Failed to execute CLUSTERING service
    java.util.concurrent.CompletionException: 
org.apache.hudi.exception.HoodieTimeTravelException: Time travel's timestamp 
'20240406123837295' must be earlier than the first incomplete commit timestamp 
'20240406123834233'.
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
 ~[?:1.8.0_392-internal]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_392-internal]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_392-internal]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392-internal]
    Caused by: org.apache.hudi.exception.HoodieTimeTravelException: Time 
travel's timestamp '20240406123837295' must be earlier than the first 
incomplete commit timestamp '20240406123834233'.
        at 
org.apache.hudi.common.table.timeline.TimelineUtils.validateTimestampAsOf(TimelineUtils.java:369)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.$anonfun$listLatestFileSlices$1$adapted(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?]
        at 
org.apache.hudi.HoodieBaseRelation.listLatestFileSlices(HoodieBaseRelation.scala:416)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:225)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:68)
 ~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:369) 
~[hudi-utilities-bundle_2.12-1.8.1-INTERNAL.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:323)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:357)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:413)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:356)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:323)
 ~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
 ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) 
~[scala-library-2.12.17.jar:?]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) 
~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) 
~[spark-sql_2.12-3.2.3.jar:1.8.1-INTERNAL]
        at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
 ~[spark-catalyst_2.12-3.2.3.jar:3.2.3]
        at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196) 
~[scala-library-2.12.17.jar:?]
        at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194) 
~[scala-library-2.12.17.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:943) 
~[scala-library-2.12.17.

[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7812:
--
Description: 
With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 

 

 

  was:
With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 


> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp < clustering instant time. We 
> need to relax this constraint. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7812:
-
Labels: pull-request-available  (was: )

> Async Clustering w/ row writer fails due to timetravel query validation 
> 
>
> Key: HUDI-7812
> URL: https://issues.apache.org/jira/browse/HUDI-7812
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: clustering
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the clustering row-writer flow enabled, we trigger a time travel query 
> to read input records. But the query side fails if there are any pending 
> commits (due to new ingestion) whose timestamp < clustering instant time. We 
> need to relax this constraint. 
>  





[jira] [Created] (HUDI-7812) Async Clustering w/ row writer fails due to timetravel query validation

2024-05-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7812:
-

 Summary: Async Clustering w/ row writer fails due to timetravel 
query validation 
 Key: HUDI-7812
 URL: https://issues.apache.org/jira/browse/HUDI-7812
 Project: Apache Hudi
  Issue Type: Bug
  Components: clustering
Reporter: sivabalan narayanan


With the clustering row-writer flow enabled, we trigger a time travel query to 
read input records. But the query side fails if there are any pending commits 
(due to new ingestion) whose timestamp < clustering instant time. We need to 
relax this constraint. 

 



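The guard that fails above can be sketched roughly as follows. This is a minimal Python illustration of the validation logic described in HUDI-7812, with hypothetical names (Hudi's actual check lives in Java, in `TimelineUtils.validateTimestampAsOf`): a time travel query's "as of" timestamp must be strictly earlier than the first incomplete commit on the timeline, which is exactly what a clustering instant reading as of its own time violates when an earlier ingestion commit is still pending.

```python
# Simplified sketch (hypothetical names, not Hudi's actual TimelineUtils code)
# of the time-travel guard: the "as of" timestamp must be earlier than the
# first incomplete commit on the timeline.
def validate_timestamp_as_of(query_ts: str, pending_commits: list) -> None:
    # Hudi instant times are fixed-width yyyyMMddHHmmssSSS strings, so
    # lexicographic order matches chronological order.
    if pending_commits:
        first_incomplete = min(pending_commits)
        if query_ts >= first_incomplete:
            raise ValueError(
                f"Time travel's timestamp '{query_ts}' must be earlier than "
                f"the first incomplete commit timestamp '{first_incomplete}'."
            )

# Mirrors the failure above: the clustering instant reads as of its own time
# while an earlier ingestion commit (20240406123834233) is still pending.
try:
    validate_timestamp_as_of("20240406123837295", ["20240406123834233"])
except ValueError as exc:
    print(f"rejected: {exc}")
```

Relaxing the constraint, as the ticket proposes, would mean allowing the clustering read path to tolerate pending ingestion commits with earlier timestamps instead of rejecting the query outright.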


[jira] [Created] (HUDI-7811) Enhance SparkBaseIndexSupport.getPrunedFileNames to return partition path

2024-05-29 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-7811:
-

 Summary: Enhance SparkBaseIndexSupport.getPrunedFileNames to 
return partition path
 Key: HUDI-7811
 URL: https://issues.apache.org/jira/browse/HUDI-7811
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.0.0


It will help avoid calling FSUtils.getRelativePartitionPath - 
https://github.com/apache/hudi/pull/11043#discussion_r1611744651





[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread bradley (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bradley updated HUDI-7810:
--
Description: Fixed in PR: [https://github.com/apache/hudi/pull/11359]  
(was: Fix OptionsResolver#allowCommitOnEmptyBatch default value bug)

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fixed in PR: [https://github.com/apache/hudi/pull/11359]





[jira] [Updated] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7810:
-
Labels: pull-request-available  (was: )

> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug
> -
>
> Key: HUDI-7810
> URL: https://issues.apache.org/jira/browse/HUDI-7810
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: bradley
>Priority: Major
>  Labels: pull-request-available
>
> Fix OptionsResolver#allowCommitOnEmptyBatch default value bug





[jira] [Created] (HUDI-7810) Fix OptionsResolver#allowCommitOnEmptyBatch default value bug

2024-05-29 Thread bradley (Jira)
bradley created HUDI-7810:
-

 Summary: Fix OptionsResolver#allowCommitOnEmptyBatch default value 
bug
 Key: HUDI-7810
 URL: https://issues.apache.org/jira/browse/HUDI-7810
 Project: Apache Hudi
  Issue Type: Bug
Reporter: bradley


Fix OptionsResolver#allowCommitOnEmptyBatch default value bug





[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7808:
-
Labels: pull-request-available  (was: )

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7809:
-
Labels: hoodie-storage pull-request-available  (was: hoodie-storage)

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> With Hudi 0.14.1, without 
> "spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
> query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without 
> the Kryo registrar, the Hudi read throws an NPE due to 
> HadoopStorageConfiguration.
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
>   ... 47 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at 
> org.apa

[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With Hudi 0.14.1, without 
"spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
Kryo registrar, the Hudi read throws an NPE due to HadoopStorageConfiguration.
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsR

[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With Hudi 0.14.1, without 
"spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar", the Hudi 
query in the Spark quick start guide succeeds. In Hudi 0.15.0-rc2, without the 
Kryo registrar, the 
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRD

[jira] [Updated] (HUDI-7779) Guarding archival to not archive unintended commits

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7779:
--
Description: 
Archiving commits from the active timeline could lead to data consistency 
issues on rare occasions. We should come up with proper guards to ensure we do 
not perform such unintended archival. 

 

The major gap we want to guard against is:

If someone disabled the cleaner, archival should account for data consistency 
issues and ensure it bails out.

We have a base guarding condition, where archival will stop at the earliest 
commit to retain based on the latest clean commit metadata. But there are a few 
other scenarios that need to be accounted for. 

 

a. Keeping aside replace commits, let's dive into the specifics for regular 
commits and delta commits.

Say the user configured the cleaner to retain 4 commits and the archival 
configs to 5 and 6. After t10, the cleaner is supposed to clean up all file 
versions created at or before t6. Say the cleaner did not run (for whatever 
reason) for the next 5 commits. 

    Archival will certainly be guarded until the earliest commit to retain 
based on the latest clean commit. 

Corner case to consider: 

A savepoint was added at, say, t3 and later removed, and the cleaner was still 
never re-enabled. Archival would have stopped at t3 while the savepoint was 
present, but once the savepoint is removed, if archival is executed, it could 
archive commit t3. That means the file versions tracked at t3 have still not 
been cleaned by the cleaner. 

Reasoning: 

We are good here w.r.t. data consistency. Until the cleaner runs next, these 
older file versions might be exposed to the end user. But a time travel query 
is not intended for already-cleaned-up commits, so this is not an issue. None 
of snapshot, time travel, or incremental queries will run into issues, as they 
are not supposed to poll for t3. 

If the cleaner is re-enabled at any later point, it will take care of cleaning 
up the file versions tracked at commit t3. Just that, for the interim period, 
some older file versions might still be exposed to readers. 

 

b. The trickier part is when replace commits are involved. Since the replace 
commit metadata in the active timeline is what ensures the replaced file groups 
are ignored for reads, before archiving it, the cleaner is expected to clean 
them up fully. But are there chances that this could go wrong? 

Corner case to consider: let's add onto the above scenario, where t3 has a 
savepoint, and t4 is a replace commit which replaced file groups tracked in t3. 

The cleaner will skip cleaning up files tracked by t3 (due to the presence of 
the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to 
retain will point to t6. Now say the savepoint for t3 is removed, but the 
cleaner is disabled. In this state of the timeline, if archival is executed 
(since t3's savepoint is removed), it might archive t3 and t4.rc. This could 
lead to data duplicates, as both the replaced file groups and the new file 
groups from t4.rc would be exposed as valid file groups. 

 

In other words, to summarize the different scenarios: 

i. The replaced file group is never cleaned up. 
    - ECTR (earliest commit to retain) is less than this.rc, and we are good. 
ii. The replaced file group is cleaned up. 
    - ECTR is > this.rc, and it is good to archive.
iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full 
clean-up did not happen. After the savepoint is removed and archival is 
executed, we should avoid archiving the rc of interest. This is the gap we do 
not account for as of now.

 

We have 3 options to solve this.

Option A: 

Let the savepoint-deletion flow take care of cleaning up the files the 
savepoint is tracking. 

Cons:

A savepoint's responsibility does not include removing data files, so from a 
single-responsibility standpoint this may not be right. Also, this clean-up 
might need to do what a clean planner actually does: build the file system 
view, determine whether the files were already supposed to be cleaned up, and 
only then delete the ones that qualify. For example, a file group with only one 
file slice should not be cleaned up, and there are more scenarios like this. 

 

Option B:

Since archival is what might cause the data consistency issues, why not have 
archival do the clean-up? 

We need to account for concurrent cleans, failure-and-retry scenarios, etc. 
Also, we might need to build the file system view and then decide whether 
something needs to be cleaned up before archiving it. 

Cons:

Again, the single-responsibility rule might be broken. It would be neat if the 
cleaner took care of deleting data files and archival only took care of 
deleting/archiving timeline files. 

 

Option C:

Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner 
track another metadata field named "EarliestCommitToArchive". Strictly 
speaking, ear
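Option C can be sketched as a second watermark that archival must also respect. The following is a hypothetical Python illustration (names and structure are mine, not Hudi's actual API): the cleaner publishes an "earliest commit to archive" watermark alongside the usual earliest commit to retain, and archival only touches instants strictly before both, so a replace commit whose replaced files were skipped (e.g. due to a since-removed savepoint) stays on the active timeline.

```python
# Hypothetical sketch of Option C: besides the usual "earliest commit to
# retain" (ECTR), the cleaner publishes an "earliest commit to archive"
# watermark, and archival only touches instants strictly before both.
def commits_safe_to_archive(timeline,
                            earliest_commit_to_retain,
                            earliest_commit_to_archive):
    # Instant strings compare lexicographically in timeline order here.
    bound = min(earliest_commit_to_retain, earliest_commit_to_archive)
    return [t for t in timeline if t < bound]

# Scenario from the description: a savepoint on t3 made the cleaner skip its
# files, so it holds the archive watermark at t3 even though ECTR moved to t6.
# t3 and the replace commit t4 then stay on the active timeline.
timeline = ["t1", "t2", "t3", "t4", "t5", "t6"]
print(commits_safe_to_archive(timeline, "t6", "t3"))  # ['t1', 't2']
```

Once the cleaner actually cleans up the files tracked at t3 and t4, it would advance the archive watermark and archival could proceed past them.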

[jira] [Updated] (HUDI-7655) Support configuration for clean to fail execution if there is at least one file is marked as a failed delete

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7655:
--
Fix Version/s: 1.0.0

> Support configuration for clean to fail execution if there is at least one 
> file is marked as a failed delete
> 
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: clean, pull-request-available
> Fix For: 1.0.0
>
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed 
> as deleted (or non-existing) will be marked as a "failed delete". Although 
> these failed deletes will be added to `.clean` metadata, if incremental clean 
> is used then these files might never be picked up again by a future clean 
> plan, unless a "full-scan" clean ends up being scheduled. In addition to 
> leaving more files unnecessarily taking up storage space for longer, this 
> can lead to the following dataset consistency issue for COW datasets:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some completed instants later an incremental clean is scheduled. It moves 
> the "earliest commit to retain" to a time after instant time RC2, so it 
> targets f1 for deletion. But during execution of the plan, it fails to delete 
> f1.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition directly 
> from the DFS file system calls (without directly using MDT FILES partition) 
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in 
> active timeline. This is a data consistency issue, and will only be resolved 
> if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to 
> fail execution of a clean plan unless all files are confirmed as deleted (or 
> not existing in DFS already), "blocking" the clean. The next clean attempt 
> will re-execute this existing plan, since clean plans cannot be "rolled 
> back". 
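The proposed configuration can be sketched as a strict-mode check at the end of clean execution. This is a minimal illustration of the idea only: the class, method, and flag names below are hypothetical, not Hudi's actual clean-action API.

```java
// Sketch of HUDI-7655's proposal: if a strict flag is set and any file targeted
// by the clean plan could not be confirmed as deleted, fail the clean so the
// same plan is re-executed on the next attempt (clean plans cannot be rolled
// back). All names here are illustrative.
import java.util.List;

public class CleanStrictMode {

    /** Returns true when the clean must be failed instead of committed. */
    public static boolean shouldFailClean(long failedDeletes, boolean failOnFailedDeletes) {
        return failOnFailedDeletes && failedDeletes > 0;
    }

    public static void executeClean(List<String> targets, List<String> failed, boolean strict) {
        if (shouldFailClean(failed.size(), strict)) {
            throw new IllegalStateException("Clean left " + failed.size() + " of "
                + targets.size() + " files undeleted; failing so the same plan is retried");
        }
    }

    public static void main(String[] args) {
        // Non-strict (current behavior): failed deletes are only recorded in .clean metadata.
        executeClean(List.of("f1", "f2"), List.of("f1"), false);
        // Strict (proposed): the same situation blocks the clean.
        try {
            executeClean(List.of("f1", "f2"), List.of("f1"), true);
            throw new AssertionError("expected the clean to fail");
        } catch (IllegalStateException expected) {
            System.out.println("clean failed as configured: " + expected.getMessage());
        }
    }
}
```

With such a flag enabled, step 4 of the scenario above would surface as a clean failure instead of silently dropping f1 from future plans.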



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Description: 
With 0.14
{code:java}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
  at scala.Option.foreach(Option.scala:407)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2224)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2245)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2264)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:492)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:445)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:152)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90

[jira] [Assigned] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7809:
---

Assignee: Ethan Guo

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7809:
---

 Summary: Use Spark SerializableConfiguration to avoid NPE in Kryo 
serde
 Key: HUDI-7809
 URL: https://issues.apache.org/jira/browse/HUDI-7809
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Fix Version/s: 0.15.0
   1.0.0

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7809) Use Spark SerializableConfiguration to avoid NPE in Kryo serde

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7809:

Labels: hoodie-storage  (was: )

> Use Spark SerializableConfiguration to avoid NPE in Kryo serde
> --
>
> Key: HUDI-7809
> URL: https://issues.apache.org/jira/browse/HUDI-7809
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5505) Compaction NUM_COMMITS policy should only judge completed deltacommit

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-5505.

Fix Version/s: 1.0.0
Reviewers: Danny Chen
   Resolution: Fixed

Fixed via master branch: 42243862f0271fda16e70afdbfde61b47792ff70

> Compaction NUM_COMMITS policy should only judge completed deltacommit
> -
>
> Key: HUDI-5505
> URL: https://issues.apache.org/jira/browse/HUDI-5505
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, table-service
>Reporter: HunterXHunter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2023-01-05-13-10-57-918.png
>
>
> `compaction.delta_commits =1`
>  
> {code:java}
> 20230105115229301.deltacommit
> 20230105115229301.deltacommit.inflight
> 20230105115229301.deltacommit.requested
> 20230105115253118.commit
> 20230105115253118.compaction.inflight
> 20230105115253118.compaction.requested
> 20230105115330994.deltacommit.inflight
> 20230105115330994.deltacommit.requested{code}
> The return result of `ScheduleCompactionActionExecutor.needCompact` is 
> `true`, which is not expected.
>  
> And in OCC or lazy-clean mode, this will cause compaction to trigger 
> early.
> `compaction.delta_commits =3`
>  
> {code:java}
> 20230105125650541.deltacommit.inflight
> 20230105125650541.deltacommit.requested
> 20230105125715081.deltacommit
> 20230105125715081.deltacommit.inflight
> 20230105125715081.deltacommit.requested
> 20230105130018070.deltacommit.inflight
> 20230105130018070.deltacommit.requested {code}
>  
> And compaction will be triggered, which is not expected.
> !image-2023-01-05-13-10-57-918.png|width=699,height=158!
>  
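The fix described above amounts to counting only instants whose timeline file is a completed `.deltacommit` (no `.inflight`/`.requested` suffix) and newer than the last compaction. A hedged sketch of that check, using file-name parsing for illustration only (Hudi's real timeline API is different):

```java
// Sketch of the NUM_COMMITS trigger from HUDI-5505: count only *completed*
// deltacommits after the last completed compaction instant. Timeline instant
// timestamps sort lexicographically, so a plain string compare orders them.
import java.util.List;

public class CompactionTrigger {

    /** Completed deltacommits strictly after the given compaction instant time. */
    public static long completedDeltaCommitsSince(List<String> timelineFiles,
                                                  String lastCompactionInstant) {
        return timelineFiles.stream()
            .filter(f -> f.endsWith(".deltacommit"))          // completed only
            .filter(f -> f.compareTo(lastCompactionInstant) > 0)
            .count();
    }

    public static boolean needCompact(List<String> timelineFiles,
                                      String lastCompactionInstant,
                                      int deltaCommitsThreshold) {
        return completedDeltaCommitsSince(timelineFiles, lastCompactionInstant)
            >= deltaCommitsThreshold;
    }

    public static void main(String[] args) {
        // Timeline from the ticket's first example: the only completed
        // deltacommit predates the compaction at 20230105115253118, and the
        // later deltacommit is still pending.
        List<String> timeline = List.of(
            "20230105115229301.deltacommit",
            "20230105115229301.deltacommit.inflight",
            "20230105115229301.deltacommit.requested",
            "20230105115330994.deltacommit.inflight",
            "20230105115330994.deltacommit.requested");
        // With compaction.delta_commits = 1, no new compaction should be scheduled.
        System.out.println(needCompact(timeline, "20230105115253118", 1)); // false
    }
}
```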



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang reassigned HUDI-7806:
-

Assignee: Davis Zhang

> Skip fail on data-loss for first commit on Kafka Source
> ---
>
> Key: HUDI-7806
> URL: https://issues.apache.org/jira/browse/HUDI-7806
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Davis Zhang
>Assignee: Davis Zhang
>Priority: Major
>
> When the ingestion attempts to start from the beginning of the topic, we 
> should not fail on data loss since topic retention can cause failures when 
> some data is removed before our ingestion is able to fully read the offsets.
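The requested behavior can be sketched as an offset-resolution rule: treat a retention-induced gap as data loss only when a previous commit actually checkpointed an offset. This is a minimal illustration under stated assumptions; the names are hypothetical and do not match Hudi's KafkaOffsetGen.

```java
// Sketch of HUDI-7806: on the first commit (no checkpoint yet), starting from
// the earliest *retained* offset is not data loss, even if retention already
// removed the true beginning of the topic. Data loss is flagged only when a
// checkpointed offset has been aged out.
public class KafkaOffsetPolicy {

    /**
     * @param checkpointOffset  offset recorded by the last commit, or -1 if none yet
     * @param earliestAvailable earliest offset Kafka still retains
     * @param failOnDataLoss    user configuration
     * @return offset to resume reading from
     */
    public static long resolveStartOffset(long checkpointOffset, long earliestAvailable,
                                          boolean failOnDataLoss) {
        boolean firstCommit = checkpointOffset < 0;
        boolean dataLost = !firstCommit && checkpointOffset < earliestAvailable;
        if (dataLost && failOnDataLoss) {
            throw new IllegalStateException("Offsets " + checkpointOffset + ".."
                + earliestAvailable + " were removed by topic retention");
        }
        // First commit, or fail-on-data-loss disabled: start from what is retained.
        return Math.max(checkpointOffset, earliestAvailable);
    }

    public static void main(String[] args) {
        // First ingestion from the beginning of the topic; retention has moved
        // the earliest offset to 500. Do not fail -- start at 500.
        System.out.println(resolveStartOffset(-1, 500, true)); // 500
    }
}
```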



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7808:
---

 Summary: Security upgrade io.acryl:datahub-client from 0.8.31 to 
0.8.45
 Key: HUDI-7808
 URL: https://issues.apache.org/jira/browse/HUDI-7808
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7808:

Fix Version/s: 1.0.0

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7808) Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45

2024-05-28 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7808:
---

Assignee: Ethan Guo

> Security upgrade io.acryl:datahub-client from 0.8.31 to 0.8.45
> --
>
> Key: HUDI-7808
> URL: https://issues.apache.org/jira/browse/HUDI-7808
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7807:
-
Labels: pull-request-available  (was: )

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNod

[jira] [Created] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7807:
-

 Summary: spark-sql updates for a pk less table fails w/ 
partitioned table 
 Key: HUDI-7807
 URL: https://issues.apache.org/jira/browse/HUDI-7807
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: sivabalan narayanan


quick start fails when trying to UPDATE with spark-sql for a pk less table. 

 
{code:java}
         > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET fare 
= 25.0 WHERE rider = 'rider-D']
org.apache.hudi.exception.HoodieException: Unable to instantiate class 
org.apache.hudi.keygen.SimpleKeyGenerator
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
at 
org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
at 
org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
at 
org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
at 
org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:

[jira] [Updated] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7807:
--
Fix Version/s: 0.15.0
   1.0.0

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.exe

[jira] [Assigned] (HUDI-7807) spark-sql updates for a pk less table fails w/ partitioned table

2024-05-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7807:
-

Assignee: sivabalan narayanan

> spark-sql updates for a pk less table fails w/ partitioned table 
> -
>
> Key: HUDI-7807
> URL: https://issues.apache.org/jira/browse/HUDI-7807
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> quick start fails when trying to UPDATE with spark-sql for a pk less table. 
>  
> {code:java}
>          > UPDATE hudi_table4 SET fare = 25.0 WHERE rider = 'rider-D';
> 24/05/28 11:44:41 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> 24/05/28 11:44:41 ERROR SparkSQLDriver: Failed in [UPDATE hudi_table4 SET 
> fare = 25.0 WHERE rider = 'rider-D']
> org.apache.hudi.exception.HoodieException: Unable to instantiate class 
> org.apache.hudi.keygen.SimpleKeyGenerator
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:75)
>   at 
> org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:123)
>   at 
> org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory.createKeyGenerator(HoodieSparkKeyGeneratorFactory.java:91)
>   at 
> org.apache.hudi.util.SparkKeyGenUtils$.getPartitionColumns(SparkKeyGenUtils.scala:47)
>   at 
> org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:218)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
>   at 
> org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.execution.QueryExecution.eagerlyE

[jira] [Created] (HUDI-7806) Skip fail on data-loss for first commit on Kafka Source

2024-05-28 Thread Davis Zhang (Jira)
Davis Zhang created HUDI-7806:
-

 Summary: Skip fail on data-loss for first commit on Kafka Source
 Key: HUDI-7806
 URL: https://issues.apache.org/jira/browse/HUDI-7806
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Davis Zhang


When ingestion starts from the beginning of the topic, we should not fail on 
data loss: topic retention can delete records before ingestion has fully read 
the earliest offsets, which would otherwise trigger a spurious failure on the 
first commit.
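The proposed behavior can be reduced to a small decision rule. The sketch below is illustrative only; `DataLossPolicy` and `shouldFailOnDataLoss` are hypothetical names, not Hudi's actual Kafka source API. The idea: when there is no prior checkpoint (first commit), a retention-advanced earliest offset is not data loss, because the job intended to read from the beginning anyway; once a checkpoint exists, retention moving past it is real data loss.

```java
// Hypothetical sketch of the proposed fail-on-data-loss policy for a Kafka
// source; names are illustrative and not part of Hudi's actual API.
final class DataLossPolicy {
    /**
     * @param hasPriorCheckpoint whether a previous commit recorded Kafka offsets
     * @param checkpointOffset   offset recorded for a partition (ignored when no checkpoint)
     * @param earliestOffset     earliest offset currently retained by the broker
     * @return true if the source should abort with a data-loss error
     */
    static boolean shouldFailOnDataLoss(boolean hasPriorCheckpoint,
                                        long checkpointOffset,
                                        long earliestOffset) {
        if (!hasPriorCheckpoint) {
            // First commit: reading from the beginning was the intent, so a
            // retention-advanced earliest offset is not data loss for this job.
            return false;
        }
        // Subsequent commits: retention moved past our checkpoint -> real loss.
        return checkpointOffset < earliestOffset;
    }

    public static void main(String[] args) {
        System.out.println(shouldFailOnDataLoss(false, 0L, 500L));  // first commit: false
        System.out.println(shouldFailOnDataLoss(true, 100L, 500L)); // real loss: true
        System.out.println(shouldFailOnDataLoss(true, 600L, 500L)); // checkpoint still valid: false
    }
}
```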



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7805) FileSystemBasedLockProvider should auto-delete the lock file on lock conflict to avoid failing the next write

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7805:
-
Labels: pull-request-available  (was: )

> FileSystemBasedLockProvider should auto-delete the lock file on lock 
> conflict to avoid failing the next write
> --
>
> Key: HUDI-7805
> URL: https://issues.apache.org/jira/browse/HUDI-7805
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>
> org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock 
> object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock
>   at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100)
>   at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58)
>   at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258)
>   at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301)
>   at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
>   at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396)
>   at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
>   at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
>   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
>   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSess

[jira] [Created] (HUDI-7805) FileSystemBasedLockProvider should auto-delete the lock file on lock conflict to avoid failing the next write

2024-05-28 Thread xy (Jira)
xy created HUDI-7805:


 Summary: FileSystemBasedLockProvider should auto-delete the lock file 
on lock conflict to avoid failing the next write
 Key: HUDI-7805
 URL: https://issues.apache.org/jira/browse/HUDI-7805
 Project: Apache Hudi
  Issue Type: Improvement
  Components: multi-writer
Reporter: xy
Assignee: xy


org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock 
object hdfs://aa-region/region04/2211/warehouse/hudi/odsmon_log/.hoodie/lock
  at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:100)
  at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:58)
  at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1258)
  at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1301)
  at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
  at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:216)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:396)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
  at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
  at com.vivo.bigdata.etl.process.EtlProcessMain$.main(EtlProcessMain.scala:367)
  at com.vivo.bigdata.etl.process.EtlProcessMain.main(EtlProcessMain.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
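One plausible shape for the requested behavior is a stale-lock reclaim: if acquiring the lock fails because a lock file already exists and that file is older than an expiry threshold, treat it as leftover from a crashed writer, delete it, and retry. The sketch below uses plain `java.nio.file` and is not Hudi's actual `FileSystemBasedLockProvider`; the class name, expiry policy, and threshold are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of an auto-cleaning filesystem lock (not Hudi's actual
// FileSystemBasedLockProvider): reclaim the lock file only when it looks stale.
final class ExpiringFileLock {
    private final Path lockFile;
    private final long expiryMillis;

    ExpiringFileLock(Path lockFile, long expiryMillis) {
        this.lockFile = lockFile;
        this.expiryMillis = expiryMillis;
    }

    boolean tryLock() throws IOException {
        if (createLockFile()) {
            return true;
        }
        // Lock file already exists; delete and retry only if it is stale.
        long ageMillis = System.currentTimeMillis()
                - Files.getLastModifiedTime(lockFile).toMillis();
        if (ageMillis > expiryMillis) {
            Files.deleteIfExists(lockFile); // stale lock from a failed writer
            return createLockFile();        // retry once after cleanup
        }
        return false; // lock is fresh, assume it is held by a live writer
    }

    void unlock() throws IOException {
        Files.deleteIfExists(lockFile);
    }

    private boolean createLockFile() throws IOException {
        try {
            Files.createFile(lockFile); // atomic create-if-absent
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("hudi-lock-demo");
        ExpiringFileLock lock =
                new ExpiringFileLock(dir.resolve("lock"), TimeUnit.MINUTES.toMillis(5));
        System.out.println(lock.tryLock()); // true: no existing lock file
        System.out.println(lock.tryLock()); // false: fresh lock still held
        lock.unlock();
    }
}
```

The expiry check matters: deleting the lock file unconditionally on conflict would let two live writers proceed concurrently, so any auto-cleanup has to distinguish a crashed writer's leftover file from a lock that is genuinely held.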

[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7804:
-
Sprint: Sprint 2023-04-26

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Assigned] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-7804:


Assignee: Danny Chen

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7804:
-
Labels: pull-request-available  (was: )

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/11288





[jira] [Updated] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread xi chaomin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xi chaomin updated HUDI-7804:
-
Description: https://github.com/apache/hudi/issues/11288

> Improve flink bucket index partitioner
> --
>
> Key: HUDI-7804
> URL: https://issues.apache.org/jira/browse/HUDI-7804
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: xi chaomin
>Priority: Major
>
> https://github.com/apache/hudi/issues/11288





[jira] [Created] (HUDI-7804) Improve flink bucket index partitioner

2024-05-28 Thread xi chaomin (Jira)
xi chaomin created HUDI-7804:


 Summary: Improve flink bucket index partitioner
 Key: HUDI-7804
 URL: https://issues.apache.org/jira/browse/HUDI-7804
 Project: Apache Hudi
  Issue Type: Bug
Reporter: xi chaomin








[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Status: Patch Available  (was: In Progress)

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Status: In Progress  (was: Open)

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-05-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7507:
-
Labels: pull-request-available  (was: )

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> *Scenarios:*
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is ingestion commit and Job 1 is compaction/log compaction, then 
> when Job 1 runs before Job 2 and can create a compaction plan for all instant 
> times (up to (x) ) that doesn’t include instant time (x-1) .  Later Job 2 
> will create instant time (x-1), but timeline will be in a corrupted state 
> since compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
>  ** If the completed commit files include some sort of "checkpointing" with 
> another "downstream job" performing incremental reads on this dataset (such 
> as Hoodie Streamer/DeltaSync) then there may be incorrect behavior, such as 
> the incremental reader skipping some completed commits (that have a smaller 
> instant timestamp than latest completed commit but were created after).
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
> *Proposed approach:*
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
> Approach A
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately (A) has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this and would 
> require deprecating those APIs.
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline that are greater than it that could cause a conflict. If that 
> assertion fails, then throw a retry-able conflict resolution exception.
> Specifically, the following steps should be followed whenever any instant 
> (commit, table service, etc) is scheduled
> Approach B
>  # Acquire table lock. Assume that the desired instant time C and requested 
> file plan metadata have already been created, regardless of whether it was 
> before this step or right after acquiring the table lock.
>  # If there are any instants on the timeline that are greater than C 
> (regardless of their operation type or state) then release table lock 
> and throw an exception
>  # Create requested plan on timeline (As usual)
>  # Release table lock
> Unlike (A), thi
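The core of Approach B can be sketched as a guarded schedule step: under the table lock, reject the requested instant C if any instant on the timeline is greater than C. The class and method names below are hypothetical (not Hudi's timeline API), and the timeline is modeled as an in-memory sorted set for illustration.

```java
import java.util.TreeSet;

// Illustrative sketch of Approach B; TimelineGuard/scheduleInstant are
// hypothetical names, and the "timeline" is just an in-memory sorted set.
final class TimelineGuard {
    private final TreeSet<String> instantTimes = new TreeSet<>();
    private final Object tableLock = new Object();

    /** Creates the requested instant, or throws if a later instant already exists. */
    void scheduleInstant(String instantTime) {
        synchronized (tableLock) { // step 1: acquire the table lock
            String latest = instantTimes.isEmpty() ? null : instantTimes.last();
            if (latest != null && latest.compareTo(instantTime) > 0) {
                // step 2: a greater instant is on the timeline -> retryable conflict
                throw new IllegalStateException(
                        "Conflict: instant " + latest + " is newer than " + instantTime);
            }
            instantTimes.add(instantTime); // step 3: create the .requested "file"
        } // step 4: release the table lock
    }

    public static void main(String[] args) {
        TimelineGuard timeline = new TimelineGuard();
        timeline.scheduleInstant("20240528100000"); // ok: timeline is empty
        timeline.scheduleInstant("20240528100002"); // ok: newer than latest
        try {
            timeline.scheduleInstant("20240528100001"); // older timestamp -> rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This keeps plan computation outside the lock (unlike Approach A) at the cost of making the out-of-order writer abort and retry with a fresh timestamp.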

[jira] [Updated] (HUDI-7795) Fix loading of input splits from look up table reader

2024-05-27 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7795:
-
Fix Version/s: 1.0.0

> Fix loading of input splits from look up table reader
> -
>
> Key: HUDI-7795
> URL: https://issues.apache.org/jira/browse/HUDI-7795
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

