[jira] [Created] (HUDI-7716) Add more logs around index lookup

2024-05-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7716:
-

 Summary: Add more logs around index lookup
 Key: HUDI-7716
 URL: https://issues.apache.org/jira/browse/HUDI-7716
 Project: Apache Hudi
  Issue Type: Improvement
  Components: index
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7712) Account for file slices instead of just base files while initializing RLI for MOR table

2024-05-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7712:
-

 Summary: Account for file slices instead of just base files while 
initializing RLI for MOR table
 Key: HUDI-7712
 URL: https://issues.apache.org/jira/browse/HUDI-7712
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


We could have deletes in log files, and hence we need to account for the entire 
file slice instead of just base files while initializing RLI for a MOR table. 
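The point can be sketched as follows (plain Python, not Hudi's API; the block and key structures are invented for illustration): a base-file-only scan keeps keys that a log file has already deleted, while a full file-slice scan applies the log blocks on top of the base file.

```python
# Illustrative sketch (not Hudi's API): why RLI initialization for a MOR
# table must read the whole file slice, not just the base file. Log files
# can carry deletes that invalidate keys present in the base file.

def init_rli_from_base_only(base_keys):
    """Naive approach: index every key found in the base file."""
    return set(base_keys)

def init_rli_from_file_slice(base_keys, log_blocks):
    """Apply log blocks (inserts and deletes) on top of the base file."""
    keys = set(base_keys)
    for block in log_blocks:
        if block["type"] == "delete":
            keys -= set(block["keys"])
        else:  # data block with new/updated records
            keys |= set(block["keys"])
    return keys

base = ["k1", "k2", "k3"]
logs = [{"type": "delete", "keys": ["k2"]}]

# Base-only initialization wrongly keeps the deleted key k2.
assert init_rli_from_base_only(base) == {"k1", "k2", "k3"}
assert init_rli_from_file_slice(base, logs) == {"k1", "k3"}
```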





[jira] [Assigned] (HUDI-7673) Enhance RLI validation w/ MDT validator for false positives

2024-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7673:
-

Assignee: sivabalan narayanan

> Enhance RLI validation w/ MDT validator for false positives
> ---
>
> Key: HUDI-7673
> URL: https://issues.apache.org/jira/browse/HUDI-7673
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> There is a chance that we could see false positive failures w/ MDT validation 
> when RLI is validated. 
>  
> When FS-based record key locations are polled, we could have a pending 
> commit, and when MDT is polled for record locations, the commit could have 
> been completed. As a result, RLI validation could return additional record 
> locations.
>  
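The race can be sketched as follows (a hedged illustration in plain Python; the function and snapshot parameter are invented, not the MDT validator's API): bounding both sides of the comparison to a snapshot instant captured once avoids flagging locations from a commit that completed between the two polls.

```python
# Hypothetical sketch of the false positive the ticket describes: the FS
# listing and the MDT are polled at different times, so a commit that
# completes in between makes MDT appear to have "extra" record locations.
# A hedged fix: restrict both sides to commits completed at or before a
# snapshot instant captured once, up front.

def validate_rli(fs_locations, mdt_locations, snapshot_instant):
    """Compare only entries whose commit time <= the shared snapshot."""
    fs = {k: v for k, (v, t) in fs_locations.items() if t <= snapshot_instant}
    mdt = {k: v for k, (v, t) in mdt_locations.items() if t <= snapshot_instant}
    return fs == mdt

fs = {"k1": ("f1", 10)}
# MDT was polled later and already saw commit 20 adding k2.
mdt = {"k1": ("f1", 10), "k2": ("f2", 20)}

assert not validate_rli(fs, mdt, snapshot_instant=20)  # raw compare: false positive
assert validate_rli(fs, mdt, snapshot_instant=10)      # snapshot-bounded: passes
```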





[jira] [Assigned] (HUDI-7687) Instant should not be archived until replaced file groups or older file versions are deleted

2024-04-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7687:
-

Assignee: sivabalan narayanan

> Instant should not be archived until replaced file groups or older file 
> versions are deleted
> 
>
> Key: HUDI-7687
> URL: https://issues.apache.org/jira/browse/HUDI-7687
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: archive, clean
>
> When archival runs, it may consider an instant as a candidate for archival 
> even if the file groups that said instant replaced/updated still need to 
> undergo a `clean`. For example, consider the following scenario, with clean 
> and archival scheduled/executed independently in different jobs:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some more instants are added to timeline. RC2 is now eligible to be 
> cleaned (as per the table writer's clean policy). Assume though that the file 
> groups replaced by RC2 haven't been deleted yet, such as due to clean 
> repeatedly failing, async clean not being scheduled yet, or the clean failing 
> to delete said file groups.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> Now the table has the same consistency issue as seen in 
> https://issues.apache.org/jira/browse/HUDI-7655 , where replaced file groups 
> are still in partition and readers may see inconsistent data. 
>  
> This situation can be avoided by ensuring that archival will "block" and not 
> go past an older instant time if it sees that said instant didn't undergo a 
> clean yet. 
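The proposed guard can be sketched as follows (an illustrative model in plain Python, with invented names, not Hudi's archival code): archival stops at the first instant whose replaced file groups have not yet been cleaned.

```python
# Illustrative sketch (names are hypothetical, not Hudi's API): archival
# should not advance past an instant whose replaced file groups have not
# yet been cleaned, otherwise readers lose the replacecommit metadata that
# tells them to ignore the stale file group.

def archivable_instants(instants, cleaned):
    """Return the prefix of instants that archival may safely remove.

    `instants` is an ordered list of (time, action, needs_clean) tuples;
    archival blocks at the first instant that still needs a clean.
    """
    result = []
    for time, action, needs_clean in instants:
        if needs_clean and time not in cleaned:
            break  # block: replaced file groups not yet deleted
        result.append(time)
    return result

timeline = [(1, "commit", False), (2, "replacecommit", True), (3, "commit", False)]

# RC2's replaced file groups were never cleaned, so archival stops at C1.
assert archivable_instants(timeline, cleaned=set()) == [1]
# Once RC2 is cleaned, everything up to the retained window may be archived.
assert archivable_instants(timeline, cleaned={2}) == [1, 2, 3]
```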





[jira] [Assigned] (HUDI-7655) Support configuration for clean to fail execution if at least one file is marked as a failed delete

2024-04-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7655:
-

Assignee: sivabalan narayanan

> Support configuration for clean to fail execution if at least one file is 
> marked as a failed delete
> 
>
> Key: HUDI-7655
> URL: https://issues.apache.org/jira/browse/HUDI-7655
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: clean
>
> When a HUDI clean plan is executed, any targeted file that was not confirmed 
> as deleted (or non-existing) will be marked as a "failed delete". Although 
> these failed deletes will be added to `.clean` metadata, if incremental clean 
> is used then these files might not ever be picked up again as a future clean 
> plan, unless a "full-scan" clean ends up being scheduled. In addition to 
> leading to more files unnecessarily taking up storage space for longer, this 
> can lead to the following dataset consistency issue for COW datasets:
>  # Insert at C1 creates file group f1 in partition
>  # Replacecommit at RC2 creates file group f2 in partition, and replaces f1
>  # Any reader of partition that calls HUDI API (with or without using MDT) 
> will recognize that f1 should be ignored, as it has been replaced. This is 
> since RC2 instant file is in active timeline
>  # Some completed instants later, an incremental clean is scheduled. It moves 
> the "earliest commit to retain" to a time after instant time RC2, so it 
> targets f1 for deletion. But during execution of the plan, it fails to delete 
> f1.
>  # An archive job eventually is triggered, and archives C1 and RC2. Note that 
> f1 is still in partition
> At this point, any job/query that reads the aforementioned partition via 
> direct DFS file system calls (without using the MDT FILES partition) 
> will consider both f1 and f2 as valid file groups, since RC2 is no longer in 
> active timeline. This is a data consistency issue, and will only be resolved 
> if a "full-scan" clean is triggered and deletes f1.
> This specific scenario can be avoided if the user can configure HUDI clean to 
> fail execution of a clean plan unless all files are confirmed as deleted (or 
> not existing in DFS already), "blocking" the clean. The next clean attempt 
> will re-execute this existing plan, since clean plans cannot be "rolled 
> back". 
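The proposed behavior can be sketched as follows (a hedged illustration; the flag and exception names are invented, not Hudi's config surface): in strict mode the clean raises instead of committing with failed deletes, so the same plan is retried next time.

```python
# Hedged sketch of the proposed config (names invented for illustration):
# when `fail_on_failed_delete` is set, clean raises instead of committing
# a .clean with failed deletes, so the same plan is re-executed on the
# next clean attempt.

class CleanFailedError(Exception):
    pass

def execute_clean(plan_files, delete_fn, fail_on_failed_delete=False):
    """Try to delete every file in the plan; report or fail on misses."""
    failed = [f for f in plan_files if not delete_fn(f)]
    if failed and fail_on_failed_delete:
        raise CleanFailedError(f"failed deletes: {failed}")
    return failed  # recorded in .clean metadata in the lenient mode

flaky = {"f1": False, "f2": True}  # f1's delete fails

# Lenient (current) behavior: failed delete is only recorded.
assert execute_clean(["f1", "f2"], flaky.get) == ["f1"]

# Strict behavior: the clean is "blocked" and the plan stays pending.
try:
    execute_clean(["f1", "f2"], flaky.get, fail_on_failed_delete=True)
    assert False, "expected CleanFailedError"
except CleanFailedError:
    pass
```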





[jira] [Created] (HUDI-7673) Enhance RLI validation w/ MDT validator for false positives

2024-04-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7673:
-

 Summary: Enhance RLI validation w/ MDT validator for false 
positives
 Key: HUDI-7673
 URL: https://issues.apache.org/jira/browse/HUDI-7673
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


There is a chance that we could see false positive failures w/ MDT validation 
when RLI is validated. 

 

When FS-based record key locations are polled, we could have a pending commit, 
and when MDT is polled for record locations, the commit could have been 
completed. As a result, RLI validation could return additional record locations.

 





[jira] [Created] (HUDI-7659) Update 0.14.0 release docs to call out that row writer w/ clustering is enabled by default

2024-04-23 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7659:
-

 Summary: Update 0.14.0 release docs to call out that row writer w/ 
clustering is enabled by default
 Key: HUDI-7659
 URL: https://issues.apache.org/jira/browse/HUDI-7659
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: sivabalan narayanan


Update 0.14.0 release docs to call out that row writer w/ clustering is enabled 
by default

 





[jira] [Created] (HUDI-7645) Optimize BQ sync tool for MDT

2024-04-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7645:
-

 Summary: Optimize BQ sync tool for MDT
 Key: HUDI-7645
 URL: https://issues.apache.org/jira/browse/HUDI-7645
 Project: Apache Hudi
  Issue Type: Improvement
  Components: meta-sync
Reporter: sivabalan narayanan


Looks like in BQ sync, we are polling the file system view for the latest 
files sequentially for every partition. 

 

When MDT is enabled, we could load all partitions in one call. 
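The optimization can be sketched as follows (plain Python with invented function names, not the BQ sync tool's API): batch all partitions into a single metadata-table lookup instead of one call per partition.

```python
# Illustrative sketch of the optimization: instead of one file-system-view
# call per partition, batch all partitions into a single metadata-table
# lookup. Function names are hypothetical, not the BQ sync tool's API.

def latest_files_sequential(partitions, fetch_one):
    """One lookup per partition (current behavior)."""
    return {p: fetch_one(p) for p in partitions}

def latest_files_batched(partitions, fetch_all):
    """A single bulk call when the metadata table is enabled."""
    return fetch_all(partitions)

calls = []

def fetch_one(p):
    calls.append(p)
    return [f"{p}/base.parquet"]

def fetch_all(ps):
    calls.append(tuple(ps))
    return {p: [f"{p}/base.parquet"] for p in ps}

parts = ["p1", "p2", "p3"]

# Same result either way, but three lookups collapse into one.
assert latest_files_sequential(parts, fetch_one) == latest_files_batched(parts, fetch_all)
assert calls == ["p1", "p2", "p3", ("p1", "p2", "p3")]
```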

 





[jira] [Updated] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7644:
--
Fix Version/s: 1.0.0

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> Add record key info with RLI validation in MDT Validator





[jira] [Updated] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7644:
--
Fix Version/s: 0.15.0

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.15.0
>
>
> Add record key info with RLI validation in MDT Validator





[jira] [Assigned] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7644:
-

Assignee: sivabalan narayanan

> Add record key info with RLI validation in MDT Validator
> 
>
> Key: HUDI-7644
> URL: https://issues.apache.org/jira/browse/HUDI-7644
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Add record key info with RLI validation in MDT Validator





[jira] [Created] (HUDI-7644) Add record key info with RLI validation in MDT Validator

2024-04-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7644:
-

 Summary: Add record key info with RLI validation in MDT Validator
 Key: HUDI-7644
 URL: https://issues.apache.org/jira/browse/HUDI-7644
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata, tests-ci
Reporter: sivabalan narayanan


Add record key info with RLI validation in MDT Validator





[jira] [Updated] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7641:
--
Fix Version/s: 0.15.0

> Add metrics to track what partitions are enabled in MDT
> ---
>
> Key: HUDI-7641
> URL: https://issues.apache.org/jira/browse/HUDI-7641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>






[jira] [Assigned] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7641:
-

Assignee: sivabalan narayanan

> Add metrics to track what partitions are enabled in MDT
> ---
>
> Key: HUDI-7641
> URL: https://issues.apache.org/jira/browse/HUDI-7641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7641:
-

 Summary: Add metrics to track what partitions are enabled in MDT
 Key: HUDI-7641
 URL: https://issues.apache.org/jira/browse/HUDI-7641
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan








[jira] [Created] (HUDI-7624) Fix index lookup duration to track tag location duration

2024-04-16 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7624:
-

 Summary: Fix index lookup duration to track tag location duration
 Key: HUDI-7624
 URL: https://issues.apache.org/jira/browse/HUDI-7624
 Project: Apache Hudi
  Issue Type: Bug
  Components: index
Reporter: sivabalan narayanan


With Spark's lazy evaluation, we can't simply start a timer before the 
tagLocation call and end it when the call returns; that may not give us the 
right value for the tag location duration. So, we need to fix how the duration 
is measured.





[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 1.0.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then when Job 1 runs before Job 2 it can create a compaction plan for all 
> instant times (up to (x)) that doesn't include instant time (x-1). Later, Job 
> 2 will create instant time (x-1), but the timeline will be in a corrupted 
> state since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job 2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant from Job 2 
> at (x-1) to complete. This causes Job 2 to be "skipped" by the clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.
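The final proposal above can be sketched as follows (an illustrative model in plain Python; the class and exception names are invented, not Hudi's API): the timestamp check and the .requested creation happen atomically under the table lock, and any operation that is not the latest aborts with a retryable conflict.

```python
# Illustrative sketch of the proposal: under the table lock, an operation
# may create its .requested file only if its timestamp is greater than
# every instant already on the timeline; otherwise it raises a retryable
# conflict exception. Names are invented for illustration.

import threading

class RetryableConflictError(Exception):
    pass

class Timeline:
    def __init__(self):
        self._lock = threading.Lock()
        self._instants = []

    def create_requested(self, ts):
        with self._lock:  # timestamp check and file creation are atomic
            if self._instants and ts <= max(self._instants):
                raise RetryableConflictError(f"{ts} is not the latest instant")
            self._instants.append(ts)
            return ts

tl = Timeline()
tl.create_requested(100)  # Job 1 schedules at (x)
try:
    tl.create_requested(99)  # Job 2 chose the smaller timestamp: must abort
    assert False, "expected RetryableConflictError"
except RetryableConflictError:
    pass
assert tl.create_requested(101) == 101  # Job 2 retries with a fresh timestamp
```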





[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 0.15.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then when Job 1 runs before Job 2 it can create a compaction plan for all 
> instant times (up to (x)) that doesn't include instant time (x-1). Later, Job 
> 2 will create instant time (x-1), but the timeline will be in a corrupted 
> state since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job 2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant from Job 2 
> at (x-1) to complete. This causes Job 2 to be "skipped" by the clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.





[jira] [Resolved] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-4699.
---

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which do not 
> require primary-key-based data modelling.





[jira] [Closed] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4699.
-
Fix Version/s: 0.14.0
   Resolution: Fixed

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which do not 
> require primary-key-based data modelling.





[jira] [Reopened] (HUDI-4699) Primary key-less data model

2024-04-01 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reopened HUDI-4699:
---
Assignee: sivabalan narayanan

> Primary key-less data model
> ---
>
> Key: HUDI-4699
> URL: https://issues.apache.org/jira/browse/HUDI-4699
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Hudi requires users to specify a primary key field. Can we do away with this 
> requirement? This epic tracks the work to support use cases which do not 
> require primary-key-based data modelling.





[jira] [Created] (HUDI-7556) Fix MDT validator to account for additional partitions in MDT

2024-03-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7556:
-

 Summary: Fix MDT validator to account for additional partitions in 
MDT
 Key: HUDI-7556
 URL: https://issues.apache.org/jira/browse/HUDI-7556
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


There is a chance that MDT could list additional partitions when compared to 
FS-based listing. 

The reason is: 

We load the active timeline from the meta client and poll FS-based listing for 
completed commits. And then we poll MDT for the list of all partitions. In 
between these two, a commit could have been completed, and hence MDT could be 
serving that as well. So, let's account for that in our validation tool. 
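The accounting can be sketched as follows (a hedged illustration in plain Python, with invented names, not the MDT validator's API): extra MDT partitions are fine if they are explained by commits that completed between the two polls.

```python
# Hedged sketch of the fix: MDT may legitimately list partitions from a
# commit that completed after the FS listing was taken, so the validator
# should only flag extras that are NOT explained by such newer commits.

def validate_partitions(fs_parts, mdt_parts, parts_from_newer_commits):
    """Extras are OK if a commit completed between the two polls added them."""
    extras = set(mdt_parts) - set(fs_parts)
    unexplained = extras - set(parts_from_newer_commits)
    return not unexplained

fs = {"2024/01", "2024/02"}
mdt = {"2024/01", "2024/02", "2024/03"}

# Without accounting for the in-between commit: false positive.
assert not validate_partitions(fs, mdt, parts_from_newer_commits=set())
# The new partition came from a commit completed after the FS poll: pass.
assert validate_partitions(fs, mdt, parts_from_newer_commits={"2024/03"})
```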





[jira] [Created] (HUDI-7549) Data inconsistency issue w/ spurious log block detection

2024-03-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7549:
-

 Summary: Data inconsistency issue w/ spurious log block detection
 Key: HUDI-7549
 URL: https://issues.apache.org/jira/browse/HUDI-7549
 Project: Apache Hudi
  Issue Type: Bug
  Components: reader-core
Reporter: sivabalan narayanan


We added support to detect spurious log blocks in the log block reader 

[https://github.com/apache/hudi/pull/9545]

[https://github.com/apache/hudi/pull/9611] 

in 0.14.0. 

Apparently, there are some cases where it could lead to data loss or data 
consistency issues. 

[https://github.com/apache/hudi/pull/9611#issuecomment-2016687160] 





[jira] [Created] (HUDI-7532) Fix schedule compact to only consider DCs after last compaction commit

2024-03-22 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7532:
-

 Summary: Fix schedule compact to only consider DCs after last 
compaction commit 
 Key: HUDI-7532
 URL: https://issues.apache.org/jira/browse/HUDI-7532
 Project: Apache Hudi
  Issue Type: Bug
  Components: compaction
Reporter: sivabalan narayanan


Fix schedule compact to only consider delta commits (DCs) after the last 
compaction commit. As of now, it also considers replace commits. 

 





[jira] [Created] (HUDI-7528) Fix RowCustomColumnsSortPartitioner to use repartition instead of coalesce

2024-03-22 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7528:
-

 Summary: Fix RowCustomColumnsSortPartitioner to use repartition 
instead of coalesce
 Key: HUDI-7528
 URL: https://issues.apache.org/jira/browse/HUDI-7528
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: sivabalan narayanan


Fix RowCustomColumnsSortPartitioner to use repartition instead of coalesce





[jira] [Created] (HUDI-7526) Fix constructors for all bulk insert sort partitioners to ensure we could use it as user defined partitioners

2024-03-21 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7526:
-

 Summary: Fix constructors for all bulk insert sort partitioners to 
ensure we could use it as user defined partitioners 
 Key: HUDI-7526
 URL: https://issues.apache.org/jira/browse/HUDI-7526
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: sivabalan narayanan


Our constructor for user-defined sort partitioners takes in the write config, 
while some of the partitioners used in the out-of-the-box sort modes do not 
account for it. 

 

Let's fix the sort partitioners to ensure any of them can be used as 
user-defined partitioners. 

For e.g., NoneSortMode does not have a constructor that takes in the write 
config. 
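One possible shape of the fix can be sketched as follows (Python stands in for the Java reflection involved; class and function names are invented for illustration, not Hudi's API): the loader prefers the write-config constructor and falls back to the no-arg one, so built-in partitioners without a config constructor still work when plugged in as user-defined.

```python
# Illustrative sketch (not Hudi's API): instantiate a user-defined
# partitioner whether or not it exposes the write-config constructor, so
# built-in partitioners like a none-sort mode can also be plugged in as
# user-defined ones.

import inspect

class WriteConfig:
    pass

class ConfigAwarePartitioner:
    def __init__(self, config):
        self.config = config

class NoneSortModePartitioner:
    def __init__(self):  # no write-config constructor today
        self.config = None

def load_partitioner(cls, config):
    """Prefer the (config) constructor, fall back to the no-arg one."""
    params = inspect.signature(cls.__init__).parameters
    if len(params) > 1:  # parameters beyond `self`
        return cls(config)
    return cls()

cfg = WriteConfig()
assert load_partitioner(ConfigAwarePartitioner, cfg).config is cfg
assert load_partitioner(NoneSortModePartitioner, cfg).config is None
```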





[jira] [Comment Edited] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828123#comment-17828123
 ] 

sivabalan narayanan edited comment on HUDI-7507 at 3/19/24 1:12 AM:


Just trying to replay the same scenario for the data table: conflict resolution 
could have aborted Job 2, and hence we may not hit the same issue. 


was (Author: shivnarayan):
Just trying to replay the same scenario for data table, conflict resolution 
could have aborted job2. and hence we may not hit the same issue. 

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Krishen Bhan
>Priority: Major
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, 
> then when Job 1 runs before Job 2 it can create a compaction plan for all 
> instant times (up to (x)) that doesn't include instant time (x-1). Later, Job 
> 2 will create instant time (x-1), but the timeline will be in a corrupted 
> state since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job 2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant from Job 2 
> at (x-1) to complete. This causes Job 2 to be "skipped" by the clean.
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> An alternate approach (suggested by [~pwason] ) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.
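The lock-based scheduling steps above (acquire the table lock, read the latest instant C, generate a timestamp greater than C, create the requested file, release the lock) can be sketched as follows. `scheduleInstant` and the in-memory timeline below are hypothetical stand-ins for Hudi's actual lock provider and timeline APIs, not the real implementation.

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

public class Main {
    private static final ReentrantLock tableLock = new ReentrantLock();
    // Stand-in for the active timeline: the set of instant times already scheduled.
    private static final TreeSet<Long> timeline = new TreeSet<>();

    // Steps 1-4 from the description: acquire lock, read latest instant C
    // (completed or not), generate a timestamp strictly greater than C,
    // create the requested entry, release lock.
    static long scheduleInstant() {
        tableLock.lock();
        try {
            long latest = timeline.isEmpty() ? 0L : timeline.last(); // latest instant C
            long ts = latest + 1;                                    // strictly greater than C
            timeline.add(ts);                                        // "create the requested file"
            return ts;
        } finally {
            tableLock.unlock();
        }
    }

    public static void main(String[] args) {
        long a = scheduleInstant();
        long b = scheduleInstant();
        // No writer can end up holding a smaller timestamp than an
        // already-scheduled instant, avoiding the (x)/(x-1) race above.
        assert b > a;
    }
}
```

Because timestamp generation and requested-file creation happen in one critical section, the Job 2 in the scenario above could never pick (x-1) after Job 1 scheduled (x).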



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828142#comment-17828142
 ] 

sivabalan narayanan commented on HUDI-7507:
---

We have already fixed this in the latest master (1.0) by generating new commit 
times under locks. That should solve the issue. We can apply the same to the 
0.x branch. 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-03-18 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828123#comment-17828123
 ] 

sivabalan narayanan commented on HUDI-7507:
---

Just trying to replay the same scenario for the data table: conflict resolution 
could have aborted job 2, and hence we may not hit the same issue. 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7511) Offset range calculation in kafka should return all topic partitions

2024-03-17 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7511:
-

 Summary: Offset range calculation in kafka should return all topic 
partitions 
 Key: HUDI-7511
 URL: https://issues.apache.org/jira/browse/HUDI-7511
 Project: Apache Hudi
  Issue Type: Bug
  Components: deltastreamer
Reporter: sivabalan narayanan


After [https://github.com/apache/hudi/pull/10869] landed, we are not returning 
every topic partition in the final ranges. But for checkpointing purposes, we 
need every Kafka topic partition in the final ranges even if we are not 
consuming anything from it. 
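A minimal sketch of what the checkpointing requirement implies: every partition keeps an entry in the final ranges, with an empty range when nothing is consumed. `finalRanges`, `committed`, and `toConsumeUpTo` are hypothetical names for illustration, not Hudi's actual KafkaOffsetGen API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Main {
    // For each topic partition, build a [from, to) offset range. Partitions
    // with no new data to consume still get an empty range [committed,
    // committed) so the checkpoint covers every partition.
    static Map<String, long[]> finalRanges(Map<String, Long> committed, Map<String, Long> toConsumeUpTo) {
        Map<String, long[]> ranges = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : committed.entrySet()) {
            long from = e.getValue();
            long to = toConsumeUpTo.getOrDefault(e.getKey(), from); // empty range if not consuming
            ranges.put(e.getKey(), new long[] {from, to});
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new LinkedHashMap<>();
        committed.put("topic-0", 100L);
        committed.put("topic-1", 50L);
        Map<String, Long> target = new LinkedHashMap<>();
        target.put("topic-0", 120L); // only topic-0 has new data
        Map<String, long[]> ranges = finalRanges(committed, target);
        assert ranges.size() == 2;                                   // topic-1 is still present
        assert ranges.get("topic-1")[0] == ranges.get("topic-1")[1]; // as an empty range
    }
}
```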

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7491) Handle null extra metadata w/ clean commit metadata

2024-03-07 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7491:
-

 Summary: Handle null extra metadata w/ clean commit metadata
 Key: HUDI-7491
 URL: https://issues.apache.org/jira/browse/HUDI-7491
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


[https://github.com/apache/hudi/pull/10651/] 

 

After this fix, older clean commits may not have any extra metadata. We need to 
handle null for the entire map. 
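A minimal null-safe sketch of the required handling, assuming the clean commit metadata exposes the extra-metadata map directly (the helper name below is hypothetical):

```java
import java.util.Collections;
import java.util.Map;

public class Main {
    // Older clean commits may carry no extra metadata at all, so the whole
    // map can be null, not just individual keys. Normalize null to an empty
    // map before any lookups (sketch; not the actual Hudi accessor).
    static Map<String, String> safeExtraMetadata(Map<String, String> extraMetadata) {
        return extraMetadata == null ? Collections.emptyMap() : extraMetadata;
    }

    public static void main(String[] args) {
        assert safeExtraMetadata(null).isEmpty();
        assert safeExtraMetadata(Collections.singletonMap("k", "v")).containsKey("k");
    }
}
```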

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7490:
--
Description: 
We recently added a fix so that the cleaner takes care of cleaning up 
savepointed files too, without fail: 

[https://github.com/apache/hudi/pull/10651] 

Scenario the above patch fixes:

By default, incremental cleaning is enabled. During planning, the cleaner will 
only account for partitions touched in recent commits (after the earliest 
commit to retain from the last completed clean). 

So, if a savepoint is added and removed later on, the cleaner might miss 
cleaning those files. The above patch fixed that gap. 

Fix: clean commit metadata tracks savepointed commits. The next time the clean 
planner runs, we compare the tracked savepointed commits against the current 
savepoints on the timeline, and if there is a difference, the cleaner will 
account for partitions touched by the removed savepointed commits.  

 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

But there is a chance we could expose duplicate data to readers in the 
scenario below. 

 

Let's say we have a savepoint at t5.commit, so the cleaner skipped deleting the 
files created at t5 and went past it. And say we have a replace commit at t10 
which replaced all data files that were created at t5. 

With this state, say we removed the savepoint. 

We will have data files created by t5.commit in the data directory. 

As long as t10 is in the active timeline, readers will only see files written 
by t10 and will ignore files written by t5. 

At this juncture, if we run archival (without the cleaner), archival might 
archive t5 to t10, in which case data files written by both t5 and t10 will be 
exposed to readers. 

In the most common deployment models, where we recommend stopping the pipeline 
while doing a savepoint and restore or deleting a savepoint, this might be 
uncommon. But there is a chance it could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure that before archiving a replace commit, the 
fileIds it replaced have been cleaned up by the cleaner. 

 

Probable fix:

We can follow a similar approach to the one we followed in 
[https://github.com/apache/hudi/pull/10651]. 

Essentially, check the list of savepoints in the current timeline and compare 
it with the savepointed instants in the latest clean commit metadata. If they 
match, we do not need to block archival. But if there is a difference (which 
means a savepoint was deleted from the timeline and the cleaner has not had a 
chance to clean up yet), we should punt on archiving anything and come back 
next time. 
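The probable fix reduces to a guard like the following before archival runs; `canArchive` and the instant-time strings are hypothetical illustrations of the comparison, not the actual patch.

```java
import java.util.Collections;
import java.util.Set;

public class Main {
    // Archival may proceed only when the savepoints tracked in the latest
    // clean commit metadata match the savepoints currently on the timeline.
    // Any difference means a savepoint was deleted and the cleaner has not
    // caught up yet, so archival must be skipped this round.
    static boolean canArchive(Set<String> trackedByLastClean, Set<String> currentOnTimeline) {
        return trackedByLastClean.equals(currentOnTimeline);
    }

    public static void main(String[] args) {
        Set<String> tracked = Collections.singleton("t5");
        assert canArchive(tracked, Collections.singleton("t5"));   // in sync: archive
        assert !canArchive(tracked, Collections.<String>emptySet()); // t5 deleted: punt archival
    }
}
```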

 

 

 

 

  was:
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

In most common deployment models, where we recommend to stop the pipeline while 
doing savepoint and restore or deleting savepoint, this might be uncommon. but 
there is a chance that this could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 


> Fix archival guarding data files not yet cleaned up by cleaner when savepoint 
> is removed
> 
>
> Key: HUDI-7490
> URL: https://issues.apache.org/jira/browse/HUDI-7490
> Project: Apache Hudi
>  

[jira] [Updated] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7490:
--
Description: 
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

In most common deployment models, where we recommend to stop the pipeline while 
doing savepoint and restore or deleting savepoint, this might be uncommon. but 
there is a chance that this could happen. 

 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 

  was:
We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 


> Fix archival guarding data files not yet cleaned up by cleaner when savepoint 
> is removed
> 
>
> Key: HUDI-7490
> URL: https://issues.apache.org/jira/browse/HUDI-7490
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: archiving, cleaning, clustering
>Reporter: sivabalan narayanan
>Priority: Major
>
> We added a fix recently where cleaner will take care of cleaning up 
> savepointed files too w/o fail with 
> [https://github.com/apache/hudi/pull/10651] 
>  
> But we might have a gap wrt archival. 
> If we ensure archival will run just after cleaning and not independently, we 
> should be good.
> but if there is a chance we could expose duplicate data to readers w/ below 
> scenario. 
>  
> lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
> files created at t5 and went past it. and say we have a replace commit at t10 
> which replaced all data files that were created at t5. 
> w/ this state, say we removed the savepoint. 
> we will have data files created by t5.commit in 

[jira] [Created] (HUDI-7490) Fix archival guarding data files not yet cleaned up by cleaner when savepoint is removed

2024-03-07 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7490:
-

 Summary: Fix archival guarding data files not yet cleaned up by 
cleaner when savepoint is removed
 Key: HUDI-7490
 URL: https://issues.apache.org/jira/browse/HUDI-7490
 Project: Apache Hudi
  Issue Type: Bug
  Components: archiving, cleaning, clustering
Reporter: sivabalan narayanan


We added a fix recently where cleaner will take care of cleaning up savepointed 
files too w/o fail with 

[https://github.com/apache/hudi/pull/10651] 

 

But we might have a gap wrt archival. 

If we ensure archival will run just after cleaning and not independently, we 
should be good.

but if there is a chance we could expose duplicate data to readers w/ below 
scenario. 

 

lets say we have a savepoint at t5.commit. So, cleaner skipped to delete the 
files created at t5 and went past it. and say we have a replace commit at t10 
which replaced all data files that were created at t5. 

w/ this state, say we removed the savepoint. 

we will have data files created by t5.commit in data directory. 

as long as t10 is in active timeline, readers will only see files written by 
t10 and will ignore files written by t5. 

at this juncture, if we run archival (w/o cleaner), archival might archive t5 
to t10. on which case both data files written by t5 and t10 will be exposed to 
readers. 

So, we have to guard the archival in this case. 

Essentially, we need to ensure before archiving a replace commit, the fileIds 
that were replaced are cleaned by the cleaner. 

 

Probable fix:

We can follow similar approach we followed in 
[https://github.com/apache/hudi/pull/10651]  . 

Essentially check for list of savepoints in current timeline and compare it w/ 
savepointed instants in latest clean commit metadata. If they match, we do not 
need to block archival. but if there is a difference (which means a savepoint 
was deleted in timeline and cleaner has not got a chance to cleanup yet), we 
should punt archiving anything and come back next time. 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7478) Fix max delta commits guard check w/ MDT

2024-03-04 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7478:
-

 Summary: Fix max delta commits guard check w/ MDT 
 Key: HUDI-7478
 URL: https://issues.apache.org/jira/browse/HUDI-7478
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


protected static void checkNumDeltaCommits(HoodieTableMetaClient metaClient, int maxNumDeltaCommitsWhenPending) {
  final HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline();
  Option<HoodieInstant> lastCompaction = activeTimeline.filterCompletedInstants()
      .filter(s -> s.getAction().equals(COMPACTION_ACTION)).lastInstant();
  int numDeltaCommits = lastCompaction.isPresent()
      ? activeTimeline.getDeltaCommitTimeline().findInstantsAfter(lastCompaction.get().getTimestamp()).countInstants()
      : activeTimeline.getDeltaCommitTimeline().countInstants();
  if (numDeltaCommits > maxNumDeltaCommitsWhenPending) {
    throw new HoodieMetadataException(String.format("Metadata table's deltacommits exceeded %d: "
        + "this is likely caused by a pending instant in the data table. Resolve the pending instant "
        + "or adjust `%s`, then restart the pipeline.",
        maxNumDeltaCommitsWhenPending, HoodieMetadataConfig.METADATA_MAX_NUM_DELTACOMMITS_WHEN_PENDING.key()));
  }
}






Here we account for the action type "compaction". But a completed compaction 
instant will have "commit" as its action. So, we need to fix it. 
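A minimal model of the fix: look up the last completed compaction by the "commit" action it carries once completed, rather than by COMPACTION_ACTION. The `Instant` class and `lastCompletedCompaction` helper below are illustrative stand-ins, not the actual Hudi patch.

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    static class Instant {
        final String action; final boolean completed; final String ts;
        Instant(String action, boolean completed, String ts) {
            this.action = action; this.completed = completed; this.ts = ts;
        }
    }

    // A compaction carries the "compaction" action only while pending; once
    // completed it appears on the timeline with the "commit" action, so the
    // last completed compaction must be found via "commit".
    static Instant lastCompletedCompaction(List<Instant> timeline) {
        Instant last = null;
        for (Instant i : timeline) {
            if (i.completed && i.action.equals("commit")) {
                last = i;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        List<Instant> timeline = new ArrayList<>();
        timeline.add(new Instant("deltacommit", true, "001"));
        timeline.add(new Instant("commit", true, "002")); // the completed compaction
        timeline.add(new Instant("deltacommit", true, "003"));
        // Filtering on "compaction" here would find nothing; "commit" finds it.
        assert lastCompletedCompaction(timeline).ts.equals("002");
    }
}
```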

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7460) Fix compaction schedule with pending delta commits

2024-02-29 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7460:
-

 Summary: Fix compaction schedule with pending delta commits
 Key: HUDI-7460
 URL: https://issues.apache.org/jira/browse/HUDI-7460
 Project: Apache Hudi
  Issue Type: Improvement
  Components: compaction
Reporter: sivabalan narayanan


Hudi has a constraint that a compaction can be scheduled only if there are no 
pending delta commits whose instant time is less than the compaction instant 
being scheduled. We were throwing an exception when this condition is not met. 
We should fix the user-facing behavior here so that we do not throw an 
exception and instead return an empty plan when this condition is not met.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7429) Fix avg record size estimation for delta commits and replace commits

2024-02-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7429:
-

Assignee: sivabalan narayanan

> Fix avg record size estimation for delta commits and replace commits
> 
>
> Key: HUDI-7429
> URL: https://issues.apache.org/jira/browse/HUDI-7429
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Avg record size calculation only considers COMMIT for now. Let's fix it to 
> include delta commits and replace commits as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7429) Fix avg record size estimation for delta commits and replace commits

2024-02-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7429:
-

 Summary: Fix avg record size estimation for delta commits and 
replace commits
 Key: HUDI-7429
 URL: https://issues.apache.org/jira/browse/HUDI-7429
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


Avg record size calculation only considers COMMIT for now. Let's fix it to 
include delta commits and replace commits as well.
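A sketch of the intended estimation, aggregating bytes and records across all write commit types (commit, deltacommit, replacecommit) instead of COMMIT alone; the arrays and helper below are illustrative, not the actual Hudi estimator.

```java
public class Main {
    // Estimate avg record size as total bytes written / total records
    // written across all considered commits, regardless of commit type.
    static long avgRecordSize(long[] bytesWritten, long[] recordsWritten) {
        long totalBytes = 0, totalRecords = 0;
        for (int i = 0; i < bytesWritten.length; i++) {
            totalBytes += bytesWritten[i];
            totalRecords += recordsWritten[i];
        }
        return totalRecords == 0 ? 0 : totalBytes / totalRecords;
    }

    public static void main(String[] args) {
        // Illustrative stats from one commit, one deltacommit, one replacecommit.
        assert avgRecordSize(new long[] {1000, 500, 1500}, new long[] {10, 5, 15}) == 100;
        // No records written yet: avoid divide-by-zero.
        assert avgRecordSize(new long[] {}, new long[] {}) == 0;
    }
}
```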



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-02-13 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7407:
-

Assignee: sivabalan narayanan

> Add optional clean support to standalone compaction and clustering jobs
> ---
>
> Key: HUDI-7407
> URL: https://issues.apache.org/jira/browse/HUDI-7407
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Let's add a top-level config to the standalone compaction and clustering jobs 
> to optionally run clean. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7407) Add optional clean support to standalone compaction and clustering jobs

2024-02-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7407:
-

 Summary: Add optional clean support to standalone compaction and 
clustering jobs
 Key: HUDI-7407
 URL: https://issues.apache.org/jira/browse/HUDI-7407
 Project: Apache Hudi
  Issue Type: Improvement
  Components: table-service
Reporter: sivabalan narayanan


Let's add a top-level config to the standalone compaction and clustering jobs 
to optionally run clean. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7397) Add support to purge a clustering instant

2024-02-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7397:
-

 Summary: Add support to purge a clustering instant
 Key: HUDI-7397
 URL: https://issues.apache.org/jira/browse/HUDI-7397
 Project: Apache Hudi
  Issue Type: Improvement
  Components: clustering
Reporter: sivabalan narayanan


As of now, if a user made a mistake in the clustering params and wishes to 
completely purge a pending clustering, we do not have any support for that. It 
would be good to add such support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7320) hive-sync unexpectedly loads archived timeline

2024-01-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812499#comment-17812499
 ] 

sivabalan narayanan commented on HUDI-7320:
---

We did fix something along these lines already. Can you check if it's 
reproducible with 0.14.0 as well? 

 

> hive-sync unexpectedly loads archived timeline
> --
>
> Key: HUDI-7320
> URL: https://issues.apache.org/jira/browse/HUDI-7320
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Affects Versions: 0.13.1
>Reporter: Raymond Xu
>Priority: Critical
> Attachments: Screenshot 2024-01-16 at 5.49.25 PM.png, Screenshot 
> 2024-01-16 at 5.49.30 PM.png
>
>
> Investigation shows that the hive-sync step loaded the archived timeline and 
> caused a long delay in the overall write process. And full scan for changes 
> in all partitions is not used. Need to dig further.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7331) Test and certify col stats integration with MOR table

2024-01-24 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7331:
-

 Summary: Test and certify col stats integration with MOR table
 Key: HUDI-7331
 URL: https://issues.apache.org/jira/browse/HUDI-7331
 Project: Apache Hudi
  Issue Type: Bug
  Components: metadata
Reporter: sivabalan narayanan


Let's test and certify col stats integration with MOR tables for all operations.

For example, any write operation (bulk insert, insert, upsert, insert 
overwrite) should add new entries to the col stats index in the metadata table. 

Rollback: 

entries for data files that were deleted should be removed from col stats; 

for log files added, we should add new entries to col stats. 

Clean: 

any files deleted (data files and log files) should have their entries removed 
from col stats in the MDT. 

Similarly, let's do the same exercise with delete partition and the other 
operations we have in Hudi. 
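The expected bookkeeping can be sketched with an in-memory stand-in for the col stats partition; `onFileAdded`/`onFileDeleted` are hypothetical hooks showing which events should add or remove entries, not Hudi's actual metadata writer.

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Stand-in for the col stats partition: file name -> {min, max} of a column.
    static final Map<String, long[]> colStats = new HashMap<>();

    // Writes (and log files added by rollback) add per-file entries.
    static void onFileAdded(String file, long min, long max) {
        colStats.put(file, new long[] {min, max});
    }

    // Rollback and clean must remove entries for files they delete.
    static void onFileDeleted(String file) {
        colStats.remove(file);
    }

    public static void main(String[] args) {
        onFileAdded("f1.parquet", 1, 9);  // upsert wrote a base file
        onFileAdded("f1.log", 2, 12);     // a later write added a log file
        onFileDeleted("f1.parquet");      // clean deleted the base file
        assert !colStats.containsKey("f1.parquet"); // stale entry gone
        assert colStats.containsKey("f1.log");      // live file still indexed
    }
}
```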

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7272) Cut docs for 0.14.1

2024-01-03 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7272:
-

 Summary: Cut docs for 0.14.1
 Key: HUDI-7272
 URL: https://issues.apache.org/jira/browse/HUDI-7272
 Project: Apache Hudi
  Issue Type: Improvement
  Components: docs
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-6112:
--
Fix Version/s: 1.1.0
   (was: 0.14.1)

> Improve Doc generation to generate config tables for basic and advanced 
> configs
> 
>
> Key: HUDI-6112
> URL: https://issues.apache.org/jira/browse/HUDI-6112
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> The HoodieConfigDocGenerator will need to be modified such that:
>  * Each config group has two sections: basic configs and advanced configs
>  * Basic configs and advanced configs are laid out in a table instead of 
> serially like today.
>  * Among each of these tables, the required configs are bubbled up to the top 
> of the table and highlighted.
> Add UI fixes to support a table layout
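A minimal sketch of the requested layout rule, using a hypothetical `Config` record rather than HoodieConfigDocGenerator's real types: render the configs as a markdown table with the required configs sorted to the top and highlighted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: field names and table format are assumptions, not the
// actual doc generator's output.
public class ConfigTableSketch {
    record Config(String name, String defaultValue, boolean required) {}

    static String renderTable(List<Config> configs) {
        List<Config> sorted = new ArrayList<>(configs);
        // required configs bubble up to the top of the table
        sorted.sort(Comparator.comparing(c -> !c.required()));
        StringBuilder sb = new StringBuilder("| Config | Default | Required |\n|---|---|---|\n");
        for (Config c : sorted) {
            sb.append("| ").append(c.name()).append(" | ").append(c.defaultValue())
              .append(" | ").append(c.required() ? "**yes**" : "no").append(" |\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(renderTable(List.of(
            new Config("hoodie.table.type", "COPY_ON_WRITE", false),
            new Config("hoodie.table.name", "", true))));
    }
}
```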





[jira] [Closed] (HUDI-6932) Fix batch size for delete partition for AWSGlueCatalogSyncClient

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6932.
-
Resolution: Fixed

> Fix batch size for delete partition for AWSGlueCatalogSyncClient
> 
>
> Key: HUDI-6932
> URL: https://issues.apache.org/jira/browse/HUDI-6932
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Github Issue - [https://github.com/apache/hudi/issues/9806]
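The fix is about chunking: AWS Glue's BatchDeletePartition API caps the number of partitions per request (25 at the time of writing), so the sync client must split a large partition list into batches. A generic sketch (not AWSGlueCatalogSyncClient's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative batching helper; the limit of 25 comes from Glue's
// BatchDeletePartition API and the method names are assumptions.
public class GlueBatchSketch {
    static <T> List<List<T>> chunk(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> partitions = new ArrayList<>();
        for (int i = 0; i < 60; i++) partitions.add("dt=2024-01-" + i);
        List<List<String>> batches = chunk(partitions, 25);
        System.out.println(batches.size());          // 3 batches: 25 + 25 + 10
        System.out.println(batches.get(2).size());   // 10
    }
}
```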





[jira] [Closed] (HUDI-7100) Data loss when using insert_overwrite_table with insert.drop.duplicates

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7100.
-
Resolution: Fixed

> Data loss when using insert_overwrite_table with insert.drop.duplicates
> ---
>
> Key: HUDI-7100
> URL: https://issues.apache.org/jira/browse/HUDI-7100
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.4, 0.14.1, 0.13.2
>
>
> Code to reproduce - 
> Github Issue - [https://github.com/apache/hudi/issues/9967]
> ```
> schema = StructType(
> [
> StructField("id", IntegerType(), True),
> StructField("name", StringType(), True)
> ]
> )
> data = [
> Row(1, "a"),
> Row(2, "a"),
> Row(3, "c"),
> ]
> hudi_configs = {
> "hoodie.table.name": TABLE_NAME,
> "hoodie.datasource.write.recordkey.field": "name",
> "hoodie.datasource.write.precombine.field": "id",
> "hoodie.datasource.write.operation":"insert_overwrite_table",
> "hoodie.table.keygenerator.class": 
> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
> }
> df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()
> -- Showing no records
> ```
> df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates","true").mode("append").save(PATH)
> spark.read.format("hudi").load(PATH).show()





[jira] [Closed] (HUDI-7120) Performance improvements in deltastreamer executor code path

2024-01-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7120.
-
  Assignee: Lokesh Jain
Resolution: Fixed

> Performance improvements in deltastreamer executor code path
> 
>
> Key: HUDI-7120
> URL: https://issues.apache.org/jira/browse/HUDI-7120
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Makes improvements based on findings from CPU profiling for the executor code 
> path.
> 1. Fixes repetitive execution of a string split operation
> 2. Reduces the number of validation calls





[jira] [Closed] (HUDI-6954) Corrupted column stats in metadata table in non-partitioned table

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6954.
-
  Assignee: sivabalan narayanan
Resolution: Fixed

> Corrupted column stats in metadata table in non-partitioned table
> -
>
> Key: HUDI-6954
> URL: https://issues.apache.org/jira/browse/HUDI-6954
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> After compaction in MDT, the column stats entries in the metadata table for a 
> non-partitioned data table are corrupted, with a wrong encoded part of the 
> partition path in the key.  This makes some column stats entries not 
> searchable through the key based on column name, partition path, and file 
> name, as the key is wrong in the column stats partition in MDT. 





[jira] [Updated] (HUDI-7135) Spark reads hudi table error when flink creates the table without preCombine fields by catalog or factory

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7135:
--
Fix Version/s: 0.14.1

> Spark reads hudi table error when flink creates the table without preCombine 
> fields by catalog or factory
> -
>
> Key: HUDI-7135
> URL: https://issues.apache.org/jira/browse/HUDI-7135
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>
> Create a table through the dfs catalog, hms catalog, or sink DDL, and then query 
> the data of the table through Spark; an exception occurs:
> java.util.NoSuchElementException: key not found: ts
> demo:
>  1. create a table through hms catalog:
> {panel:title=hms catalog create table}
> CREATE CATALOG hudi_catalog WITH(
> 'type' = 'hudi',
> 'mode' = 'hms'
> );
> CREATE TABLE hudi_catalog.`default`.ct1
> (
>   f1 string,
>   f2 string
> ) WITH (
>   'connector' = 'hudi',
>   'path' = 'file:///Users/x/x/others/data/hudi-warehouse/ct1',
>   'table.type' = 'COPY_ON_WRITE',
>   'write.operation' = 'insert'
> );
> {panel}
> 2. spark query
> {panel:title=spark query}
> select * from ct1
> {panel}
> 3. exception
> {panel:title=exception}
> java.util.NoSuchElementException: key not found: ts
> {panel}
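A minimal illustration of this failure mode and a defensive lookup. The property name and helper are hypothetical, but the mechanism matches the report: the Spark read path assumes a precombine field ("ts") is present in the table properties, while a Flink-created table without one has no such key.

```java
import java.util.Map;

// Sketch only: a throwing lookup fails with "key not found: ts" semantics,
// while a defaulting lookup tolerates tables created without a precombine
// field. "hoodie.table.precombine.field" is used here for illustration.
public class PrecombineLookupSketch {
    static String precombineOrDefault(Map<String, String> tableProps) {
        return tableProps.getOrDefault("hoodie.table.precombine.field", "");
    }

    public static void main(String[] args) {
        Map<String, String> flinkCreated = Map.of("hoodie.table.name", "ct1");
        // no precombine field was configured; lookup degrades gracefully
        System.out.println(precombineOrDefault(flinkCreated).isEmpty()); // true
    }
}
```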





[jira] [Closed] (HUDI-6012) delete base path when failed to run bootstrap procedure

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6012.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> delete base path when failed to run bootstrap procedure
> ---
>
> Key: HUDI-6012
> URL: https://issues.apache.org/jira/browse/HUDI-6012
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: bootstrap
>Reporter: lvyanquan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> [run_bootstrap](https://hudi.apache.org/docs/next/procedures#run_bootstrap) 
> procedure is called like this 
> {code:java}
> call run_bootstrap(table => 'test_hudi_table', table_type => 'COPY_ON_WRITE', 
> bootstrap_path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table', 
> base_path => 'hdfs://ns1//tmp/hoodie/test_hudi_table', rowKey_field => 'id', 
> partition_path_field => 'dt'); {code}
> In some exceptional cases this procedure will fail, for example, when bootstrap_path 
> does not exist or is empty.  The `base_path` in HDFS still remains, with a 
> `.hoodie` directory.
> Though we can still rerun the bootstrap procedure and pass the `bootstrap_overwrite` 
> parameter, it's better to clean up this path that we created after a failure.
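The cleanup-on-failure idea can be sketched as follows, using local files in place of HDFS. Method names (`bootstrapWithCleanup`, `runBootstrap`) are illustrative, not the procedure's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

// Hedged sketch: if bootstrap fails, delete the partially created base path
// (including its .hoodie directory) so a rerun does not need
// bootstrap_overwrite.
public class BootstrapCleanupSketch {
    static void bootstrapWithCleanup(Path basePath, Runnable runBootstrap) {
        try {
            runBootstrap.run();
        } catch (RuntimeException e) {
            deleteRecursively(basePath);  // remove everything we just created
            throw e;
        }
    }

    static void deleteRecursively(Path basePath) {
        try (var walk = Files.walk(basePath)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // returns true when a failed bootstrap leaves no base path behind
    static boolean demoCleansUp() {
        try {
            Path base = Files.createTempDirectory("hudi-base");
            Files.createDirectories(base.resolve(".hoodie"));
            try {
                bootstrapWithCleanup(base, () -> {
                    throw new RuntimeException("bootstrap_path is empty");
                });
            } catch (RuntimeException expected) {
                // failure still propagates, but only after cleanup
            }
            return !Files.exists(base);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoCleansUp()); // true
    }
}
```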





[jira] [Closed] (HUDI-6094) Make Kafka send record from async to sync

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6094.
-
Fix Version/s: 1.1.0
   Resolution: Fixed

> Make Kafka send record from async to sync
> -
>
> Key: HUDI-6094
> URL: https://issues.apache.org/jira/browse/HUDI-6094
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: DuBin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> In the call method of HoodieWriteCommitKafkaCallback in the hudi-utilities module, 
> the Kafka send is async; how about making the send synchronous to ensure the 
> Kafka send call completes? There is no performance degradation, because the send 
> call is in a try-with-resources block.
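The async-vs-sync distinction can be sketched with a plain Future standing in for the Kafka producer (which is not available here): an async send fires and forgets, while a sync send blocks on the returned future the same way `producer.send(record).get()` would.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative sketch, not the callback's actual code. The daemon thread
// stands in for the Kafka producer's I/O thread.
public class SyncSendSketch {
    static final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    static Future<String> sendAsync(String record) {
        return pool.submit(() -> "ack:" + record);  // fire-and-forget, like producer.send(record)
    }

    static String sendSync(String record) {
        try {
            // blocking on get() guarantees the send completed before returning,
            // which is what moving from async to sync buys us
            return sendAsync(record).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendSync("commit-metadata")); // ack:commit-metadata
    }
}
```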





[jira] [Closed] (HUDI-7190) Spark33LegacyHoodieParquetFileFormat failed to read parquet when nested type vectorized read enable

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7190.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Spark33LegacyHoodieParquetFileFormat failed to read parquet when nested type 
> vectorized read enable
> ---
>
> Key: HUDI-7190
> URL: https://issues.apache.org/jira/browse/HUDI-7190
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Qijun Fu
>Assignee: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
> Attachments: image-2023-12-07-15-41-29-452.png
>
>
> For Spark 3.3+, we can do vectorized reads for nested columns. However, when 
> `spark.sql.parquet.enableNestedColumnVectorizedReader = true` and 
> `spark.sql.parquet.enableVectorizedReader = true` are set, Hudi will 
> throw the following exception: 
>  !image-2023-12-07-15-41-29-452.png! 
> We need to fix it.





[jira] [Closed] (HUDI-7223) Hudi Cleaner removing files still required for view N hours old

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7223.
-
Fix Version/s: 0.14.1
 Assignee: Timothy Brown
   Resolution: Fixed

> Hudi Cleaner removing files still required for view N hours old
> ---
>
> Key: HUDI-7223
> URL: https://issues.apache.org/jira/browse/HUDI-7223
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> If a user is using a time-based cleaner policy, they will expect that they can 
> query the table state as of N hours ago. This means that they do not want to 
> clean up files merely older than N hours, but rather files that are no longer 
> relevant to the table state as of N hours ago. 
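The retention rule described above can be sketched on one file group: keep every version committed after the N-hour cutoff, plus the newest version at or before the cutoff, since that one is still needed to serve queries as of N hours ago. This is an illustration, not the actual cleaner code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: commitTimes are ascending commit times (epoch millis) of the
// versions of one file group; everything not retained may be cleaned.
public class TimeBasedRetentionSketch {
    static List<Long> versionsToRetain(List<Long> commitTimes, long cutoff) {
        List<Long> retain = new ArrayList<>();
        Long latestBeforeCutoff = null;
        for (long t : commitTimes) {
            if (t <= cutoff) {
                latestBeforeCutoff = t;   // candidate: newest version as of the cutoff
            } else {
                retain.add(t);            // newer than cutoff: always keep
            }
        }
        if (latestBeforeCutoff != null) {
            retain.add(0, latestBeforeCutoff);  // still serves "as of N hours ago" queries
        }
        return retain;
    }

    public static void main(String[] args) {
        // versions at t=1,2,3,4; cutoff at 2 -> only t=1 is safe to clean
        System.out.println(versionsToRetain(List.of(1L, 2L, 3L, 4L), 2L)); // [2, 3, 4]
    }
}
```

Cleaning purely by file age would also drop the t=2 version, breaking N-hour-old queries; that is the bug the ticket fixes.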





[jira] [Updated] (HUDI-7215) Delete NewHoodieParquetFileFormat and all references

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7215:
--
Status: Patch Available  (was: In Progress)

> Delete NewHoodieParquetFileFormat and all references
> 
>
> Key: HUDI-7215
> URL: https://issues.apache.org/jira/browse/HUDI-7215
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> HoodieFileGroupReaderBasedParquetFileFormat now has feature parity with 
> NewHoodieParquetFileFormat and no new work will be done on 
> NewHoodieParquetFileFormat. 





[jira] [Closed] (HUDI-7213) When using a wrong table.type value in the hudi catalog an NPE happens

2023-12-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7213.
-
Fix Version/s: 0.14.1
   (was: 1.0.0)
   Resolution: Fixed

>  When using a wrong table.type value in the hudi catalog an NPE happens
> --
>
> Key: HUDI-7213
> URL: https://issues.apache.org/jira/browse/HUDI-7213
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: jack Lei
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
>  
> {code:java}
> -- create the hudi result table
> create table IF NOT EXISTS hudi_catalog.tmp.test_metric_hudi_metastore_mor_new
> (
>`id` string COMMENT 'id',
>`name` string COMMENT '名称',
>`pmerge` string COMMENT '合并字段',
> dt string
> ) PARTITIONED BY (dt) WITH 
> ('connector'='hudi',   
>  'table.type'='MERGE_ON_WRITE',
>  'write.operation'='insert'); {code}
> table.type is wrong; 
> then the following appears:
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieCatalogException: Failed to create 
> table tmp.test_metric_hudi_metastore_mor_newat 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.createTable(HoodieHiveCatalog.java:480)
> at 
> org.apache.flink.table.catalog.CatalogManager.lambda$createTable$10(CatalogManager.java:661)
> at 
> org.apache.flink.table.catalog.CatalogManager.execute(CatalogManager.java:841)
> ... 22 moreCaused by: java.lang.NullPointerExceptionat 
> java.util.HashMap.merge(HashMap.java:1225)at 
> java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)at 
> java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
> at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1699)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.hudi.table.catalog.TableOptionProperties.translateFlinkTableProperties2Spark(TableOptionProperties.java:191)
> at 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.instantiateHiveTable(HoodieHiveCatalog.java:610)
> at 
> org.apache.hudi.table.catalog.HoodieHiveCatalog.createTable(HoodieHiveCatalog.java:469)
> ... 24 more {code}
>  
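The NPE in the stack trace above comes from `Collectors.toMap`, which uses `HashMap.merge` internally and throws `NullPointerException` whenever a mapped value is null. Here an unrecognized `table.type` has no translation, so the value mapper yields null. A sketch of the mechanism (the property mapping below is invented for illustration, not HoodieHiveCatalog's actual table):

```java
import java.util.Map;
import java.util.stream.Collectors;

// Demonstrates the failure mode and a null-safe alternative that filters
// (or could explicitly reject) values with no translation.
public class ToMapNpeSketch {
    static String lookup(String flinkValue) {
        // only valid table types translate; "MERGE_ON_WRITE" is not one of them
        return Map.of("MERGE_ON_READ", "mor", "COPY_ON_WRITE", "cow").get(flinkValue);
    }

    static Map<String, String> translateUnsafe(Map<String, String> props) {
        return props.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> lookup(e.getValue())));
    }

    static Map<String, String> translateSafe(Map<String, String> props) {
        return props.entrySet().stream()
            .filter(e -> lookup(e.getValue()) != null)  // drop unmapped values up front
            .collect(Collectors.toMap(Map.Entry::getKey, e -> lookup(e.getValue())));
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("table.type", "MERGE_ON_WRITE");
        try {
            translateUnsafe(props);
        } catch (NullPointerException npe) {
            System.out.println("NPE at HashMap.merge, as in the report");
        }
        System.out.println(translateSafe(props)); // {}
    }
}
```

A friendlier fix would validate `table.type` and raise a descriptive error instead of silently dropping it.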





[jira] [Closed] (HUDI-5760) Make sure DeleteBlock doesn't use Kryo for serialization to disk

2023-12-13 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-5760.
-
Fix Version/s: 0.14.0
   (was: 0.14.1)
   Resolution: Fixed

> Make sure DeleteBlock doesn't use Kryo for serialization to disk
> 
>
> Key: HUDI-5760
> URL: https://issues.apache.org/jira/browse/HUDI-5760
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 1.0.0-beta1
>Reporter: Alexey Kudinkin
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The problem is that the serialization of `HoodieDeleteBlock` is generated 
> dynamically by Kryo and could change whenever any class comprising it 
> changes.
> We've been bitten by this already twice:
> HUDI-5758
> HUDI-4959
>  
> Instead, anything that is persisted on disk has to be serialized using 
> hard-coded methods (the same way HoodieDataBlock is serialized)
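The contrast the ticket draws can be sketched as a hard-coded, versioned wire format: the field order and version byte are explicit in code, so the bytes stay readable across refactors, whereas a Kryo-generated layout shifts whenever a field changes. The format below is made up for illustration, not HoodieDeleteBlock's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Explicit serialization sketch: version, count, then each key.
public class ExplicitSerdeSketch {
    static final int FORMAT_VERSION = 1;

    static byte[] serialize(String[] keysToDelete) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(FORMAT_VERSION);        // explicit version, explicit field order
            out.writeInt(keysToDelete.length);
            for (String k : keysToDelete) out.writeUTF(k);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String[] deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int version = in.readInt();          // old readers can branch on the version
            if (version != FORMAT_VERSION) throw new IllegalStateException("unsupported version " + version);
            int n = in.readInt();
            String[] keys = new String[n];
            for (int i = 0; i < n; i++) keys[i] = in.readUTF();
            return keys;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] roundTrip = deserialize(serialize(new String[]{"key1", "key2"}));
        System.out.println(roundTrip[1]); // key2
    }
}
```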





[jira] [Created] (HUDI-7228) Close LogFileReaders eagerly with LogRecordReader

2023-12-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7228:
-

 Summary: Close LogFileReaders eagerly with LogRecordReader
 Key: HUDI-7228
 URL: https://issues.apache.org/jira/browse/HUDI-7228
 Project: Apache Hudi
  Issue Type: Improvement
  Components: reader-core
Reporter: sivabalan narayanan








[jira] [Closed] (HUDI-7206) Fix auto deletion of MDT

2023-12-10 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7206.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Fix auto deletion of MDT
> 
>
> Key: HUDI-7206
> URL: https://issues.apache.org/jira/browse/HUDI-7206
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> With 0.14.0, we are triggering deletion of the MDT and updating hoodie.properties 
> even if it's already disabled and not present. 
>  
> {code:java}
> private boolean shouldExecuteMetadataTableDeletion() {
>   // Only execute metadata table deletion when all the following conditions 
> are met
>   // (1) This is data table
>   // (2) Metadata table is disabled in HoodieWriteConfig for the writer
>   return !metaClient.isMetadataTable()
>   && !config.isMetadataTableEnabled();
> } {code}





[jira] [Created] (HUDI-7206) Fix auto deletion of MDT

2023-12-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7206:
-

 Summary: Fix auto deletion of MDT
 Key: HUDI-7206
 URL: https://issues.apache.org/jira/browse/HUDI-7206
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


With 0.14.0, we are triggering deletion of the MDT and updating hoodie.properties 
even if it's already disabled and not present. 

 
{code:java}
private boolean shouldExecuteMetadataTableDeletion() {
  // Only execute metadata table deletion when all the following conditions are 
met
  // (1) This is data table
  // (2) Metadata table is disabled in HoodieWriteConfig for the writer
  return !metaClient.isMetadataTable()
  && !config.isMetadataTableEnabled();
} {code}
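A hedged sketch of the guard the ticket implies: besides "this is the data table" and "MDT is disabled in the write config", also check that the metadata table actually exists before triggering deletion (and the hoodie.properties update that comes with it). The path layout and method shape below are illustrative, not the actual fix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: local files stand in for the table's storage.
public class MdtDeletionGuardSketch {
    static boolean shouldExecuteMetadataTableDeletion(
            boolean isMetadataTable, boolean metadataEnabled, Path basePath) {
        return !isMetadataTable
            && !metadataEnabled
            && Files.exists(basePath.resolve(".hoodie/metadata")); // extra guard: skip if already gone
    }

    // positive case: an MDT directory exists, so deletion (and the
    // hoodie.properties update) is warranted exactly once
    static boolean demoWhenMdtPresent() {
        try {
            Path base = Files.createTempDirectory("tbl");
            Files.createDirectories(base.resolve(".hoodie/metadata"));
            return shouldExecuteMetadataTableDeletion(false, false, base);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // MDT never created: nothing to delete, hoodie.properties left untouched
        System.out.println(shouldExecuteMetadataTableDeletion(false, false, Path.of("no-such-table"))); // false
        System.out.println(demoWhenMdtPresent()); // true
    }
}
```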





[jira] [Created] (HUDI-7205) Optimize MDT table deletion

2023-12-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7205:
-

 Summary: Optimize MDT table deletion
 Key: HUDI-7205
 URL: https://issues.apache.org/jira/browse/HUDI-7205
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


Hudi tries to honor MDT disablement as part of every write. But the deletion is 
triggered every time, even if the table does not exist and all configs are 
already disabled. 

 

This results in updating hoodie.properties repeatedly and can run into 
concurrency issues. 

 
{code:java}
23/12/07 04:34:32 ERROR DagScheduler: Exception executing node
org.apache.hudi.exception.HoodieIOException: Error updating table configs.
        at 
org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:445)
        at 
org.apache.hudi.common.table.HoodieTableConfig.update(HoodieTableConfig.java:454)
        at 
org.apache.hudi.common.table.HoodieTableConfig.setMetadataPartitionState(HoodieTableConfig.java:780)
        at 
org.apache.hudi.common.table.HoodieTableConfig.clearMetadataPartitions(HoodieTableConfig.java:811)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:1412)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.deleteMetadataTable(HoodieTableMetadataUtil.java:289)
        at 
org.apache.hudi.table.HoodieTable.maybeDeleteMetadataTable(HoodieTable.java:953)
        at 
org.apache.hudi.table.HoodieSparkTable.getMetadataWriter(HoodieSparkTable.java:116)
        at 
org.apache.hudi.table.HoodieTable.getMetadataWriter(HoodieTable.java:905)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:360)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:286)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:236)
        at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:104){code}
{code:java}
        at 
org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:826)
        at 
org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:450)
        at 
org.apache.hudi.integ.testsuite.HoodieDeltaStreamerWrapper.upsert(HoodieDeltaStreamerWrapper.java:48)
        at 
org.apache.hudi.integ.testsuite.HoodieDeltaStreamerWrapper.insert(HoodieDeltaStreamerWrapper.java:52)
        at 
org.apache.hudi.integ.testsuite.HoodieInlineTestSuiteWriter.insert(HoodieInlineTestSuiteWriter.java:111)
        at 
org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.ingest(InsertNode.java:70)
        at 
org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:53)
        at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:135)
        at 
org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:104)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
s3a://jenkins-infra-hudi/hudi/job-run/HudiIntegTestsDeltastreamerAsyncManualEKS/data/2023-12-07/30/MERGE_ON_READdeltastreamer-non-partitioned.yamltest-nonpartitioned.properties/91/output/.hoodie/hoodie.properties
 already exists
        at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:813)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195){code}
{code:java}
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1064)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$create$2(HoodieWrapperFileSystem.java:238)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114)
        at 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.create(HoodieWrapperFileSystem.java:237)
        at 
org.apache.hudi.common.table.HoodieTableConfig.recoverIfNeeded(HoodieTableConfig.java:389)
        at 
org.apache.hudi.common.table.HoodieTableConfig.modify(HoodieTableConfig.java:410)
        ... 26 more
23/12/07 04:34:32 INFO DagScheduler: Forcing shutdown of executor service, this 
might kill running tasks
23/12/07 04:34:32 ERROR HoodieTestSuiteJob: Failed to run Test Suite 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieIOException: Error updating table configs.
        at 

[jira] [Created] (HUDI-7199) Optimize instantsAsStream in HoodieDefaultTimeline

2023-12-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7199:
-

 Summary: Optimize instantsAsStream in HoodieDefaultTimeline
 Key: HUDI-7199
 URL: https://issues.apache.org/jira/browse/HUDI-7199
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan


Optimize instantsAsStream in HoodieDefaultTimeline





[jira] [Created] (HUDI-7188) Master is failing due to test failure Dec 6, 2023

2023-12-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7188:
-

 Summary: Master is failing due to test failure Dec 6, 2023
 Key: HUDI-7188
 URL: https://issues.apache.org/jira/browse/HUDI-7188
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan


After this patch, master is broken: 

[https://github.com/apache/hudi/pull/9667]

 

 





[jira] [Created] (HUDI-7187) Fix integ test props to honor new streamer properties

2023-12-06 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7187:
-

 Summary: Fix integ test props to honor new streamer properties 
 Key: HUDI-7187
 URL: https://issues.apache.org/jira/browse/HUDI-7187
 Project: Apache Hudi
  Issue Type: Improvement
  Components: tests-ci
Reporter: sivabalan narayanan


As of now, all integ test properties files hold deltastreamer props. We 
need to change them to streamer props. 





[jira] [Commented] (HUDI-7051) Incorrect replace operation in compaction strategy filter

2023-12-05 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793494#comment-17793494
 ] 

sivabalan narayanan commented on HUDI-7051:
---

hey [~vmaster]: 
sorry, I am a bit confused. 

as per master, filterPartitionPaths in DayBasedCompactionStrategy is as below: 

{code:java}
@Override
public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
  return allPartitionPaths.stream().sorted(comparator)
      .collect(Collectors.toList())
      .subList(0, Math.min(allPartitionPaths.size(), writeConfig.getTargetPartitionsPerDayBasedCompaction()));
} {code}
 

 

Only in 
BoundedPartitionAwareCompactionStrategy.filterPartitionPaths do I see the replace 
operations. 

But can you help me understand what the issue is there? I understand 
"dllr_date=2023/10/10" may not be an actual partition present physically, but 
that's an interim state used for comparison, and later we switch it back. 

in other words, 

if the original partition is hyphenated: 

dllr_date=2023-10-10 gets converted to "dllr_date=2023/10/10", and then 
comparisons are performed to sort them, and then it is converted back to 
dllr_date=2023-10-10. So I am not sure where the bug is here. Can you throw some 
light please?
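The round trip under discussion can be reproduced in isolation. A sketch of the strategy's transform (simplified from the quoted code, not the exact Hudi class): '/' is swapped for '-' as an interim form for reverse-order sorting, then swapped back. The caveat raised in the thread shows up directly: a path that already contains '-' does not survive the final `replace("-", "/")`.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Standalone reproduction of the sort-by-replace round trip.
public class PartitionSortSketch {
    static List<String> sortLikeStrategy(List<String> partitionPaths) {
        return partitionPaths.stream()
            .map(p -> p.replace("/", "-"))          // interim form for comparison
            .sorted(Comparator.reverseOrder())
            .map(p -> p.replace("-", "/"))          // switch back
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // slash-separated paths survive the round trip unchanged
        System.out.println(sortLikeStrategy(List.of("2023/10/09", "2023/10/10")));
        // hyphenated hive-style paths do not: every '-' becomes '/'
        System.out.println(sortLikeStrategy(List.of("dllr_date=2023-10-10")));
    }
}
```

For slash-separated day partitions the conversion is lossless; for hive-style hyphenated values like `dllr_date=2023-10-10` the output is `dllr_date=2023/10/10`, which is the corruption the reporter describes.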

 

> Incorrect replace operation in compaction strategy filter
> -
>
> Key: HUDI-7051
> URL: https://issues.apache.org/jira/browse/HUDI-7051
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction
>Reporter: vmaster.cc
>Priority: Major
> Attachments: image-2023-11-08-16-01-46-166.png, 
> image-2023-11-08-16-02-39-291.png
>
>
> There are some incorrect replace operations used to sort all partition paths.
> {code:java}
> return allPartitionPaths.stream().map(partition -> partition.replace("/", 
> "-"))
> .sorted(Comparator.reverseOrder()).map(partitionPath -> 
> partitionPath.replace("-", "/")) {code}
> the hive partition before the replace is dllr_date=2023-10-10, which afterwards 
> is converted to dllr_date=2023/10/10; this is an incorrect partition.
>  # org.apache.hudi.table.action.compact.strategy.DayBasedCompactionStrategy
>  # 
> org.apache.hudi.table.action.compact.strategy.BoundedPartitionAwareCompactionStrategy
>  # 
> org.apache.hudi.table.action.compact.strategy.UnBoundedPartitionAwareCompactionStrategy
> !image-2023-11-08-16-02-39-291.png!





[jira] [Assigned] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7154:
-

Assignee: sivabalan narayanan

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}





[jira] [Closed] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch

2023-12-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7154.
-
Resolution: Fixed

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---
>
> Key: HUDI-7154
> URL: https://issues.apache.org/jira/browse/HUDI-7154
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the 
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>   at 
> org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>   at 
> org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}





[jira] [Assigned] (HUDI-6980) Spark job stuck after completion, due to some non daemon threads still running

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-6980:
-

Assignee: sivabalan narayanan

> Spark job stuck after completion, due to some non daemon threads still 
> running 
> ---
>
> Key: HUDI-6980
> URL: https://issues.apache.org/jira/browse/HUDI-6980
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Github Issue - [https://github.com/apache/hudi/issues/9826]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6958) Update Schema Evolution Documentation

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6958.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Update Schema Evolution Documentation
> -
>
> Key: HUDI-6958
> URL: https://issues.apache.org/jira/browse/HUDI-6958
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Update the schema evolution page to document the new changes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6961) Deletes with custom delete field not working with DefaultHoodieRecordPayload

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6961.
-
Resolution: Fixed

> Deletes with custom delete field not working with DefaultHoodieRecordPayload
> 
>
> Key: HUDI-6961
> URL: https://issues.apache.org/jira/browse/HUDI-6961
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When configuring a custom delete key and delete marker with 
> DefaultHoodieRecordPayload, writing fails when there are deletes in the batch:
> {code:java}
> Error for key:HoodieKey { recordKey=0 partitionPath=} is 
> java.util.NoSuchElementException: No value present in Option
>   at org.apache.hudi.common.util.Option.get(Option.java:89)
>   at 
> org.apache.hudi.common.model.HoodieAvroRecord.prependMetaFields(HoodieAvroRecord.java:132)
>   at 
> org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:144)
>   at 
> org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:180)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42)
>   at 
> org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:69)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80)
>   at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39)
>   at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6979) support EventTimeBasedCompactionStrategy

2023-11-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791711#comment-17791711
 ] 

sivabalan narayanan commented on HUDI-6979:
---

this will definitely be a good addition

 

> support EventTimeBasedCompactionStrategy
> 
>
> Key: HUDI-6979
> URL: https://issues.apache.org/jira/browse/HUDI-6979
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction
>Reporter: Kong Wei
>Assignee: Kong Wei
>Priority: Major
>
> The current compaction strategies are based on log file size, the number of 
> log files, etc. The event time covered by the RO table generated by these 
> strategies is uncontrollable. Hudi also has a DayBased strategy, but it 
> relies on a day-based partition path and its time granularity is coarse.
> The *EventTimeBasedCompactionStrategy* can generate event-time-friendly RO 
> tables, whether the table is day-partitioned or not. For example, the 
> strategy can select for compaction all log files whose data time is before 
> 3 am, so that the generated RO table only contains data before 3 am. If we 
> just want to query data before 3 am, we can query the RO table, which is 
> much faster.
> With this strategy, I think we can expand the application scenarios of RO 
> tables.
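The selection rule described above can be sketched in a few lines. This is a minimal stand-alone illustration, not Hudi's actual API: the `LogFile` record and `selectForCompaction` method are hypothetical stand-ins for the real file-slice and strategy abstractions, keeping only the core idea that every log file whose max event time falls before the cutoff is picked, so the compacted RO view is complete up to that cutoff.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class EventTimeCompactionSketch {
    // Hypothetical stand-in for a Hudi log file, tagged with the max event
    // time of the records it contains.
    record LogFile(String path, Instant maxEventTime) {}

    // Core idea of the proposed strategy: pick every log file whose data is
    // entirely before the cutoff, so the RO view is complete up to the cutoff.
    static List<LogFile> selectForCompaction(List<LogFile> files, Instant cutoff) {
        return files.stream()
                .filter(f -> f.maxEventTime().isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant threeAm = Instant.parse("2023-11-30T03:00:00Z");
        List<LogFile> files = List.of(
                new LogFile("log-1", Instant.parse("2023-11-30T01:15:00Z")),
                new LogFile("log-2", Instant.parse("2023-11-30T02:59:00Z")),
                new LogFile("log-3", Instant.parse("2023-11-30T04:10:00Z")));
        List<LogFile> picked = selectForCompaction(files, threeAm);
        if (picked.size() != 2) throw new AssertionError();
        picked.forEach(f -> System.out.println(f.path()));
    }
}
```

A real strategy would additionally need to derive each file's max event time from log block metadata rather than taking it as given.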



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6999) Add row writer support to Deltastreamer

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-6999.
-
Fix Version/s: 1.1.0
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Add row writer support to Deltastreamer
> ---
>
> Key: HUDI-6999
> URL: https://issues.apache.org/jira/browse/HUDI-6999
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> We have not yet leveraged row writer support in Deltastreamer. We can 
> benefit from a perf improvement if we integrate it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7000) Fix HoodieActiveTimeline::deleteInstantFileIfExists not show the file path when occur delete not success

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7000.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix HoodieActiveTimeline::deleteInstantFileIfExists not show the file path 
> when occur delete not success
> 
>
> Key: HUDI-7000
> URL: https://issues.apache.org/jira/browse/HUDI-7000
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Fix HoodieActiveTimeline::deleteInstantFileIfExists so that it shows the 
> file path when a delete does not succeed.
> When deleting some instants fails, only the failed instant is reported, 
> without the path, but users need the path to get more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7002) Add support for non-partitioned dataset w/ RLI

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7002.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Add support for non-partitioned dataset w/ RLI
> --
>
> Key: HUDI-7002
> URL: https://issues.apache.org/jira/browse/HUDI-7002
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We need to support RLI w/ non-partitioned datasets as well, both for 
> initializing RLI on an existing table and for new tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7006) reduce unnecessary isEmpty checks in StreamSync

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7006.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> reduce unnecessary isEmpty checks in StreamSync
> ---
>
> Key: HUDI-7006
> URL: https://issues.apache.org/jira/browse/HUDI-7006
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7004) Add support of snapshotLoadQuerySplitter in s3/gcs sources

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7004.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

> Add support of snapshotLoadQuerySplitter in s3/gcs sources
> --
>
> Key: HUDI-7004
> URL: https://issues.apache.org/jira/browse/HUDI-7004
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7003) Add option to fallback to full table scan for s3/gcs sources

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7003.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Add option to fallback to full table scan for s3/gcs sources
> 
>
> Key: HUDI-7003
> URL: https://issues.apache.org/jira/browse/HUDI-7003
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7009) Filter out null value records from avro kafka source

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7009.
-
Resolution: Fixed

> Filter out null value records from avro kafka source
> 
>
> Key: HUDI-7009
> URL: https://issues.apache.org/jira/browse/HUDI-7009
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7014) Follow up HUDI-6975, optimize the code of BoundedPartitionAwareCompactionStrategy

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7014.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Not A Problem

> Follow up HUDI-6975, optimize the code of 
> BoundedPartitionAwareCompactionStrategy
> -
>
> Key: HUDI-7014
> URL: https://issues.apache.org/jira/browse/HUDI-7014
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: kwang
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: compaction, pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7054) ShowPartitionsCommand should consider lazy delete_partitions

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7054.
-
Fix Version/s: 0.14.1
 Assignee: Hui An
   Resolution: Fixed

> ShowPartitionsCommand should consider lazy delete_partitions
> 
>
> Key: HUDI-7054
> URL: https://issues.apache.org/jira/browse/HUDI-7054
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7052) Fix partition key validation for key generators.

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7052.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix partition key validation for key generators.
> 
>
> Key: HUDI-7052
> URL: https://issues.apache.org/jira/browse/HUDI-7052
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7083) Support multiple table scraping w/ prometheus reporter

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7083.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Support multiple table scraping w/ prometheus reporter
> --
>
> Key: HUDI-7083
> URL: https://issues.apache.org/jira/browse/HUDI-7083
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metrics
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7084) Handle schema retrieval for hudi table w/ empty commits

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7084.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Handle schema retrieval for hudi table w/ empty commits
> ---
>
> Key: HUDI-7084
> URL: https://issues.apache.org/jira/browse/HUDI-7084
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7086) Scale GCS event source to consume large no of msgs from queue

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7086.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Scale GCS event source to consume large no of msgs from queue
> -
>
> Key: HUDI-7086
> URL: https://issues.apache.org/jira/browse/HUDI-7086
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We should be able to consume 100 or more messages in one batch for the GCS 
> event source. 
>  
> If the APIs have some limitations, let's invoke them multiple times before 
> ingesting to Hudi. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7095) Perf fixes to Json serde

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7095.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Perf fixes to Json serde 
> -
>
> Key: HUDI-7095
> URL: https://issues.apache.org/jira/browse/HUDI-7095
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> We could improve the performance of ObjectMapper usage. 
> Some objects (e.g., the Boolean type reference) are instantiated for every 
> request. We could avoid such repeated instantiations.
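The fix described above is a standard pattern: hoist thread-safe but expensive-to-build objects (such as Jackson's ObjectMapper or a TypeReference) into a single shared instance instead of constructing one per request. A minimal stdlib-only sketch of the idea follows; `CostlyCodec` is a hypothetical stand-in for the mapper, with a counter that makes the saved instantiations visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReuseSketch {
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    // Stand-in for an expensive, thread-safe object such as a JSON mapper;
    // each construction bumps the counter so reuse can be observed.
    static class CostlyCodec {
        CostlyCodec() { CONSTRUCTIONS.incrementAndGet(); }
        String encode(Object o) { return String.valueOf(o); }
    }

    // Anti-pattern: a fresh instance per request.
    static String encodePerRequest(Object o) {
        return new CostlyCodec().encode(o);
    }

    // Fix: hoist one shared instance; safe because the object is thread-safe.
    static final CostlyCodec SHARED = new CostlyCodec();

    static String encodeShared(Object o) {
        return SHARED.encode(o);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) encodeShared(i);
        // Only the single static initialization ever ran.
        if (CONSTRUCTIONS.get() != 1) throw new AssertionError();
    }
}
```

The same hoisting applies to `TypeReference` instances: declaring them as `static final` fields avoids re-creating them on every call.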



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7097) Handle the way hms Uri is instantiated w/ HiveSyncTool

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7097.
-
Fix Version/s: 0.14.1
 Assignee: sivabalan narayanan
   Resolution: Fixed

> Handle the way hms Uri is instantiated w/ HiveSyncTool
> --
>
> Key: HUDI-7097
> URL: https://issues.apache.org/jira/browse/HUDI-7097
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> In the current setup, we don't account for a mismatch between the provided 
> Hadoop conf and user-provided URIs. We need to compare both and set 
> appropriate values.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7095) Perf fixes to Json serde

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7095:
-

Assignee: sivabalan narayanan

> Perf fixes to Json serde 
> -
>
> Key: HUDI-7095
> URL: https://issues.apache.org/jira/browse/HUDI-7095
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> We could improve the performance of ObjectMapper usage. 
> Some objects (e.g., the Boolean type reference) are instantiated for every 
> request. We could avoid such repeated instantiations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7098) Add max bytes per partition w/ cloud store incr source

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7098.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Add max bytes per partition w/ cloud store incr source
> --
>
> Key: HUDI-7098
> URL: https://issues.apache.org/jira/browse/HUDI-7098
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7103) Enable Time travel queries for COW

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7103.
-
Resolution: Fixed

> Enable Time travel queries for COW
> --
>
> Key: HUDI-7103
> URL: https://issues.apache.org/jira/browse/HUDI-7103
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The goal of this task is to enable time travel queries for COW tables based 
> on HadoopFsRelation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7098) Add max bytes per partition w/ cloud store incr source

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7098:
-

Assignee: sivabalan narayanan

> Add max bytes per partition w/ cloud store incr source
> --
>
> Key: HUDI-7098
> URL: https://issues.apache.org/jira/browse/HUDI-7098
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7112) Allow reuse of timeline server across tables

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7112.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Allow reuse of timeline server across tables
> 
>
> Key: HUDI-7112
> URL: https://issues.apache.org/jira/browse/HUDI-7112
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When a user is running multiple writers in the same JVM, there is currently 
> a Javalin server created per table. This leads to unnecessary overhead, 
> since the timeline server can support multiple base paths.
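The reuse described above can be sketched as a JVM-wide registry that lazily starts one server and registers additional base paths on it. This is a stand-alone illustration, not Hudi's actual timeline-server API: `TimelineServer` and `getOrCreate` are hypothetical, and the counter only exists to show that a single server instance is shared across tables.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedTimelineServerSketch {
    static final AtomicInteger SERVERS_STARTED = new AtomicInteger();

    // Hypothetical stand-in for the embedded timeline server, which serves
    // file-system views and can host views for many table base paths.
    static class TimelineServer {
        final Set<String> basePaths = ConcurrentHashMap.newKeySet();
        TimelineServer() { SERVERS_STARTED.incrementAndGet(); }
        void register(String basePath) { basePaths.add(basePath); }
    }

    // One JVM-wide server reused by every writer, instead of one per table.
    private static TimelineServer shared;

    static synchronized TimelineServer getOrCreate(String basePath) {
        if (shared == null) shared = new TimelineServer();
        shared.register(basePath);
        return shared;
    }

    public static void main(String[] args) {
        getOrCreate("s3://bucket/table_a");
        getOrCreate("s3://bucket/table_b");
        getOrCreate("s3://bucket/table_c");
        // Three writers, one server process.
        if (SERVERS_STARTED.get() != 1) throw new AssertionError();
    }
}
```

The design point is that the per-table cost drops to registering a base path, while server start-up (port binding, thread pools) is paid once per JVM.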



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7108) Ensure schema is refreshed for every batch when using KafkaAvroSchemaDeserializer

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7108.
-
Fix Version/s: 0.14.1
 Assignee: Rajesh Mahindra
   Resolution: Fixed

> Ensure schema is refreshed for every batch when using 
> KafkaAvroSchemaDeserializer
> -
>
> Key: HUDI-7108
> URL: https://issues.apache.org/jira/browse/HUDI-7108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> In the Kafka AVRO source, the schema config for KafkaAvroSchemaDeserializer 
> is only set in the constructor and not refreshed for every batch. In 
> Deltastreamer continuous mode, this creates an issue when the schema of the 
> Kafka messages evolves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7106) Fix SQS deletes logic for S3 events source.

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7106.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix SQS deletes logic for S3 events source.
> ---
>
> Key: HUDI-7106
> URL: https://issues.apache.org/jira/browse/HUDI-7106
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Please fix a few things:
>  * SQS source should send delete API if no delete messages.
>  * Do not close the deltasync service within ingestOnce.
>  * Ensure the same catalog sync class is not called twice.
>  * Error table failure strategy should have a default value set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7115) Add more options for BigQuery Sync

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7115.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Add more options for BigQuery Sync
> --
>
> Key: HUDI-7115
> URL: https://issues.apache.org/jira/browse/HUDI-7115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> There are options for requiring a partition filter and for adding a BigLake 
> connection ID, which enable new access control features that users may want 
> in their environment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7127) Fix closure of Spark context in tests

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7127.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix closure of Spark context in tests
> -
>
> Key: HUDI-7127
> URL: https://issues.apache.org/jira/browse/HUDI-7127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> For the past 3 days, we have been seeing CI tests fail because the Spark 
> context is not properly shut down. 
>  
> Ref: 
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/21029/logs/28]
>  
>  
> stacktrace:
> {code:java}
> 2023-11-21T00:34:10.1321683Z [INFO] Tests run: 4, Failures: 0, Errors: 0, 
> Skipped: 0, Time elapsed: 12.375 s - in 
> org.apache.hudi.TestHoodieParquetBloomFilter
> 2023-11-21T00:34:10.1327664Z [INFO] Running org.apache.hudi.util.TestPathUtils
> 2023-11-21T00:34:10.2175283Z [INFO] Tests run: 1, Failures: 0, Errors: 0, 
> Skipped: 0, Time elapsed: 0.081 s - in org.apache.hudi.util.TestPathUtils
> 2023-11-21T00:34:10.2256379Z [INFO] Running 
> org.apache.hudi.io.storage.row.TestHoodieRowCreateHandle
> 2023-11-21T00:34:14.7733707Z [ERROR] Tests run: 5, Failures: 0, Errors: 5, 
> Skipped: 0, Time elapsed: 4.53 s <<< FAILURE! - in 
> org.apache.hudi.io.storage.row.TestHoodieRowCreateHandle
> 2023-11-21T00:34:14.7735064Z [ERROR] testInstantiationFailure{boolean}[1]  
> Time elapsed: 1.619 s  <<< ERROR!
> 2023-11-21T00:34:14.7743733Z org.apache.spark.SparkException: 
> 2023-11-21T00:34:14.7744752Z Only one SparkContext should be running in this 
> JVM (see SPARK-2243).The currently running SparkContext was created at:
> 2023-11-21T00:34:14.7745761Z 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
> 2023-11-21T00:34:14.7746644Z 
> org.apache.hudi.TestHoodieParquetBloomFilter.initSparkContext(TestHoodieParquetBloom.scala:47)
> 2023-11-21T00:34:14.7747562Z 
> org.apache.hudi.TestHoodieParquetBloomFilter.setUp(TestHoodieParquetBloom.scala:57)
> 2023-11-21T00:34:14.7748262Z 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2023-11-21T00:34:14.7748971Z 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2023-11-21T00:34:14.7749798Z 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2023-11-21T00:34:14.7750464Z java.lang.reflect.Method.invoke(Method.java:498)
> 2023-11-21T00:34:14.7751108Z 
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> 2023-11-21T00:34:14.7752199Z 
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> 2023-11-21T00:34:14.7753349Z 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> 2023-11-21T00:34:14.7754508Z 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
> 2023-11-21T00:34:14.7755527Z 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptLifecycleMethod(TimeoutExtension.java:126)
> 2023-11-21T00:34:14.7756642Z 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptBeforeEachMethod(TimeoutExtension.java:76)
> 2023-11-21T00:34:14.7758217Z 
> org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
> 2023-11-21T00:34:14.7759419Z 
> org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
> 2023-11-21T00:34:14.7760646Z 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
> 2023-11-21T00:34:14.7761922Z 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> 2023-11-21T00:34:14.7763079Z 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
> 2023-11-21T00:34:14.7764245Z 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
> 2023-11-21T00:34:14.7765256Z 
> org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
> 2023-11-21T00:34:14.7766227Z  at 
> org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
> 2023-11-21T00:34:14.7766950Z  at scala.Option.foreach(Option.scala:407)
> 2023-11-21T00:34:14.7767574Z  at 
> org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
> 2023-11-21T00:34:14.7768453Z  at 
> 

[jira] [Closed] (HUDI-7138) Fix instantiation issues with ErrorTableWriter and Schema Registry Provider

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7138.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Fix instantiation issues with ErrorTableWriter and Schema Registry Provider
> ---
>
> Key: HUDI-7138
> URL: https://issues.apache.org/jira/browse/HUDI-7138
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Two important fixes:
> - Ensure the ErrorTable class is serializable
> - Fix the schema registry provider when the schema converter class is NOT 
> configured



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7160) Avro Schema Properties are dropped when adding Hoodie Metadata columns

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7160.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Avro Schema Properties are dropped when adding Hoodie Metadata columns
> --
>
> Key: HUDI-7160
> URL: https://issues.apache.org/jira/browse/HUDI-7160
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> When we add the metadata columns to an existing Avro schema, the properties 
> set on that schema are dropped. We should allow these properties to be 
> carried through.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7161) Add commit action type and extra metadata to write callback on commit message

2023-11-30 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-7161.
-
Fix Version/s: 0.14.1
   Resolution: Fixed

> Add commit action type and extra metadata to write callback on commit message
> --
>
> Key: HUDI-7161
> URL: https://issues.apache.org/jira/browse/HUDI-7161
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> Add commit action type and extra metadata to write callback on commit message
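The improvement is to enrich the payload delivered to write-commit callbacks with the action type and any extra commit metadata. A minimal sketch of such a payload, assuming illustrative field names rather than Hudi's actual `HoodieWriteCommitCallbackMessage`:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteCallbackMessage:
    """Illustrative callback payload; field names are assumptions,
    not the actual Hudi callback message class."""
    commit_time: str
    table_name: str
    base_path: str
    commit_action_type: str                 # e.g. "commit" vs. "deltacommit"
    extra_metadata: dict = field(default_factory=dict)
```

Carrying `commit_action_type` lets a callback consumer distinguish, say, a compaction commit from a delta commit, and `extra_metadata` exposes whatever key/value pairs the writer attached to the commit.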



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6933) bulk_insert Fails if one of the composite key fields contains null

2023-11-29 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791365#comment-17791365
 ] 

sivabalan narayanan commented on HUDI-6933:
---

https://github.com/apache/hudi/pull/10214

> bulk_insert Fails if one of the composite key fields contains null
> ---
>
> Key: HUDI-6933
> URL: https://issues.apache.org/jira/browse/HUDI-6933
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.1
>
>
> Github Issue- [https://github.com/apache/hudi/issues/9799]
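The failure mode here is a composite record key where one key field is null. A null-tolerant sketch of composite key construction is shown below; the function name and the `__null__` sentinel are assumptions for illustration, not Hudi's actual `ComplexKeyGenerator` (which is expected to handle or reject null key parts consistently across write paths, including bulk_insert's row writer):

```python
NULL_PLACEHOLDER = "__null__"  # illustrative sentinel, not a Hudi constant


def composite_record_key(record: dict, key_fields: list) -> str:
    # Build "field1:value1,field2:value2,..." while tolerating a null
    # key part, instead of crashing when any field value is None.
    parts = []
    for f in key_fields:
        value = record.get(f)
        parts.append(f"{f}:{NULL_PLACEHOLDER if value is None else value}")
    return ",".join(parts)
```

The point of the sketch is that every write path should go through one key-generation routine with a single, explicit null policy, so bulk_insert cannot diverge from upsert behavior.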



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6933) bulk_insert Fails if one of the composite key fields contains null

2023-11-29 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-6933:
-

Assignee: sivabalan narayanan

> bulk_insert Fails if one of the composite key fields contains null
> ---
>
> Key: HUDI-6933
> URL: https://issues.apache.org/jira/browse/HUDI-6933
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Aditya Goenka
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.1
>
>
> Github Issue- [https://github.com/apache/hudi/issues/9799]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

