Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1571806302


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig hoodieWriteConfig) {
+    long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+    long fileSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
+    try {

Review Comment:
   gotcha. makes sense. will address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7641] Adding metadata enablement metrics [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11053:
URL: https://github.com/apache/hudi/pull/11053#issuecomment-2065773139

   
   ## CI report:
   
   * 3f7d727e83f05cb5ce7f9a3da2bfffca72686345 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23359)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]

2024-04-18 Thread via GitHub


the-other-tim-brown commented on code in PR #10763:
URL: https://github.com/apache/hudi/pull/10763#discussion_r1571784969


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicLong;
+
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * Util class to assist with fetching average record size.
+ */
+public class AverageRecordSizeUtils {
+  private static final Logger LOG = LoggerFactory.getLogger(AverageRecordSizeUtils.class);
+
+  /**
+   * Obtains the average record size based on records written during previous commits. Used for estimating how many
+   * records pack into one file.
+   */
+  static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig hoodieWriteConfig) {
+    long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate();
+    long fileSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
+    try {

Review Comment:
   @nsivabalan I think this try/catch should be done at the instant parsing level (line 59). If there is only a single failure to read the commit metadata, then we should still attempt to use the other commits' metadata.
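
   For illustration, a minimal sketch of what the suggested per-instant try/catch could look like (reusing the imports and the LOG field from the diff above; `averageBytesPerRecordPerInstant` is a placeholder name and the exact loop body in the PR may differ):
   
   ```java
   // Sketch only: catch failures per instant so that one unreadable commit does not
   // prevent falling back to the metadata of the remaining commits.
   static long averageBytesPerRecordPerInstant(HoodieTimeline commitTimeline, HoodieWriteConfig writeConfig) {
     long avgSize = writeConfig.getCopyOnWriteRecordSizeEstimate();
     long fileSizeThreshold = (long) (writeConfig.getRecordSizeEstimationThreshold() * writeConfig.getParquetSmallFileLimit());
     Iterator<HoodieInstant> instants =
         commitTimeline.filterCompletedInstants().getReverseOrderedInstants().iterator();
     while (instants.hasNext()) {
       HoodieInstant instant = instants.next();
       try {
         // try/catch at the instant-parsing level, as suggested above
         HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
             commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
         long totalBytesWritten = metadata.fetchTotalBytesWritten();
         long totalRecordsWritten = metadata.fetchTotalRecordsWritten();
         if (totalBytesWritten > fileSizeThreshold && totalRecordsWritten > 0) {
           return (long) Math.ceil((1.0 * totalBytesWritten) / totalRecordsWritten);
         }
       } catch (Exception e) {
         // Skip only this instant and keep scanning older commits.
         LOG.warn("Failed to read commit metadata for instant " + instant + ", skipping it", e);
       }
     }
     return avgSize;
   }
   ```
   
   With per-instant handling, a single unreadable commit is logged and skipped instead of discarding every other commit's metadata.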



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7641] Adding metadata enablement metrics [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11053:
URL: https://github.com/apache/hudi/pull/11053#issuecomment-2065732707

   
   ## CI report:
   
   * 3f7d727e83f05cb5ce7f9a3da2bfffca72686345 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10976:
URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065732532

   
   ## CI report:
   
   * 641e4e1885d174370cc7a4e438cc67a486a36b04 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23358)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1571765687


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   The cache is located in `HoodieDefaultTimeline`; both the `instants` and `instantTimeSet` variables are lazily initialized caches.
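
   For readers following along, a simplified sketch of the lazy cache-and-invalidate pattern being discussed (field and method names loosely follow the diff above; timeline accessors such as `filterPendingReplaceTimeline` and `getInstantsAsStream` are standard Hudi APIs, but the exact code in the PR may differ):
   
   ```java
   // Simplified sketch: compute the pending replace instants at most once per timeline refresh.
   private List<String> pendingReplaceInstants; // null means "not yet computed for the current timeline"
   
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
     this.pendingReplaceInstants = null; // invalidate so the next lookup recomputes
   }
   
   private List<String> getPendingReplaceInstants() {
     if (pendingReplaceInstants == null) {
       pendingReplaceInstants = metaClient.getActiveTimeline()
           .filterPendingReplaceTimeline()
           .getInstantsAsStream()
           .map(HoodieInstant::getTimestamp)
           .collect(Collectors.toList());
     }
     return pendingReplaceInstants;
   }
   ```
   
   `HoodieDefaultTimeline` applies the same idea internally: per the comment above, its `instants` and `instantTimeSet` fields are only materialized on first access and then reused.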



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7641:
-
Labels: pull-request-available  (was: )

> Add metrics to track what partitions are enabled in MDT
> ---
>
> Key: HUDI-7641
> URL: https://issues.apache.org/jira/browse/HUDI-7641
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7641] Adding metadata enablement metrics [hudi]

2024-04-18 Thread via GitHub


nsivabalan opened a new pull request, #11053:
URL: https://github.com/apache/hudi/pull/11053

   ### Change Logs
   
   Adding metrics to track which MDT partitions are enabled.
   
   ### Impact
   
   Easier for feature rollout when enabling new partitions in MDT.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7641) Add metrics to track what partitions are enabled in MDT

2024-04-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7641:
-

 Summary: Add metrics to track what partitions are enabled in MDT
 Key: HUDI-7641
 URL: https://issues.apache.org/jira/browse/HUDI-7641
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #11052:
URL: https://github.com/apache/hudi/pull/11052#discussion_r1571751773


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path,
 
     if (content.isPresent() && needTempFile) {
       StoragePath parent = path.getParent();
-      tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX);
+      tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID());

Review Comment:
   The method itself takes care of the file deletion; the original logic also has this concern, and what is worse, if a corrupt tmp file already exists, the file creation would never succeed. That is the best we can do here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #11052:
URL: https://github.com/apache/hudi/pull/11052#discussion_r1571748491


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path,
 
     if (content.isPresent() && needTempFile) {
       StoragePath parent = path.getParent();
-      tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX);
+      tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID());
       fsout = create(tmpPath, false);

Review Comment:
   Here are the Hadoop filesystem atomicity guarantees: 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem
   
   For file creation, if the overwrite parameter is false, the check and creation MUST be atomic. Since we do not hold an exclusive access lock as the invoker, a random suffix eliminates the need for a lock, because the tmp file creations never conflict. And the rename itself is atomic, so we can more or less ensure the atomicity of the file creation on HDFS.
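
   To make this concrete, here is a condensed sketch of the temp-file-plus-rename flow under discussion (not the actual body of `createImmutableFileInPath`; the `create`, `rename`, and `deleteFile` helpers are assumed to be the `HoodieStorage` primitives, and error handling is stripped down):
   
   ```java
   // Condensed sketch of the pattern; imports and the surrounding HoodieStorage class are omitted.
   void writeImmutableFile(StoragePath path, byte[] bytes) throws IOException {
     // The suffix is unique per writer, so concurrent callers never collide on the temp file
     // name and no external lock is needed around the create(tmpPath, false) call.
     StoragePath tmpPath = new StoragePath(path.getParent(), path.getName() + "." + UUID.randomUUID());
     try (OutputStream out = create(tmpPath, false)) { // check-and-create is atomic on HDFS-like stores
       out.write(bytes);
     }
     // Rename is atomic, so readers see either the complete file or no file at all.
     if (!rename(tmpPath, path)) {
       deleteFile(tmpPath); // best-effort cleanup if another writer already published the target
     }
   }
   ```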



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7498) Fix schema for HoodieTimestampAwareParquetInputFormat

2024-04-18 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7498:
--
Fix Version/s: 0.15.0

> Fix schema for HoodieTimestampAwareParquetInputFormat
> -
>
> Key: HUDI-7498
> URL: https://issues.apache.org/jira/browse/HUDI-7498
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> HoodieTimestampAwareParquetInputFormat constructs record reader using 
> HoodieAvroParquetReader which fetches schema from the parquet file in the 
> input split. It ignores hive ordering as RealtimeRecordReader does. It 
> results in ordering of fields being incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065711219

   
   ## CI report:
   
   * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065711149

   
   ## CI report:
   
   * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23355)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065711107

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * de51f5efb052c32725b5eeb97773133d8c98498f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10886:
URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065710966

   
   ## CI report:
   
   * 7b04755aa308766f3b0f0d5292ed9476630da90d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23357)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #11052:
URL: https://github.com/apache/hudi/pull/11052#discussion_r1571748491


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path,
 
     if (content.isPresent() && needTempFile) {
       StoragePath parent = path.getParent();
-      tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX);
+      tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID());
       fsout = create(tmpPath, false);

Review Comment:
   Here is the Hadoop filesystem atomicity guarantees: 
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


boneanxs commented on code in PR #11052:
URL: https://github.com/apache/hudi/pull/11052#discussion_r1571747302


##
hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java:
##
@@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path,
 
     if (content.isPresent() && needTempFile) {
       StoragePath parent = path.getParent();
-      tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX);
+      tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID());

Review Comment:
   A bit worried here: is it possible that many callers call this method to create the same file and leave many tmp files uncleaned, for example if these callers suddenly hit OOM after creating the tmp files?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10886:
URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065675635

   
   ## CI report:
   
   * 522a68cb3ea8dc725418eb9b811a03b5c86c694b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23343)
 
   * 7b04755aa308766f3b0f0d5292ed9476630da90d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23357)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10976:
URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065675751

   
   ## CI report:
   
   * db99bbcc7ede1bb1372a7996c25cfb54c1069a49 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23144)
 
   * 641e4e1885d174370cc7a4e438cc67a486a36b04 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23358)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065675798

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353)
 
   * de51f5efb052c32725b5eeb97773133d8c98498f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-18 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065674083

   We want to start re-acquiring data from the first record of the upstream Hudi table and rebuild the downstream table, but the issue is that we can't access the older data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-18 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1571706791


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException {
     this.itr = RecordIterators.getParquetRecordIterator(
         internalSchemaManager,
         utcTimestamp,
-        true,
+        caseSensetive,

Review Comment:
   Emm, but I don't know the reason for handling the false logic in ParquetFileReader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]

2024-04-18 Thread via GitHub


xuzifu666 commented on code in PR #11040:
URL: https://github.com/apache/hudi/pull/11040#discussion_r1571706791


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java:
##
@@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException {
     this.itr = RecordIterators.getParquetRecordIterator(
        internalSchemaManager,
        utcTimestamp,
-        true,
+        caseSensetive,

Review Comment:
   Emm, but I don't know the reason for handling the false logic in ParquetFileReader.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065671156

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353)
 
   * de51f5efb052c32725b5eeb97773133d8c98498f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10886:
URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065670991

   
   ## CI report:
   
   * 522a68cb3ea8dc725418eb9b811a03b5c86c694b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23343)
 
   * 7b04755aa308766f3b0f0d5292ed9476630da90d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10976:
URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065671114

   
   ## CI report:
   
   * db99bbcc7ede1bb1372a7996c25cfb54c1069a49 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23144)
 
   * 641e4e1885d174370cc7a4e438cc67a486a36b04 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065666227

   
   ## CI report:
   
   * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351)
 
   * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23355)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065666353

   
   ## CI report:
   
   * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352)
 
   * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065666160

   
   ## CI report:
   
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


wecharyu commented on code in PR #10886:
URL: https://github.com/apache/hudi/pull/10886#discussion_r1571697843


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java:
##
@@ -92,37 +94,33 @@ public int getPartitionDepth() {
 
   /**
    * Write the metadata safely into partition atomically.
+   * To avoid concurrent write into the same partition (for example in speculative case),
+   * please make sure writeToken is unique.
    */
-  public void trySave(int taskPartitionId) {
+  public void trySave(String writeToken) throws IOException {
     String extension = getMetafileExtension();
-    Path tmpMetaPath =
-        new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + taskPartitionId + extension);
     Path metaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + extension);
-    boolean metafileExists = false;
 
-    try {
-      metafileExists = fs.exists(metaPath);
-      if (!metafileExists) {
-        // write to temporary file
-        writeMetafile(tmpMetaPath);
-        // move to actual path
-        fs.rename(tmpMetaPath, metaPath);
-      }
-    } catch (IOException ioe) {
-      LOG.warn("Error trying to save partition metadata (this is okay, as long as at least 1 of these succeeded), "
-          + partitionPath, ioe);
-    } finally {
-      if (!metafileExists) {
-        try {
-          // clean up tmp file, if still lying around
-          if (fs.exists(tmpMetaPath)) {
-            fs.delete(tmpMetaPath, false);
+    // This retry mechanism enables an exit-fast in metaPath exists check, which avoid the
+    // tasks failures when there are two or more tasks trying to create the same metaPath.
+    RetryHelper retryHelper = new RetryHelper(1000, 3, 1000, IOException.class.getName())
+        .tryWith(() -> {
+          if (!fs.exists(metaPath)) {
+            if (format.isPresent()) {
+              Path tmpMetaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + writeToken + extension);
+              writeMetafileInFormat(metaPath, tmpMetaPath, format.get());
+            } else {
+              // Backwards compatible properties file format
+              try (ByteArrayOutputStream os = new ByteArrayOutputStream()) {
+                props.store(os, "partition metadata");
+                Option<byte[]> content = Option.of(os.toByteArray());
+                HadoopFSUtils.createImmutableFileInPath(fs, metaPath, content, true, "_" + writeToken);

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]

2024-04-18 Thread via GitHub


the-other-tim-brown commented on code in PR #10976:
URL: https://github.com/apache/hudi/pull/10976#discussion_r1571665073


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi
    */
   protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) {
     this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline();
+    this.timelineHashAndPendingReplaceInstants = null;
+  }
+
+  /**
+   * Get a list of pending replace instants. Caches the result for the active timeline.
+   * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called.
+   *
+   * @return list of pending replace instant timestamps
+   */
+  private List<String> getPendingReplaceInstants() {
+    HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline();

Review Comment:
   @danny0405 I looked at the `ActiveTimeline` class and the `DefaultTimeline` 
class but don't see any instances where there is a cache of instants for a 
particular action type. Can you point me in the right direction?
   
   For this change, it is important that it is a `Set` in the end so we 
don't need to keep recreating this set on each iteration.
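
   As a tiny illustration of the `Set` point (the surrounding names here are hypothetical, not the PR's code), building the set once turns every membership check into an O(1) probe instead of an O(n) scan of a `List`:
   
   ```java
   // Hypothetical usage: build the Set once per timeline refresh, then probe it per candidate.
   Set<String> pendingReplaceSet = new HashSet<>(getPendingReplaceInstants()); // built once
   for (String instantTime : candidateInstantTimes) {   // hypothetical loop over candidates
     if (pendingReplaceSet.contains(instantTime)) {     // O(1) lookup instead of List.contains()
       // exclude this file group / instant from the view, etc.
     }
   }
   ```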



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065630212

   
   ## CI report:
   
   * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352)
 
   * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065630041

   
   ## CI report:
   
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209)
 
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   * d2f4d099595879917fbefa3bc467e37be5ec4f24 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065630105

   
   ## CI report:
   
   * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351)
 
   * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


the-other-tim-brown commented on code in PR #11008:
URL: https://github.com/apache/hudi/pull/11008#discussion_r1571650091


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java:
##
@@ -89,10 +89,13 @@ protected List<SmallFile> getSmallFiles(String partitionPath) {
   private List<SmallFile> getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) {
     // If we can index log files, we can add more inserts to log files for fileIds NOT including those under
     // pending compaction
+    Comparator<FileSlice> comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice))
+        .thenComparing(FileSlice::getFileId);
     if (table.getIndex().canIndexLogFiles()) {
       return table.getSliceView()
           .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
           .filter(this::isSmallFile)
+          .sorted(comparator)
           .collect(Collectors.toList());

Review Comment:
   I've been trying, but I do not understand why these tests are coupled to small file handling. I think the testing approach is a bit strange in relying on that feature to test something in the delta streamer, for example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065624283

   
   ## CI report:
   
   * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352)
 
   * 65e618892aee85f4d0ac97b21d2f2eec8a98446b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11001:
URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065623781

   
   ## CI report:
   
   * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209)
 
   * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #11008:
URL: https://github.com/apache/hudi/pull/11008#discussion_r1571644343


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java:
##
@@ -89,10 +89,13 @@ protected List<SmallFile> getSmallFiles(String partitionPath) {
   private List<SmallFile> getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) {
     // If we can index log files, we can add more inserts to log files for fileIds NOT including those under
     // pending compaction
+    Comparator<FileSlice> comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice))
+        .thenComparing(FileSlice::getFileId);
     if (table.getIndex().canIndexLogFiles()) {
       return table.getSliceView()
           .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
           .filter(this::isSmallFile)
+          .sorted(comparator)
           .collect(Collectors.toList());

Review Comment:
   Looks like the test failure will take some time to fix; is it easy to make the tests deterministic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #11047:
URL: https://github.com/apache/hudi/pull/11047#discussion_r1571641639


##
hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:
##
@@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException {
   }
 
   @Override
-  public List<StoragePathInfo> listDirectEntries(StoragePath path) throws IOException {
+  public List<StoragePathInfo> listDirectory(StoragePath path) throws IOException {

Review Comment:
   Yeah, that's a good point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065618201

   It should work for the option `'read.start-commit'='earliest',`; what is the current behavior now, consuming from the latest commit or from a very specific one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #10886:
URL: https://github.com/apache/hudi/pull/10886#discussion_r1571623276


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java:
##
@@ -92,11 +92,12 @@ public int getPartitionDepth() {
 
   /**
    * Write the metadata safely into partition atomically.
+   * To avoid concurrent write into the same partition (for example in speculative case),
+   * please make sure writeToken is unique.
    */
-  public void trySave(int taskPartitionId) {
+  public void trySave(String writeToken) throws IOException {

Review Comment:
   I have filed a fix for the random suffix: https://github.com/apache/hudi/pull/11052. With this change, I think we can get rid of the `writeToken` param that is being passed around. We can use the `RetryHelper` for the file creation like this:
   
   ```java
   RetryHelper.doRetry {
     if (file does not exist) {
       storage.createImmutableFileInPath(file, content);
     }
   }
   ```
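
   Spelled out a bit more, the sketch above could take roughly the following shape, borrowing the `RetryHelper` usage from the diff earlier in this thread (the constructor arguments and the `tryWith`/`start` calls follow that diff, and `storage`, `metaPath`, and `content` are assumed to be in scope, so treat the exact signatures as approximate):
   
   ```java
   // Approximate shape only; the real RetryHelper/HoodieStorage signatures may differ slightly.
   RetryHelper retryHelper = new RetryHelper(1000, 3, 1000, IOException.class.getName())
       .tryWith(() -> {
         if (!storage.exists(metaPath)) {                        // exit fast once another writer has succeeded
           storage.createImmutableFileInPath(metaPath, content); // UUID-suffixed temp file + atomic rename
         }
         return null;
       });
   retryHelper.start();
   ```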



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


the-other-tim-brown commented on code in PR #11008:
URL: https://github.com/apache/hudi/pull/11008#discussion_r1571633643


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java:
##
@@ -89,10 +89,13 @@ protected List<SmallFile> getSmallFiles(String partitionPath) {
   private List<SmallFile> getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) {
     // If we can index log files, we can add more inserts to log files for fileIds NOT including those under
     // pending compaction
+    Comparator<FileSlice> comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice))
+        .thenComparing(FileSlice::getFileId);
     if (table.getIndex().canIndexLogFiles()) {
       return table.getSliceView()
           .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false)
           .filter(this::isSmallFile)
+          .sorted(comparator)
           .collect(Collectors.toList());

Review Comment:
   @danny0405 the failing test is flaky. It is trying to test some Spark exception, but it is non-deterministic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7640:

Reviewers: Ethan Guo, Hui An  (was: Hui An)

> Uses UUID as temporary file suffix for 
> HoodieWrapperFileSystem.createImmutableFileInPath
> 
>
> Key: HUDI-7640
> URL: https://issues.apache.org/jira/browse/HUDI-7640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7625) Avoid unnecessary rewrite for metadata table

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7625.
---
Resolution: Fixed

> Avoid unnecessary rewrite for metadata table
> 
>
> Key: HUDI-7625
> URL: https://issues.apache.org/jira/browse/HUDI-7625
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #10886:
URL: https://github.com/apache/hudi/pull/10886#discussion_r1571623276


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java:
##
@@ -92,11 +92,12 @@ public int getPartitionDepth() {
 
   /**
    * Write the metadata safely into partition atomically.
+   * To avoid concurrent write into the same partition (for example in speculative case),
+   * please make sure writeToken is unique.
    */
-  public void trySave(int taskPartitionId) {
+  public void trySave(String writeToken) throws IOException {

Review Comment:
   I have filed a fix for the random suffix: 
https://github.com/apache/hudi/pull/11052



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065575462

   
   ## CI report:
   
   * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11052:
URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065570145

   
   ## CI report:
   
   * 569e14e31d4b352dec8ef4e73c59574c70791056 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath

2024-04-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7640:
-
Labels: pull-request-available  (was: )

> Uses UUID as temporary file suffix for 
> HoodieWrapperFileSystem.createImmutableFileInPath
> 
>
> Key: HUDI-7640
> URL: https://issues.apache.org/jira/browse/HUDI-7640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]

2024-04-18 Thread via GitHub


danny0405 opened a new pull request, #11052:
URL: https://github.com/apache/hudi/pull/11052

   ### Change Logs
   
   Always uses a UUID as the temporary file suffix so that the method can be thread-safe.
   Also moves the method to `HadoopFSUtils` as a static utility method.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065564649

   
   ## CI report:
   
   * 89078f34a2dafff26d47d8a201a59d8bf8a540ba Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23350)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7640:
-
Status: In Progress  (was: Open)

> Uses UUID as temporary file suffix for 
> HoodieWrapperFileSystem.createImmutableFileInPath
> 
>
> Key: HUDI-7640
> URL: https://issues.apache.org/jira/browse/HUDI-7640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath

2024-04-18 Thread Danny Chen (Jira)
Danny Chen created HUDI-7640:


 Summary: Uses UUID as temporary file suffix for 
HoodieWrapperFileSystem.createImmutableFileInPath
 Key: HUDI-7640
 URL: https://issues.apache.org/jira/browse/HUDI-7640
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.15.0, 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7640:
-
Sprint: Sprint 2024-03-25

> Uses UUID as temporary file suffix for 
> HoodieWrapperFileSystem.createImmutableFileInPath
> 
>
> Key: HUDI-7640
> URL: https://issues.apache.org/jira/browse/HUDI-7640
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on code in PR #10886:
URL: https://github.com/apache/hudi/pull/10886#discussion_r1571564604


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java:
##
@@ -92,37 +94,33 @@ public int getPartitionDepth() {
 
   /**
* Write the metadata safely into partition atomically.
+   * To avoid concurrent write into the same partition (for example in 
speculative case),
+   * please make sure writeToken is unique.
*/
-  public void trySave(int taskPartitionId) {
+  public void trySave(String writeToken) throws IOException {
 String extension = getMetafileExtension();
-Path tmpMetaPath =
-new Path(partitionPath, 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + 
taskPartitionId + extension);
 Path metaPath = new Path(partitionPath, 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + extension);
-boolean metafileExists = false;
 
-try {
-  metafileExists = fs.exists(metaPath);
-  if (!metafileExists) {
-// write to temporary file
-writeMetafile(tmpMetaPath);
-// move to actual path
-fs.rename(tmpMetaPath, metaPath);
-  }
-} catch (IOException ioe) {
-  LOG.warn("Error trying to save partition metadata (this is okay, as long 
as at least 1 of these succeeded), "
-  + partitionPath, ioe);
-} finally {
-  if (!metafileExists) {
-try {
-  // clean up tmp file, if still lying around
-  if (fs.exists(tmpMetaPath)) {
-fs.delete(tmpMetaPath, false);
+// This retry mechanism enables an exit-fast in metaPath exists check, 
which avoid the
+// tasks failures when there are two or more tasks trying to create the 
same metaPath.
+RetryHelper  retryHelper = new RetryHelper(1000, 3, 
1000, IOException.class.getName())
+.tryWith(() -> {
+  if (!fs.exists(metaPath)) {
+if (format.isPresent()) {
+  Path tmpMetaPath = new Path(partitionPath, 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + writeToken + 
extension);
+  writeMetafileInFormat(metaPath, tmpMetaPath, format.get());
+} else {
+  // Backwards compatible properties file format
+  try (ByteArrayOutputStream os = new ByteArrayOutputStream()) {
+props.store(os, "partition metadata");
+Option content = Option.of(os.toByteArray());
+HadoopFSUtils.createImmutableFileInPath(fs, metaPath, content, 
true, "_" + writeToken);

Review Comment:
   Can we eliminate the writeToken and just use 
`HadoopFSUtils.createImmutableFileInPath` to write the files directly? You may 
need to refactor the method `HadoopFSUtils.createImmutableFileInPath` a 
little to use a UUID as the temporary suffix too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7515) Fix partition metadata write failure

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7515:
-
Status: Patch Available  (was: In Progress)

> Fix partition metadata write failure
> 
>
> Key: HUDI-7515
> URL: https://issues.apache.org/jira/browse/HUDI-7515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: screenshot-1.png
>
>
> Avoid failing to write partition metadata. When spark.speculation is enabled, 
> if the write metadata operation becomes slow for some reason, a speculative 
> task will be started to write the same metadata file concurrently.
> In HDFS, two tasks (e.g., one being a speculative task) writing to the same file 
> could both throw an exception like so:
> {code:bash}
> File does not exist: 
> /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode 
> 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any 
> open files.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7515) Fix partition metadata write failure

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7515:
-
Sprint: Sprint 2024-03-25

> Fix partition metadata write failure
> 
>
> Key: HUDI-7515
> URL: https://issues.apache.org/jira/browse/HUDI-7515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: screenshot-1.png
>
>
> Avoid failing to write partition metadata. When spark.speculation is enabled, 
> if the write metadata operation becomes slow for some reason, a speculative 
> task will be started to write the same metadata file concurrently.
> In HDFS, two tasks (e.g., one being a speculative task) writing to the same file 
> could both throw an exception like so:
> {code:bash}
> File does not exist: 
> /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode 
> 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any 
> open files.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7515) Fix partition metadata write failure

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7515:
-
Status: In Progress  (was: Open)

> Fix partition metadata write failure
> 
>
> Key: HUDI-7515
> URL: https://issues.apache.org/jira/browse/HUDI-7515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: screenshot-1.png
>
>
> Avoid failing to write partition metadata. When spark.speculation is enabled, 
> if the write metadata operation becomes slow for some reason, a speculative 
> task will be started to write the same metadata file concurrently.
> In HDFS, two tasks (e.g., one being a speculative task) writing to the same file 
> could both throw an exception like so:
> {code:bash}
> File does not exist: 
> /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode 
> 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any 
> open files.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7515) Fix partition metadata write failure

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7515:
-
Fix Version/s: 0.15.0
   1.0.0

> Fix partition metadata write failure
> 
>
> Key: HUDI-7515
> URL: https://issues.apache.org/jira/browse/HUDI-7515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
> Attachments: screenshot-1.png
>
>
> Avoid failing to write partition metadata. When spark.speculation is enabled, 
> if the write metadata operation becomes slow for some reason, a speculative 
> task will be started to write the same metadata file concurrently.
> In HDFS, two tasks (e.g., one being a speculative task) writing to the same file 
> could both throw an exception like so:
> {code:bash}
> File does not exist: 
> /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode 
> 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any 
> open files.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7515) Fix partition metadata write failure

2024-04-18 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-7515:


Assignee: Danny Chen

> Fix partition metadata write failure
> 
>
> Key: HUDI-7515
> URL: https://issues.apache.org/jira/browse/HUDI-7515
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wechar
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> Avoid failing to write partition metadata. When spark.speculation is enabled, 
> if the write metadata operation becomes slow for some reason, a speculative 
> task will be started to write the same metadata file concurrently.
> In HDFS, two tasks (e.g., one being a speculative task) writing to the same file 
> could both throw an exception like so:
> {code:bash}
> File does not exist: 
> /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode 
> 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any 
> open files.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]

2024-04-18 Thread via GitHub


danny0405 commented on issue #10914:
URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065542560

   You may need to read this doc first: 
https://www.yuque.com/yuzhao-my9fz/kb/flqll8?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065517817

   
   ## CI report:
   
   * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065473833

   
   ## CI report:
   
   * e5a2713d07581824214bcc7b9321e3d1cb371c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23322)
 
   * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065473691

   
   ## CI report:
   
   * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349)
 
   * 89078f34a2dafff26d47d8a201a59d8bf8a540ba Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23350)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11008:
URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065467861

   
   ## CI report:
   
   * e5a2713d07581824214bcc7b9321e3d1cb371c02 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23322)
 
   * 82029e70eec8c77e1c64bf9f751200c6962777ec UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065467751

   
   ## CI report:
   
   * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349)
 
   * 89078f34a2dafff26d47d8a201a59d8bf8a540ba UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065415574

   
   ## CI report:
   
   * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]

2024-04-18 Thread via GitHub


ChiehFu commented on issue #10914:
URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065368620

   In addition, I found some duplicates written by my bulk_insert batch job 1 
and upsert stream job 2 (the one that had index bootstrap enabled).
   
   For the bulk_insert batch job, `write.precombine` was set to `true`, so there 
shouldn't be any duplicates in the result table, right?
   
   For the upsert stream job, `write.precombine` was set to `true` and the index 
bootstrap task parallelism was set to `480`. I found this previous issue 
https://github.com/apache/hudi/issues/4881 which suggests duplicates can happen 
when the index bootstrap task parallelism is > 1. Is that still the case in Hudi 
0.14.1? The table that needs to be index bootstrapped is large, so I am not sure 
setting the parallelism to `1` would work.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065344992

   
   ## CI report:
   
   * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348)
 
   * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065335907

   
   ## CI report:
   
   * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348)
 
   * 96a371f7fca39943737731bd18b9e52af37955e8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7588) Replace hadoop Configuration with StorageConfiguration in hudi-common module

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7588:

Status: In Progress  (was: Open)

> Replace hadoop Configuration with StorageConfiguration in hudi-common module
> 
>
> Key: HUDI-7588
> URL: https://issues.apache.org/jira/browse/HUDI-7588
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-18 Thread via GitHub


yihua closed pull request #10980: [HUDI-7578] Avoid unnecessary rewriting when 
copy old data from old base to new base file to improve compaction performance
URL: https://github.com/apache/hudi/pull/10980


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   nope. I am saying HoodieCDCExtractor.java has an inherent bug which I am 
trying to fix here. 
   
   Let's say the timeline is as follows:
   
   dc1
   dc2
   rc3
   dc4
   clean5 // cleans up data files from dc1 and dc2 since they were replaced by rc3
   
   As per master, HoodieCDCExtractor goes over the commit metadata in the 
active timeline and tries to deduce base files for the log files it finds. In this 
case, all data files from dc1 and dc2 have already been deleted by clean5, but 
HoodieCDCExtractor still tries to parse the data files from dc1 and dc2, so we 
might hit a file-not-found issue as per master. 
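   
   For context, a simplified sketch (not the actual patch) of the kind of guard 
being described here: skip file groups that a later replacecommit replaced (and a 
subsequent clean may have deleted) instead of failing on their missing base files. 
All class and method names below are illustrative:
   
```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: drop the file groups that may already have been replaced and
// cleaned before trying to resolve their base files.
final class CdcSplitGuardSketch {

  // Hypothetical inputs: file group ids referenced by commit metadata in the CDC
  // range, and the ids replaced by a later replacecommit (rc3 in the example above).
  static List<String> fileGroupsSafeToExtract(List<String> fileGroupsInRange,
                                              Set<String> replacedFileGroupIds) {
    return fileGroupsInRange.stream()
        // dc1/dc2 file groups replaced by rc3 may already be cleaned up by clean5,
        // so skip them rather than attempting to read their base files.
        .filter(fgId -> !replacedFileGroupIds.contains(fgId))
        .collect(Collectors.toList());
  }
}
```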
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on code in PR #10915:
URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java:
##
@@ -114,6 +114,24 @@ public Map> 
extractCDCFileSplits() {
 ValidationUtils.checkState(commits != null, "Empty commits");
 
 Map> fgToCommitChanges = new 
HashMap<>();
+

Review Comment:
   nope. I am saying HoodieCDCExtractor.java has an inherent bug which I am 
trying to fix here. 
   
   Let's say the timeline is as follows:
   
   dc1
   dc2
   rc3
   dc4
   clean5 // cleans up data files from dc1 and dc2 since they were replaced by rc3
   
   As per master, HoodieCDCExtractor goes over the commit metadata in the 
active timeline and tries to deduce base files for the log files it finds. In this 
case, all data files from dc1 and dc2 have already been deleted by clean5, and so 
we might hit a file-not-found issue as per master. 
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064993735

   
   ## CI report:
   
   * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064969766

   
   ## CI report:
   
   * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23347)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064968952

   
   ## CI report:
   
   * 72e09f67466fcfa61b0ec555fc2eecfa52fbb856 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23285)
 
   * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7635.
---
Resolution: Fixed

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7636) Make StoragePath Serializable

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7636.
---
Resolution: Fixed

> Make StoragePath Serializable
> -
>
> Key: HUDI-7636
> URL: https://issues.apache.org/jira/browse/HUDI-7636
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7637) Make StoragePathInfo Comparable

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-7637.
---
Resolution: Fixed

> Make StoragePathInfo Comparable
> ---
>
> Key: HUDI-7637
> URL: https://issues.apache.org/jira/browse/HUDI-7637
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7639:

Sprint: Sprint 2024-03-25

> Refactor HoodieFileIndex so that different indexes can be used via optimizer 
> rules
> --
>
> Key: HUDI-7639
> URL: https://issues.apache.org/jira/browse/HUDI-7639
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently, `HoodieFileIndex` is responsible for partition pruning as well as 
> file skipping. All indexes are being used in 
> [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333]
>  method through if-else branches. This is not only hard to maintain as we add 
> more indexes, but also induces a static hierarchy. Instead, we need more 
> flexibility so that we can alter logical plan based on availability of 
> indexes. For partition pruning in Spark, we already have 
> [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40]
>  rule but it is injected during the operator optimization batch and it does 
> not modify the result of the LogicalPlan. To be fully extensible, we should 
> be able to rewrite the LogicalPlan. We should be able to inject rules after 
> partition pruning after the operator optimization batch and before any CBO 
> rules that depend on stats. Spark provides 
> [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304]
>  API to do so, however it is only available in Spark 3.1.0 onwards.
> The goal of this ticket is to refactor index hierarchy and create new rules 
> such that Spark version < 3.1.0 still go via the old path, while later 
> versions can modify the plan using an appropriate index and inject as a 
> pre-CBO rule.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7633:

Sprint: Sprint 2024-03-25

> Use try with resources for AutoCloseable
> 
>
> Key: HUDI-7633
> URL: https://issues.apache.org/jira/browse/HUDI-7633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7635:

Sprint: Sprint 2024-03-25

> Add default block size and openSeekable APIs to HoodieStorage
> -
>
> Key: HUDI-7635
> URL: https://issues.apache.org/jira/browse/HUDI-7635
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7638) Add metrics to HoodieStorage implementation that is not hadoop-dependent

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7638:

Sprint: Sprint 2024-03-25

> Add metrics to HoodieStorage implementation that is not hadoop-dependent
> 
>
> Key: HUDI-7638
> URL: https://issues.apache.org/jira/browse/HUDI-7638
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7637:

Sprint: Sprint 2024-03-25

> Make StoragePathInfo Comparable
> ---
>
> Key: HUDI-7637
> URL: https://issues.apache.org/jira/browse/HUDI-7637
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7634:

Sprint: Sprint 2024-03-25

> Rename HoodieStorage APIs
> -
>
> Key: HUDI-7634
> URL: https://issues.apache.org/jira/browse/HUDI-7634
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> getHoodieStorage -> getStorage
> listDirectEntries -> listDirectory



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7636) Make StoragePath Serializable

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7636:

Sprint: Sprint 2024-03-25

> Make StoragePath Serializable
> -
>
> Key: HUDI-7636
> URL: https://issues.apache.org/jira/browse/HUDI-7636
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6497) Replace FileSystem, Path, and FileStatus usage in hudi-common

2024-04-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6497.
---
Resolution: Fixed

> Replace FileSystem, Path, and FileStatus usage in hudi-common
> -
>
> Key: HUDI-6497
> URL: https://issues.apache.org/jira/browse/HUDI-6497
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [Inquiry] Does HoodieIndexer can Do Indexing for RLI Async Fashion [hudi]

2024-04-18 Thread via GitHub


soumilshah1995 commented on issue #10815:
URL: https://github.com/apache/hudi/issues/10815#issuecomment-2064849319

   Thanks for the heads up, guys 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064823328

   
   ## CI report:
   
   * 98f4d4d4b61df443ca8c46078d921919f83e8595 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23324)
 
   * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23347)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064822612

   
   ## CI report:
   
   * 72e09f67466fcfa61b0ec555fc2eecfa52fbb856 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23285)
 
   * 24be89663d1b95cf7db83dd39378a675a54b98fc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-04-18 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064800265

   
   ## CI report:
   
   * 98f4d4d4b61df443ca8c46078d921919f83e8595 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23324)
 
   * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]

2024-04-18 Thread via GitHub


nsivabalan commented on PR #11047:
URL: https://github.com/apache/hudi/pull/11047#issuecomment-2064791962

   #getHoodieStorage -> #getStorage
   since this is in tests, I am ok with it. If it had been in source code, we should 
align the method name w/ the class name. 
   
   ok w/ the patch. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571170811


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -242,7 +252,44 @@ protected Pair, Schema> 
getRecordsIterator(HoodieDataBlock d
 } else {
   blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext);
 }
-return Pair.of(blockRecordsIterator, dataBlock.getSchema());
+Option, Schema>> schemaEvolutionTransformerOpt =

Review Comment:
   Just the buffers for now. 
   It's possible. IDK what the perf hit would be to read the file footer twice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571167483


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -706,6 +709,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
   }
 
   test("Test schema auto evolution") {
+//This test will be flakey for mor until [HUDI-6798] is landed and we can 
set the merge mode

Review Comment:
   Looks like this test was fixed when the default payload was changed. Now it 
should work fine with the fg reader



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571162253


##
hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/spark/execution/datasources/parquet/TestHoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -34,63 +32,17 @@ class TestHoodieFileGroupReaderBasedParquetFileFormat 
extends SparkClientFunctio
   IsNotNull("non_key_column"),
   EqualTo("non_key_column", 1)
 )
-val filtersWithoutKeyColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters(
+val filtersWithoutKeyColumn = 
SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters(
   filters, "key_column");
 assertEquals(0, filtersWithoutKeyColumn.size)
 
 val filtersWithKeys = Seq(
   EqualTo("key_column", 1),
   GreaterThan("non_key_column", 2)
 )
-val filtersWithKeyColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters(
+val filtersWithKeyColumn = 
SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters(
   filtersWithKeys, "key_column")
 assertEquals(1, filtersWithKeyColumn.size)
 assertEquals("key_column", filtersWithKeyColumn.head.references.head)
   }
-
-  @Test
-  def testGetAppliedRequiredSchema(): Unit = {
-val fields = Array(
-  StructField("column_a", LongType, nullable = false),
-  StructField("column_b", StringType, nullable = false))
-val requiredSchema = StructType(fields)
-
-val appliedSchema: StructType = 
HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema(
-  requiredSchema, shouldUseRecordPosition = true, "row_index")
-if (HoodieSparkUtils.gteqSpark3_5) {
-  assertEquals(3, appliedSchema.fields.length)
-} else {
-  assertEquals(2, appliedSchema.fields.length)
-}
-
-val schemaWithoutRowIndexColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema(
-  requiredSchema, shouldUseRecordPosition = false, "row_index")
-assertEquals(2, schemaWithoutRowIndexColumn.fields.length)
-  }
-
-  @Test
-  def testGetAppliedFilters(): Unit = {
-val filters = Seq(

Review Comment:
   No. They don't make sense anymore.



##
hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/spark/execution/datasources/parquet/TestHoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -34,63 +32,17 @@ class TestHoodieFileGroupReaderBasedParquetFileFormat 
extends SparkClientFunctio
   IsNotNull("non_key_column"),
   EqualTo("non_key_column", 1)
 )
-val filtersWithoutKeyColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters(
+val filtersWithoutKeyColumn = 
SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters(
   filters, "key_column");
 assertEquals(0, filtersWithoutKeyColumn.size)
 
 val filtersWithKeys = Seq(
   EqualTo("key_column", 1),
   GreaterThan("non_key_column", 2)
 )
-val filtersWithKeyColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters(
+val filtersWithKeyColumn = 
SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters(
   filtersWithKeys, "key_column")
 assertEquals(1, filtersWithKeyColumn.size)
 assertEquals("key_column", filtersWithKeyColumn.head.references.head)
   }
-
-  @Test
-  def testGetAppliedRequiredSchema(): Unit = {
-val fields = Array(
-  StructField("column_a", LongType, nullable = false),
-  StructField("column_b", StringType, nullable = false))
-val requiredSchema = StructType(fields)
-
-val appliedSchema: StructType = 
HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema(
-  requiredSchema, shouldUseRecordPosition = true, "row_index")
-if (HoodieSparkUtils.gteqSpark3_5) {
-  assertEquals(3, appliedSchema.fields.length)
-} else {
-  assertEquals(2, appliedSchema.fields.length)
-}
-
-val schemaWithoutRowIndexColumn = 
HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema(
-  requiredSchema, shouldUseRecordPosition = false, "row_index")
-assertEquals(2, schemaWithoutRowIndexColumn.fields.length)
-  }
-
-  @Test
-  def testGetAppliedFilters(): Unit = {
-val filters = Seq(

Review Comment:
   No. They no longer make sense to have.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571163565


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -138,6 +138,7 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
   spark.sessionState.catalog.dropTable(TableIdentifier(tableName), 
true, true)
   spark.sessionState.catalog.refreshTable(TableIdentifier(tableName))
   
spark.sessionState.conf.unsetConf(DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION.key)
+  spark.sessionState.conf.unsetConf("spark.sql.storeAssignmentPolicy")

Review Comment:
   this config was leaking between tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571161187


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -129,20 +138,15 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   file.partitionValues match {
 // Snapshot or incremental queries.
 case fileSliceMapping: HoodiePartitionFileSliceMapping =>
-  val filePath = 
sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)
-  val filegroupName = if (FSUtils.isLogFile(filePath)) {
-FSUtils.getFileId(filePath.getName).substring(1)
-  } else {
-FSUtils.getFileId(filePath.getName)
-  }
+  val filegroupName = FSUtils.getFileIdFromFilePath(sparkAdapter
+.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file))
   fileSliceMapping.getSlice(filegroupName) match {
 case Some(fileSlice) if !isCount =>
   if (requiredSchema.isEmpty && 
!fileSlice.getLogFiles.findAny().isPresent) {
 val hoodieBaseFile = fileSlice.getBaseFile.get()
 
baseFileReader(createPartitionedFile(fileSliceMapping.getPartitionValues, 
hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen))
   } else {
-val readerContext: HoodieReaderContext[InternalRow] = new 
SparkFileFormatInternalRowReaderContext(
-  readerMaps)
+val readerContext = new 
SparkFileFormatInternalRowReaderContext(parquetFileReader.value, 
tableState.recordKeyField, filters)

Review Comment:
   We have also not moved any of the filter logic into the fg reader. That is 
still handled just by the Spark context for now.
   
   tableState.recordKeyField is used only for filter purposes.
   
   We cannot get rid of the parquetFileReader param. It contains only 
Spark-specific configs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571157403


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -129,20 +138,15 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   file.partitionValues match {
 // Snapshot or incremental queries.
 case fileSliceMapping: HoodiePartitionFileSliceMapping =>
-  val filePath = 
sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)
-  val filegroupName = if (FSUtils.isLogFile(filePath)) {
-FSUtils.getFileId(filePath.getName).substring(1)
-  } else {
-FSUtils.getFileId(filePath.getName)
-  }
+  val filegroupName = FSUtils.getFileIdFromFilePath(sparkAdapter
+.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file))
   fileSliceMapping.getSlice(filegroupName) match {
 case Some(fileSlice) if !isCount =>
   if (requiredSchema.isEmpty && 
!fileSlice.getLogFiles.findAny().isPresent) {
 val hoodieBaseFile = fileSlice.getBaseFile.get()
 
baseFileReader(createPartitionedFile(fileSliceMapping.getPartitionValues, 
hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen))
   } else {
-val readerContext: HoodieReaderContext[InternalRow] = new 
SparkFileFormatInternalRowReaderContext(
-  readerMaps)
+val readerContext = new 
SparkFileFormatInternalRowReaderContext(parquetFileReader.value, 
tableState.recordKeyField, filters)

Review Comment:
   The intention is to avoid duplicating the logic for creating the readers and 
then passing in a map of already created readers. Now, we create the readers on 
the executor. 
   
   Before, we called buildReaderWithPartitionValues() a few times and kept a map 
from schema hash to PartitionedFile => Iterator[InternalRow].
   
   Now we have a reader that we call read on in the executor, and we can pass in 
the schema and filters that we want along with the file. 
   
   We have removed the limitation that the schema and filters be known in the driver.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-18 Thread via GitHub


jonvex commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1571152424


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -107,19 +112,23 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
 val dataSchema = 
StructType(tableSchema.structTypeSchema.fields.filterNot(f => 
partitionColumns.contains(f.name)))
 val outputSchema = StructType(requiredSchema.fields ++ 
partitionSchema.fields)
 spark.conf.set("spark.sql.parquet.enableVectorizedReader", 
supportBatchResult)
-val requiredSchemaWithMandatory = 
generateRequiredSchemaWithMandatory(requiredSchema, dataSchema, partitionSchema)
-val isCount = requiredSchemaWithMandatory.isEmpty
-val requiredSchemaSplits = requiredSchemaWithMandatory.fields.partition(f 
=> HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name))
-val requiredMeta = StructType(requiredSchemaSplits._1)
-val requiredWithoutMeta = StructType(requiredSchemaSplits._2)
+val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental
 val augmentedHadoopConf = FSUtils.buildInlineConf(hadoopConf)
-val (baseFileReader, preMergeBaseFileReader, readerMaps, cdcFileReader) = 
buildFileReaders(
-  spark, dataSchema, partitionSchema, requiredSchema, filters, options, 
augmentedHadoopConf,
-  requiredSchemaWithMandatory, requiredWithoutMeta, requiredMeta)
+setSchemaEvolutionConfigs(augmentedHadoopConf, options)
+val baseFileReader = super.buildReaderWithPartitionValues(spark, 
dataSchema, partitionSchema, requiredSchema,
+  filters ++ requiredFilters, options, new 
Configuration(augmentedHadoopConf))
+val cdcFileReader = super.buildReaderWithPartitionValues(
+  spark,
+  tableSchema.structTypeSchema,
+  StructType(Nil),
+  tableSchema.structTypeSchema,
+  Nil,
+  options,
+  new Configuration(hadoopConf))
 
 val requestedAvroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(requiredSchema, 
sanitizedTableName)
 val dataAvroSchema = 
AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, 
sanitizedTableName)
-
+val parquetFileReader = 
spark.sparkContext.broadcast(sparkAdapter.createParquetFileReader(supportBatchResult,
 spark.sessionState.conf, options, augmentedHadoopConf))

Review Comment:
   No. The Spark confs don't make it to the executors, if you remember. 
Instantiating the reader just captures the values of the configs we need so that 
we can send them to the executor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


