Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]
nsivabalan commented on code in PR #10763: URL: https://github.com/apache/hudi/pull/10763#discussion_r1571806302 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.table.action.commit; + +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.config.HoodieWriteConfig; + +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Iterator; +import java.util.concurrent.atomic.AtomicLong; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * Util class to assist with fetching average record size. 
+ */ +public class AverageRecordSizeUtils { + private static final Logger LOG = LoggerFactory.getLogger(AverageRecordSizeUtils.class); + + /** + * Obtains the average record size based on records written during previous commits. Used for estimating how many + * records pack into one file. + */ + static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig hoodieWriteConfig) { +long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate(); +long fileSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit()); +try { Review Comment: gotcha. makes sense. will address it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7641] Adding metadata enablement metrics [hudi]
hudi-bot commented on PR #11053: URL: https://github.com/apache/hudi/pull/11053#issuecomment-2065773139 ## CI report: * 3f7d727e83f05cb5ce7f9a3da2bfffca72686345 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23359) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7429] Fixing average record size estimation for delta commits [hudi]
the-other-tim-brown commented on code in PR #10763: URL: https://github.com/apache/hudi/pull/10763#discussion_r1571784969 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/AverageRecordSizeUtils.java: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.table.action.commit; + +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.config.HoodieWriteConfig; + +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Iterator; +import java.util.concurrent.atomic.AtomicLong; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * Util class to assist with fetching average record size. 
+ */ +public class AverageRecordSizeUtils { + private static final Logger LOG = LoggerFactory.getLogger(AverageRecordSizeUtils.class); + + /** + * Obtains the average record size based on records written during previous commits. Used for estimating how many + * records pack into one file. + */ + static long averageBytesPerRecord(HoodieTimeline commitTimeline, HoodieWriteConfig hoodieWriteConfig) { +long avgSize = hoodieWriteConfig.getCopyOnWriteRecordSizeEstimate(); +long fileSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit()); +try { Review Comment: @nsivabalan I think this try/catch should be done at the instant parsing level (line 59). If there is only a single failure to read the commit metadata then we should still attempt to use the other commits' metadata -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
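The fix suggested above — catching parse failures per instant so one unreadable commit's metadata does not discard the rest — can be sketched as below. This is a hedged, standalone illustration, not Hudi's actual code: `CommitStats` is a hypothetical stand-in for `HoodieCommitMetadata`, and `defaultSize` plays the role of `getCopyOnWriteRecordSizeEstimate()`.

```java
import java.util.List;

// Hypothetical stand-in for HoodieCommitMetadata; not Hudi's real API.
class CommitStats {
    private final long bytesWritten;
    private final long recordsWritten;
    private final boolean corrupt; // simulates an unreadable commit metadata file

    CommitStats(long bytesWritten, long recordsWritten, boolean corrupt) {
        this.bytesWritten = bytesWritten;
        this.recordsWritten = recordsWritten;
        this.corrupt = corrupt;
    }

    long bytesWritten() {
        if (corrupt) throw new IllegalStateException("failed to parse commit metadata");
        return bytesWritten;
    }

    long recordsWritten() {
        if (corrupt) throw new IllegalStateException("failed to parse commit metadata");
        return recordsWritten;
    }
}

public class AverageRecordSizeSketch {
    // Averages bytes per record across commits; a failure to read one commit's
    // metadata only skips that commit instead of aborting the whole estimate.
    static long averageBytesPerRecord(List<CommitStats> commits, long defaultSize) {
        long totalBytes = 0;
        long totalRecords = 0;
        for (CommitStats commit : commits) {
            try {
                long bytes = commit.bytesWritten();
                long records = commit.recordsWritten();
                totalBytes += bytes;
                totalRecords += records;
            } catch (RuntimeException e) {
                // Skip just this instant; keep averaging over the others.
            }
        }
        return totalRecords > 0 ? totalBytes / totalRecords : defaultSize;
    }

    public static void main(String[] args) {
        List<CommitStats> commits = List.of(
            new CommitStats(1000, 10, false),
            new CommitStats(0, 0, true),      // unreadable metadata, skipped
            new CommitStats(3000, 10, false));
        System.out.println(averageBytesPerRecord(commits, 1024)); // prints 200
    }
}
```

If every commit's metadata is unreadable, the sketch falls back to the configured default size rather than failing.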
Re: [PR] [HUDI-7641] Adding metadata enablement metrics [hudi]
hudi-bot commented on PR #11053: URL: https://github.com/apache/hudi/pull/11053#issuecomment-2065732707 ## CI report: * 3f7d727e83f05cb5ce7f9a3da2bfffca72686345 UNKNOWN
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
hudi-bot commented on PR #10976: URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065732532 ## CI report: * 641e4e1885d174370cc7a4e438cc67a486a36b04 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23358)
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
danny0405 commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1571765687 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java: ## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi */ protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) { this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline(); +this.timelineHashAndPendingReplaceInstants = null; + } + + /** + * Get a list of pending replace instants. Caches the result for the active timeline. + * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called. + * + * @return list of pending replace instant timestamps + */ + private List getPendingReplaceInstants() { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); Review Comment: The cache is located in `HoodieDefaultTimeline`; both variables `instants` and `instantTimeSet` are lazily initialized caches.
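The lazy-cache pattern under discussion — compute once, reuse until `refreshTimeline` invalidates — can be sketched in isolation. Everything below is a hypothetical stand-in (a `Supplier` replaces the actual timeline scan, and `loads` is just instrumentation); it is not the real `AbstractTableFileSystemView` or `HoodieDefaultTimeline`:

```java
import java.util.List;
import java.util.function.Supplier;

public class CachedTimelineSketch {
    private final Supplier<List<String>> scanTimeline; // stands in for the real timeline scan
    private List<String> cachedPendingReplaceInstants; // null means "not computed yet"
    int loads = 0; // counts how often the expensive scan actually ran

    CachedTimelineSketch(Supplier<List<String>> scanTimeline) {
        this.scanTimeline = scanTimeline;
    }

    // Lazily computes and caches the pending replace instants.
    synchronized List<String> getPendingReplaceInstants() {
        if (cachedPendingReplaceInstants == null) {
            loads++;
            cachedPendingReplaceInstants = scanTimeline.get();
        }
        return cachedPendingReplaceInstants;
    }

    // Invalidates the cache, as refreshTimeline(...) does in the PR.
    synchronized void refreshTimeline() {
        cachedPendingReplaceInstants = null;
    }

    public static void main(String[] args) {
        CachedTimelineSketch view = new CachedTimelineSketch(() -> List.of("001", "003"));
        view.getPendingReplaceInstants();
        view.getPendingReplaceInstants(); // served from the cache
        view.refreshTimeline();
        view.getPendingReplaceInstants(); // recomputed after invalidation
        System.out.println(view.loads);   // prints 2
    }
}
```

The design point is that repeated file-system-view lookups between refreshes pay for the timeline scan only once.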
[jira] [Updated] (HUDI-7641) Add metrics to track what partitions are enabled in MDT
[ https://issues.apache.org/jira/browse/HUDI-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7641: - Labels: pull-request-available (was: ) > Add metrics to track what partitions are enabled in MDT > --- > > Key: HUDI-7641 > URL: https://issues.apache.org/jira/browse/HUDI-7641 > Project: Apache Hudi > Issue Type: Improvement > Components: metadata >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7641] Adding metadata enablement metrics [hudi]
nsivabalan opened a new pull request, #11053: URL: https://github.com/apache/hudi/pull/11053 ### Change Logs Adding metrics to track mdt partitions enabled. ### Impact Easier for feature rollout when enabling new partitions in MDT. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7641) Add metrics to track what partitions are enabled in MDT
sivabalan narayanan created HUDI-7641: - Summary: Add metrics to track what partitions are enabled in MDT Key: HUDI-7641 URL: https://issues.apache.org/jira/browse/HUDI-7641 Project: Apache Hudi Issue Type: Improvement Components: metadata Reporter: sivabalan narayanan
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
danny0405 commented on code in PR #11052: URL: https://github.com/apache/hudi/pull/11052#discussion_r1571751773 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path, if (content.isPresent() && needTempFile) { StoragePath parent = path.getParent(); -tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX); +tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID()); Review Comment: The method itself takes care of the file deletion. The original logic also has this concern, and worse: if a corrupt tmp file already exists, the file creation would never succeed. That is the best we can do here.
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
danny0405 commented on code in PR #11052: URL: https://github.com/apache/hudi/pull/11052#discussion_r1571748491 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path, if (content.isPresent() && needTempFile) { StoragePath parent = path.getParent(); -tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX); +tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID()); fsout = create(tmpPath, false); Review Comment: Here are the Hadoop filesystem atomicity guarantees: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem For file creation, if the overwrite parameter is false, the check and creation MUST be atomic. Here we do not hold an exclusive access lock as the invoker; a random suffix eliminates the need for a lock because the tmp file creation can never conflict. And the rename is itself atomic, so we can ensure the atomicity of the file creation on HDFS.
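The write-to-UUID-temp-then-rename pattern under discussion can be sketched with plain `java.nio.file`. This is an assumption-laden local-filesystem analogue of `HoodieStorage.createImmutableFileInPath`, not the actual implementation (which targets HDFS-style storage and also handles the no-temp-file path):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class ImmutableFileSketch {
    // Writes content to a uniquely suffixed temp file, then atomically renames
    // it into place. The random suffix means concurrent writers never collide
    // on the temp path; the rename decides which writer wins.
    static void createImmutableFile(Path target, byte[] content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + "." + UUID.randomUUID());
        try {
            Files.write(tmp, content); // fresh UUID-named file: creation cannot conflict
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // clean up the temp file if the rename failed
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("immutable-sketch");
        Path target = dir.resolve("00001.commit");
        createImmutableFile(target, "partition metadata".getBytes());
        System.out.println(Files.exists(target)); // prints true
    }
}
```

Note the trade-off raised in the thread: the unique suffix removes the fixed-name collision (and the stuck corrupt `.tmp` problem), at the cost that a crashed writer can leave an orphaned UUID-named file behind; the `finally` cleanup only covers failures within the same process.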
[jira] [Updated] (HUDI-7498) Fix schema for HoodieTimestampAwareParquetInputFormat
[ https://issues.apache.org/jira/browse/HUDI-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7498: -- Fix Version/s: 0.15.0 > Fix schema for HoodieTimestampAwareParquetInputFormat > - > > Key: HUDI-7498 > URL: https://issues.apache.org/jira/browse/HUDI-7498 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > HoodieTimestampAwareParquetInputFormat constructs record reader using > HoodieAvroParquetReader which fetches schema from the parquet file in the > input split. It ignores hive ordering as RealtimeRecordReader does. It > results in ordering of fields being incorrect.
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065711219 ## CI report: * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354)
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065711149 ## CI report: * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23355)
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065711107 ## CI report: * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN * de51f5efb052c32725b5eeb97773133d8c98498f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
hudi-bot commented on PR #10886: URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065710966 ## CI report: * 7b04755aa308766f3b0f0d5292ed9476630da90d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23357)
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
boneanxs commented on code in PR #11052: URL: https://github.com/apache/hudi/pull/11052#discussion_r1571747302 ## hudi-io/src/main/java/org/apache/hudi/storage/HoodieStorage.java: ## @@ -267,7 +270,7 @@ public final void createImmutableFileInPath(StoragePath path, if (content.isPresent() && needTempFile) { StoragePath parent = path.getParent(); -tmpPath = new StoragePath(parent, path.getName() + TMP_PATH_POSTFIX); +tmpPath = new StoragePath(parent, path.getName() + "." + UUID.randomUUID()); Review Comment: A bit of a worry here: is it possible that many callers invoke this method to create the same file and leave many tmp files uncleaned, for example if those callers hit an OOM right after creating the tmp files?
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
hudi-bot commented on PR #10886: URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065675635 ## CI report: * 522a68cb3ea8dc725418eb9b811a03b5c86c694b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23343) * 7b04755aa308766f3b0f0d5292ed9476630da90d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23357)
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
hudi-bot commented on PR #10976: URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065675751 ## CI report: * db99bbcc7ede1bb1372a7996c25cfb54c1069a49 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23144) * 641e4e1885d174370cc7a4e438cc67a486a36b04 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23358)
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065675798 ## CI report: * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353) * de51f5efb052c32725b5eeb97773133d8c98498f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23356)
Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]
juice411 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065674083 We want to start re-acquiring data from the first record of the upstream Hudi table and rebuild the downstream table, but the issue is that we can't access older data.
Re: [PR] [HUDI-7627] ParquetSchema clip case-sensetive need be configurable [hudi]
xuzifu666 commented on code in PR #11040: URL: https://github.com/apache/hudi/pull/11040#discussion_r1571706791 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cow/CopyOnWriteInputFormat.java: ## @@ -130,7 +133,7 @@ public void open(FileInputSplit fileSplit) throws IOException { this.itr = RecordIterators.getParquetRecordIterator( internalSchemaManager, utcTimestamp, -true, +caseSensetive, Review Comment: Emm, but I don't know the reason for handling the false logic in ParquetFileReader.
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065671156 ## CI report: * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353) * de51f5efb052c32725b5eeb97773133d8c98498f UNKNOWN
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
hudi-bot commented on PR #10886: URL: https://github.com/apache/hudi/pull/10886#issuecomment-2065670991 ## CI report: * 522a68cb3ea8dc725418eb9b811a03b5c86c694b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23343) * 7b04755aa308766f3b0f0d5292ed9476630da90d UNKNOWN
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
hudi-bot commented on PR #10976: URL: https://github.com/apache/hudi/pull/10976#issuecomment-2065671114 ## CI report: * db99bbcc7ede1bb1372a7996c25cfb54c1069a49 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23144) * 641e4e1885d174370cc7a4e438cc67a486a36b04 UNKNOWN
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065666227 ## CI report: * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351) * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23355)
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065666353 ## CI report: * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352) * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354)
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065666160 ## CI report: * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN * d2f4d099595879917fbefa3bc467e37be5ec4f24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23353)
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
wecharyu commented on code in PR #10886: URL: https://github.com/apache/hudi/pull/10886#discussion_r1571697843 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java: ## @@ -92,37 +94,33 @@ public int getPartitionDepth() { /** * Write the metadata safely into partition atomically. + * To avoid concurrent write into the same partition (for example in speculative case), + * please make sure writeToken is unique. */ - public void trySave(int taskPartitionId) { + public void trySave(String writeToken) throws IOException { String extension = getMetafileExtension(); -Path tmpMetaPath = -new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + taskPartitionId + extension); Path metaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + extension); -boolean metafileExists = false; -try { - metafileExists = fs.exists(metaPath); - if (!metafileExists) { -// write to temporary file -writeMetafile(tmpMetaPath); -// move to actual path -fs.rename(tmpMetaPath, metaPath); - } -} catch (IOException ioe) { - LOG.warn("Error trying to save partition metadata (this is okay, as long as at least 1 of these succeeded), " - + partitionPath, ioe); -} finally { - if (!metafileExists) { -try { - // clean up tmp file, if still lying around - if (fs.exists(tmpMetaPath)) { -fs.delete(tmpMetaPath, false); +// This retry mechanism enables an exit-fast in metaPath exists check, which avoid the +// tasks failures when there are two or more tasks trying to create the same metaPath. 
+RetryHelper retryHelper = new RetryHelper(1000, 3, 1000, IOException.class.getName()) +.tryWith(() -> { + if (!fs.exists(metaPath)) { +if (format.isPresent()) { + Path tmpMetaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + writeToken + extension); + writeMetafileInFormat(metaPath, tmpMetaPath, format.get()); +} else { + // Backwards compatible properties file format + try (ByteArrayOutputStream os = new ByteArrayOutputStream()) { +props.store(os, "partition metadata"); +Option content = Option.of(os.toByteArray()); +HadoopFSUtils.createImmutableFileInPath(fs, metaPath, content, true, "_" + writeToken); Review Comment: Done.
Re: [PR] [HUDI-7575] avoid repeated fetching of pending replace instants [hudi]
the-other-tim-brown commented on code in PR #10976: URL: https://github.com/apache/hudi/pull/10976#discussion_r1571665073 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java: ## @@ -140,6 +141,22 @@ protected void init(HoodieTableMetaClient metaClient, HoodieTimeline visibleActi */ protected void refreshTimeline(HoodieTimeline visibleActiveTimeline) { this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.getWriteTimeline(); +this.timelineHashAndPendingReplaceInstants = null; + } + + /** + * Get a list of pending replace instants. Caches the result for the active timeline. + * The cache is invalidated when {@link #refreshTimeline(HoodieTimeline)} is called. + * + * @return list of pending replace instant timestamps + */ + private List getPendingReplaceInstants() { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); Review Comment: @danny0405 I looked at the `ActiveTimeline` class and the `DefaultTimeline` class but don't see any instances where there is a cache of instants for a particular action type. Can you point me in the right direction? For this change, it is important that it is a `Set` in the end so we don't need to keep recreating this set on each iteration.
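[Editor's sketch] The caching pattern discussed in the review above — compute the pending-replace set lazily, reuse it across iterations, and invalidate it whenever the timeline is refreshed — can be sketched independently of Hudi's classes. All names below are illustrative stand-ins, not the actual `AbstractTableFileSystemView` code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Lazily cached set of pending replace instants; the cache is cleared on
// every timeline refresh so the next lookup recomputes it.
public class CachedPendingReplace {
    private final Supplier<List<String>> loader; // stands in for a timeline scan
    private Set<String> cached;                  // null means "not computed yet"
    int loads = 0;                               // visible for illustration only

    CachedPendingReplace(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    // Returns a Set so membership checks on each iteration are O(1)
    // and the set is not rebuilt per call.
    synchronized Set<String> getPendingReplaceInstants() {
        if (cached == null) {
            loads++;
            cached = new HashSet<>(loader.get());
        }
        return cached;
    }

    // Mirrors refreshTimeline(): invalidate so the next call recomputes.
    synchronized void refreshTimeline() {
        cached = null;
    }

    public static void main(String[] args) {
        CachedPendingReplace view =
            new CachedPendingReplace(() -> Arrays.asList("001", "002"));
        view.getPendingReplaceInstants();
        view.getPendingReplaceInstants();   // served from cache, no reload
        view.refreshTimeline();
        view.getPendingReplaceInstants();   // recomputed after refresh
        System.out.println(view.loads);     // prints 2
    }
}
```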
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065630212 ## CI report: * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352) * 65e618892aee85f4d0ac97b21d2f2eec8a98446b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065630041 ## CI report: * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209) * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN * d2f4d099595879917fbefa3bc467e37be5ec4f24 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065630105 ## CI report: * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351) * 6d5468f8ee2c27e1178dd6cb6c6cacd2d965b136 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
the-other-tim-brown commented on code in PR #11008: URL: https://github.com/apache/hudi/pull/11008#discussion_r1571650091 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java: ## @@ -89,10 +89,13 @@ protected List getSmallFiles(String partitionPath) { private List getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) { // If we can index log files, we can add more inserts to log files for fileIds NOT including those under // pending compaction +Comparator comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice)) +.thenComparing(FileSlice::getFileId); if (table.getIndex().canIndexLogFiles()) { return table.getSliceView() .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false) .filter(this::isSmallFile) + .sorted(comparator) .collect(Collectors.toList()); Review Comment: I've been trying, but I do not understand why these tests are coupled to small-file handling. The testing approach seems a bit strange: it relies on that feature to test something in Deltastreamer, for example.
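[Editor's sketch] The comparator in the diff above sorts by total file size, then breaks ties by file id so that equal-sized slices always come back in the same order across runs. A self-contained sketch of that idea, with `FileSlice` simplified to a small illustrative class (the real class and `getTotalFileSize` live in Hudi):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Deterministic ordering for small-file candidates: primary key is total
// size, secondary key (tie-breaker) is the fileId.
public class SmallFileOrdering {
    static final class Slice {
        final String fileId;
        final long totalSize;
        Slice(String fileId, long totalSize) {
            this.fileId = fileId;
            this.totalSize = totalSize;
        }
    }

    static List<Slice> sortDeterministically(List<Slice> slices) {
        Comparator<Slice> cmp = Comparator
            .comparingLong((Slice s) -> s.totalSize)
            .thenComparing(s -> s.fileId); // stable tie-break for equal sizes
        return slices.stream().sorted(cmp).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Slice> input = Arrays.asList(
            new Slice("b", 100), new Slice("a", 100), new Slice("c", 50));
        List<String> ids = sortDeterministically(input).stream()
            .map(s -> s.fileId).collect(Collectors.toList());
        System.out.println(ids); // prints [c, a, b]
    }
}
```

Without the `thenComparing` tie-break, two slices of equal size could be emitted in either order depending on the upstream stream, which is exactly the nondeterminism the PR removes.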
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieStorage.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065624283 ## CI report: * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352) * 65e618892aee85f4d0ac97b21d2f2eec8a98446b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7576] Improve efficiency of getRelativePartitionPath, reduce computation of partitionPath in AbstractTableFileSystemView [hudi]
hudi-bot commented on PR #11001: URL: https://github.com/apache/hudi/pull/11001#issuecomment-2065623781 ## CI report: * fe5ed81020fb8d974c306f61a222f9583e2dab29 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23209) * 22f01c9e071a9f92747f4af966c9f63056c7216d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
danny0405 commented on code in PR #11008: URL: https://github.com/apache/hudi/pull/11008#discussion_r1571644343 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java: ## @@ -89,10 +89,13 @@ protected List getSmallFiles(String partitionPath) { private List getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) { // If we can index log files, we can add more inserts to log files for fileIds NOT including those under // pending compaction +Comparator comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice)) +.thenComparing(FileSlice::getFileId); if (table.getIndex().canIndexLogFiles()) { return table.getSliceView() .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false) .filter(this::isSmallFile) + .sorted(comparator) .collect(Collectors.toList()); Review Comment: Looks like the test failure will take some time to fix; is it easy to make the tests deterministic?
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
danny0405 commented on code in PR #11047: URL: https://github.com/apache/hudi/pull/11047#discussion_r1571641639 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -94,7 +94,7 @@ public boolean createDirectory(StoragePath path) throws IOException { } @Override - public List listDirectEntries(StoragePath path) throws IOException { + public List listDirectory(StoragePath path) throws IOException { Review Comment: Yeah, that's a good point.
Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]
danny0405 commented on issue #11016: URL: https://github.com/apache/hudi/issues/11016#issuecomment-2065618201 It should work with the option `'read.start-commit'='earliest'`. What is the current behavior: consuming from the latest commit, or from a specific one?
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
danny0405 commented on code in PR #10886: URL: https://github.com/apache/hudi/pull/10886#discussion_r1571623276 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java: ## @@ -92,11 +92,12 @@ public int getPartitionDepth() { /** * Write the metadata safely into partition atomically. + * To avoid concurrent write into the same partition (for example in speculative case), + * please make sure writeToken is unique. */ - public void trySave(int taskPartitionId) { + public void trySave(String writeToken) throws IOException { Review Comment: I have filed a fix for the random suffix, https://github.com/apache/hudi/pull/11052. With this change, I think we can get rid of the `writeToken` param that is being passed around. We can use the `RetryHelper` for the file creation like this: ```java RetryHelper.doRetry { if (file does not exist) { storage.createImmutableFileInPath(file, content); } } ```
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
the-other-tim-brown commented on code in PR #11008: URL: https://github.com/apache/hudi/pull/11008#discussion_r1571633643 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/deltacommit/SparkUpsertDeltaCommitPartitioner.java: ## @@ -89,10 +89,13 @@ protected List getSmallFiles(String partitionPath) { private List getSmallFileCandidates(String partitionPath, HoodieInstant latestCommitInstant) { // If we can index log files, we can add more inserts to log files for fileIds NOT including those under // pending compaction +Comparator comparator = Comparator.comparing(fileSlice -> getTotalFileSize(fileSlice)) +.thenComparing(FileSlice::getFileId); if (table.getIndex().canIndexLogFiles()) { return table.getSliceView() .getLatestFileSlicesBeforeOrOn(partitionPath, latestCommitInstant.getTimestamp(), false) .filter(this::isSmallFile) + .sorted(comparator) .collect(Collectors.toList()); Review Comment: @danny0405 the failing test is flaky. It tries to test a Spark exception, but it is non-deterministic.
[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath
[ https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7640: Reviewers: Ethan Guo, Hui An (was: Hui An) > Uses UUID as temporary file suffix for > HoodieWrapperFileSystem.createImmutableFileInPath > > > Key: HUDI-7640 > URL: https://issues.apache.org/jira/browse/HUDI-7640 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7625) Avoid unnecessary rewrite for metadata table
[ https://issues.apache.org/jira/browse/HUDI-7625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7625. --- Resolution: Fixed > Avoid unnecessary rewrite for metadata table > > > Key: HUDI-7625 > URL: https://issues.apache.org/jira/browse/HUDI-7625 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
danny0405 commented on code in PR #10886: URL: https://github.com/apache/hudi/pull/10886#discussion_r1571623276 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java: ## @@ -92,11 +92,12 @@ public int getPartitionDepth() { /** * Write the metadata safely into partition atomically. + * To avoid concurrent write into the same partition (for example in speculative case), + * please make sure writeToken is unique. */ - public void trySave(int taskPartitionId) { + public void trySave(String writeToken) throws IOException { Review Comment: I have filed a fix for the random suffix, https://github.com/apache/hudi/pull/11052
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065575462 ## CI report: * 569e14e31d4b352dec8ef4e73c59574c70791056 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23352) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]
hudi-bot commented on PR #11052: URL: https://github.com/apache/hudi/pull/11052#issuecomment-2065570145 ## CI report: * 569e14e31d4b352dec8ef4e73c59574c70791056 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath
[ https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7640: - Labels: pull-request-available (was: )
[PR] [HUDI-7640] Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath [hudi]
danny0405 opened a new pull request, #11052: URL: https://github.com/apache/hudi/pull/11052 ### Change Logs Always use UUID as the temporary file suffix so that the method can be thread-safe. Also moves the method to `HadoopFSUtils` as a static utility method. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
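[Editor's sketch] The pattern this PR applies — write to a uniquely-suffixed temporary file, then rename it onto the final path — can be sketched with `java.nio`. The real `createImmutableFileInPath` works against a Hadoop `FileSystem`; the method name below is borrowed purely for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

// "Immutable file" creation: the target path only ever appears fully written,
// because content is staged in a temp file and published via rename.
public class ImmutableFileWriter {
    static void createImmutableFile(Path target, byte[] content) throws IOException {
        // A random UUID suffix makes the temp path unique per attempt, so two
        // concurrent (e.g. speculative) tasks never collide on the temp file.
        Path tmp = target.resolveSibling(target.getFileName() + "." + UUID.randomUUID());
        try {
            Files.write(tmp, content);
            // Atomic move publishes the file in one step; readers never
            // observe a half-written target.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // clean up if the rename lost a race
        }
    }
}
```

The fixed `taskPartitionId`-based suffix being replaced is not unique across attempts of the same task, which is how two writers could end up fighting over one temp file.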
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065564649 ## CI report: * 89078f34a2dafff26d47d8a201a59d8bf8a540ba Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23350) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath
[ https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7640: - Status: In Progress (was: Open)
[jira] [Created] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath
Danny Chen created HUDI-7640: Summary: Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath Key: HUDI-7640 URL: https://issues.apache.org/jira/browse/HUDI-7640 Project: Apache Hudi Issue Type: Improvement Components: core Reporter: Danny Chen Assignee: Danny Chen Fix For: 0.15.0, 1.0.0
[jira] [Updated] (HUDI-7640) Uses UUID as temporary file suffix for HoodieWrapperFileSystem.createImmutableFileInPath
[ https://issues.apache.org/jira/browse/HUDI-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7640: - Sprint: Sprint 2024-03-25
Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]
danny0405 commented on code in PR #10886: URL: https://github.com/apache/hudi/pull/10886#discussion_r1571564604 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java: ## @@ -92,37 +94,33 @@ public int getPartitionDepth() { /** * Write the metadata safely into partition atomically. + * To avoid concurrent write into the same partition (for example in speculative case), + * please make sure writeToken is unique. */ - public void trySave(int taskPartitionId) { + public void trySave(String writeToken) throws IOException { String extension = getMetafileExtension(); -Path tmpMetaPath = -new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + taskPartitionId + extension); Path metaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + extension); -boolean metafileExists = false; -try { - metafileExists = fs.exists(metaPath); - if (!metafileExists) { -// write to temporary file -writeMetafile(tmpMetaPath); -// move to actual path -fs.rename(tmpMetaPath, metaPath); - } -} catch (IOException ioe) { - LOG.warn("Error trying to save partition metadata (this is okay, as long as at least 1 of these succeeded), " - + partitionPath, ioe); -} finally { - if (!metafileExists) { -try { - // clean up tmp file, if still lying around - if (fs.exists(tmpMetaPath)) { -fs.delete(tmpMetaPath, false); +// This retry mechanism enables an exit-fast in metaPath exists check, which avoid the +// tasks failures when there are two or more tasks trying to create the same metaPath. 
+RetryHelper retryHelper = new RetryHelper(1000, 3, 1000, IOException.class.getName()) +.tryWith(() -> { + if (!fs.exists(metaPath)) { +if (format.isPresent()) { + Path tmpMetaPath = new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + "_" + writeToken + extension); + writeMetafileInFormat(metaPath, tmpMetaPath, format.get()); +} else { + // Backwards compatible properties file format + try (ByteArrayOutputStream os = new ByteArrayOutputStream()) { +props.store(os, "partition metadata"); +Option content = Option.of(os.toByteArray()); +HadoopFSUtils.createImmutableFileInPath(fs, metaPath, content, true, "_" + writeToken); Review Comment: Can we eliminate the writeToken and just use `HadoopFSUtils.createImmutableFileInPath` to write the files directly? You may need to refactor the method `HadoopFSUtils.createImmutableFileInPath` a little to use a UUID as the temporary suffix too.
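[Editor's sketch] The retry-around-IO shape suggested in the review above can be sketched generically. Hudi's actual `RetryHelper` has a different signature (initial delay, max delay, retry count, retryable exception names), so the helper below is a minimal stand-in, not the real API:

```java
import java.io.IOException;

// Retry an IO action a bounded number of times, rethrowing the last
// IOException only if every attempt fails.
public class RetryCreate {
    interface IOAction {
        void run() throws IOException;
    }

    static void doWithRetry(int maxAttempts, IOAction action) throws IOException {
        IOException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                action.run(); // e.g. exists-check + createImmutableFileInPath
                return;       // success: stop retrying
            } catch (IOException e) {
                last = e;     // transient failure: try again
            }
        }
        throw last;           // all attempts failed
    }
}
```

Wrapping the "if the meta file does not exist, create it" step this way lets a task that loses the creation race to another (e.g. speculative) task succeed on the next attempt, because the exists-check then short-circuits.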
[jira] [Updated] (HUDI-7515) Fix partition metadata write failure
[ https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7515: - Status: Patch Available (was: In Progress) > Fix partition metadata write failure > > > Key: HUDI-7515 > URL: https://issues.apache.org/jira/browse/HUDI-7515 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > Attachments: screenshot-1.png > > > Avoid failing to write partition metadata. When spark.speculation is enabled, > if the write-metadata operation becomes slow for some reason, a speculative task > will be started that writes the same metadata file concurrently. > In HDFS, two tasks (e.g. one being a speculative task) writing to the same file could > both throw an exception like: > {code:bash} > File does not exist: > /path/to/table/a=3519/b=3520/c=3521/.hoodie_partition_metadata_112 (inode > 48415575374) Holder DFSClient_NONMAPREDUCE_-2108606624_29 does not have any > open files. > {code}
[jira] [Updated] (HUDI-7515) Fix partition metadata write failure
[ https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7515: - Sprint: Sprint 2024-03-25
[jira] [Updated] (HUDI-7515) Fix partition metadata write failure
[ https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7515: - Status: In Progress (was: Open)
[jira] [Updated] (HUDI-7515) Fix partition metadata write failure
[ https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7515: - Fix Version/s: 0.15.0, 1.0.0
[jira] [Assigned] (HUDI-7515) Fix partition metadata write failure
[ https://issues.apache.org/jira/browse/HUDI-7515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen reassigned HUDI-7515: Assignee: Danny Chen
Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]
danny0405 commented on issue #10914: URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065542560 You may need to read this doc first: https://www.yuque.com/yuzhao-my9fz/kb/flqll8? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065517817 ## CI report: * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065473833 ## CI report: * e5a2713d07581824214bcc7b9321e3d1cb371c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23322) * 82029e70eec8c77e1c64bf9f751200c6962777ec Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23351) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065473691 ## CI report: * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349) * 89078f34a2dafff26d47d8a201a59d8bf8a540ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23350) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Make ordering deterministic in small file selection [hudi]
hudi-bot commented on PR #11008: URL: https://github.com/apache/hudi/pull/11008#issuecomment-2065467861 ## CI report: * e5a2713d07581824214bcc7b9321e3d1cb371c02 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23322) * 82029e70eec8c77e1c64bf9f751200c6962777ec UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065467751 ## CI report: * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349) * 89078f34a2dafff26d47d8a201a59d8bf8a540ba UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065415574 ## CI report: * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] Flink-Hudi - Upsert into the same Hudi table via two different Flink pipelines (stream and batch) [hudi]
ChiehFu commented on issue #10914: URL: https://github.com/apache/hudi/issues/10914#issuecomment-2065368620 In addition, I found some duplicates written by my bulk_insert batch job 1 and upsert stream job 2 (the one that had index bootstrap enabled). For the bulk_insert batch job, `write.precombine` was set to `true`, so there shouldn't be any duplicates in the result table? For the upsert stream job, `write.precombine` was set to `true` and the index bootstrap task had parallelism set to `480`. I found this previous issue https://github.com/apache/hudi/issues/4881 which suggests duplicates can happen when the index bootstrap task parallelism is > 1. Is that still the case in Hudi 0.14.1? The table that needs to be index bootstrapped is large, so I am not sure if setting parallelism to `1` would work.
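For context on the `write.precombine` question above: precombine deduplicates records by key within a single write batch, keeping the record with the highest precombine (ordering) value, so by itself it cannot prevent duplicates introduced across two independent pipelines or by index-bootstrap parallelism. A rough, self-contained sketch of the batch-level semantics, with illustrative field names:

```java
import java.util.*;

public class PrecombineSketch {
  static class Rec {
    final String key; final long orderingVal; final String payload;
    Rec(String key, long orderingVal, String payload) {
      this.key = key; this.orderingVal = orderingVal; this.payload = payload;
    }
  }

  // Deduplicate one write batch by record key, keeping the record with the
  // largest ordering value for each key (what precombine does conceptually).
  static Map<String, Rec> precombine(List<Rec> batch) {
    Map<String, Rec> byKey = new LinkedHashMap<>();
    for (Rec r : batch) {
      byKey.merge(r.key, r, (a, b) -> a.orderingVal >= b.orderingVal ? a : b);
    }
    return byKey;
  }

  public static void main(String[] args) {
    List<Rec> batch = Arrays.asList(
        new Rec("id1", 1, "old"),
        new Rec("id1", 2, "new"),   // higher ordering value wins for id1
        new Rec("id2", 5, "only"));
    Map<String, Rec> deduped = precombine(batch);
    System.out.println(deduped.get("id1").payload + " " + deduped.size());  // → new 2
  }
}
```

Two separate jobs each running this over their own batches can still both emit a record for the same key, which is why cross-pipeline duplicates are a separate concern from precombine.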
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065344992 ## CI report: * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348) * 96a371f7fca39943737731bd18b9e52af37955e8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23349) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2065335907 ## CI report: * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348) * 96a371f7fca39943737731bd18b9e52af37955e8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7588) Replace hadoop Configuration with StorageConfiguration in hudi-common module
[ https://issues.apache.org/jira/browse/HUDI-7588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7588: Status: In Progress (was: Open) > Replace hadoop Configuration with StorageConfiguration in hudi-common module > > > Key: HUDI-7588 > URL: https://issues.apache.org/jira/browse/HUDI-7588 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > >
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
yihua closed pull request #10980: [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance URL: https://github.com/apache/hudi/pull/10980
Re: [PR] [HUDI-7532] Include only compaction instants for lastCompaction in getDeltaCommitsSinceLatestCompaction [hudi]
nsivabalan commented on code in PR #10915: URL: https://github.com/apache/hudi/pull/10915#discussion_r1571247566 ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCExtractor.java: ## @@ -114,6 +114,24 @@ public Map> extractCDCFileSplits() { ValidationUtils.checkState(commits != null, "Empty commits"); Map> fgToCommitChanges = new HashMap<>(); + Review Comment: nope. I am saying HoodieCDCExtractor.java has an inherent bug which I am trying to fix here. Let's say the timeline is as follows: dc1, dc2, rc3, dc4, clean5 (clean5 cleans up data files from dc1 and dc2 since they were replaced by rc3). As per master, HoodieCDCExtractor goes over the commit metadata in the active timeline and tries to deduce base files for the log files found. In this case, all data files from dc1 and dc2 have already been deleted by clean5, but HoodieCDCExtractor still tries to parse data files from dc1 and dc2, so we might hit a file-not-found issue on master.
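The guard being described, skipping instants whose data files were replaced and then cleaned before deducing base files for CDC splits, could be simulated with plain collections as below. The timeline model and all names are illustrative only, not HoodieCDCExtractor's actual API.

```java
import java.util.*;

public class CdcInstantFilter {
  // Keep only instants whose data files can still exist on storage: an instant
  // is skipped if every file group it touched has since been cleaned up
  // (e.g., because a replacecommit replaced it and a clean removed the files).
  static List<String> instantsWithLiveFiles(List<String> timeline,
                                            Map<String, Set<String>> instantToFileGroups,
                                            Set<String> cleanedFileGroups) {
    List<String> live = new ArrayList<>();
    for (String instant : timeline) {
      Set<String> fgs = instantToFileGroups.getOrDefault(instant, Collections.emptySet());
      // live if at least one touched file group has not been cleaned up
      boolean anyLive = fgs.stream().anyMatch(fg -> !cleanedFileGroups.contains(fg));
      if (anyLive) {
        live.add(instant);
      }
    }
    return live;
  }

  public static void main(String[] args) {
    // dc1/dc2 wrote fg1/fg2; rc3 replaced them into fg3; clean5 removed fg1 and fg2.
    Map<String, Set<String>> touched = new HashMap<>();
    touched.put("dc1", Set.of("fg1"));
    touched.put("dc2", Set.of("fg2"));
    touched.put("rc3", Set.of("fg3"));
    touched.put("dc4", Set.of("fg3"));
    List<String> live = instantsWithLiveFiles(
        List.of("dc1", "dc2", "rc3", "dc4"), touched, Set.of("fg1", "fg2"));
    System.out.println(live);  // → [rc3, dc4]
  }
}
```

With the dc1/dc2/rc3/dc4/clean5 timeline from the comment, only rc3 and dc4 survive the filter, so the extractor would never try to open the cleaned-up files from dc1 and dc2.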
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064993735 ## CI report: * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064969766 ## CI report: * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23347) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064968952 ## CI report: * 72e09f67466fcfa61b0ec555fc2eecfa52fbb856 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23285) * 24be89663d1b95cf7db83dd39378a675a54b98fc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23348) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Closed] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7635. --- Resolution: Fixed > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Closed] (HUDI-7636) Make StoragePath Serializable
[ https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7636. --- Resolution: Fixed > Make StoragePath Serializable > - > > Key: HUDI-7636 > URL: https://issues.apache.org/jira/browse/HUDI-7636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Closed] (HUDI-7637) Make StoragePathInfo Comparable
[ https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-7637. --- Resolution: Fixed > Make StoragePathInfo Comparable > --- > > Key: HUDI-7637 > URL: https://issues.apache.org/jira/browse/HUDI-7637 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7639) Refactor HoodieFileIndex so that different indexes can be used via optimizer rules
[ https://issues.apache.org/jira/browse/HUDI-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7639: Sprint: Sprint 2024-03-25 > Refactor HoodieFileIndex so that different indexes can be used via optimizer > rules > -- > > Key: HUDI-7639 > URL: https://issues.apache.org/jira/browse/HUDI-7639 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > Currently, `HoodieFileIndex` is responsible for partition pruning as well as > file skipping. All indexes are being used in > [lookupCandidateFilesInMetadataTable|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L333] > method through if-else branches. This is not only hard to maintain as we add > more indexes, but also induces a static hierarchy. Instead, we need more > flexibility so that we can alter logical plan based on availability of > indexes. For partition pruning in Spark, we already have > [HoodiePruneFileSourcePartitions|https://github.com/apache/hudi/blob/b5b14f7d4fa6224a6674b021664b510c6ae8afb9/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodiePruneFileSourcePartitions.scala#L40] > rule but it is injected during the operator optimization batch and it does > not modify the result of the LogicalPlan. To be fully extensible, we should > be able to rewrite the LogicalPlan. We should be able to inject rules after > partition pruning after the operator optimization batch and before any CBO > rules that depend on stats. Spark provides > [injectPreCBORules|https://github.com/apache/spark/blob/6232085227ee2cc4e831996a1ac84c27868a1595/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala#L304] > API to do so, however it is only available in Spark 3.1.0 onwards. 
> The goal of this ticket is to refactor the index hierarchy and create new rules > such that Spark versions < 3.1.0 still go via the old path, while later > versions can modify the plan using an appropriate index and inject it as a > pre-CBO rule.
[jira] [Updated] (HUDI-7633) Use try with resources for AutoCloseable
[ https://issues.apache.org/jira/browse/HUDI-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7633: Sprint: Sprint 2024-03-25 > Use try with resources for AutoCloseable > > > Key: HUDI-7633 > URL: https://issues.apache.org/jira/browse/HUDI-7633 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7635) Add default block size and openSeekable APIs to HoodieStorage
[ https://issues.apache.org/jira/browse/HUDI-7635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7635: Sprint: Sprint 2024-03-25 > Add default block size and openSeekable APIs to HoodieStorage > - > > Key: HUDI-7635 > URL: https://issues.apache.org/jira/browse/HUDI-7635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7638) Add metrics to HoodieStorage implementation that is not hadoop-dependent
[ https://issues.apache.org/jira/browse/HUDI-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7638: Sprint: Sprint 2024-03-25 > Add metrics to HoodieStorage implementation that is not hadoop-dependent > > > Key: HUDI-7638 > URL: https://issues.apache.org/jira/browse/HUDI-7638 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7637) Make StoragePathInfo Comparable
[ https://issues.apache.org/jira/browse/HUDI-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7637: Sprint: Sprint 2024-03-25 > Make StoragePathInfo Comparable > --- > > Key: HUDI-7637 > URL: https://issues.apache.org/jira/browse/HUDI-7637 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Updated] (HUDI-7634) Rename HoodieStorage APIs
[ https://issues.apache.org/jira/browse/HUDI-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7634: Sprint: Sprint 2024-03-25 > Rename HoodieStorage APIs > - > > Key: HUDI-7634 > URL: https://issues.apache.org/jira/browse/HUDI-7634 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > > > getHoodieStorage -> getStorage > listDirectEntries -> listDirectory
[jira] [Updated] (HUDI-7636) Make StoragePath Serializable
[ https://issues.apache.org/jira/browse/HUDI-7636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7636: Sprint: Sprint 2024-03-25 > Make StoragePath Serializable > - > > Key: HUDI-7636 > URL: https://issues.apache.org/jira/browse/HUDI-7636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
[jira] [Closed] (HUDI-6497) Replace FileSystem, Path, and FileStatus usage in hudi-common
[ https://issues.apache.org/jira/browse/HUDI-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6497. --- Resolution: Fixed > Replace FileSystem, Path, and FileStatus usage in hudi-common > - > > Key: HUDI-6497 > URL: https://issues.apache.org/jira/browse/HUDI-6497 > Project: Apache Hudi > Issue Type: Improvement > Components: code-quality >Reporter: Vinoth Chandar >Assignee: Ethan Guo >Priority: Blocker > Labels: hoodie-storage, pull-request-available > Fix For: 0.15.0, 1.0.0 > >
Re: [I] [Inquiry] Does HoodieIndexer can Do Indexing for RLI Async Fashion [hudi]
soumilshah1995 commented on issue #10815: URL: https://github.com/apache/hudi/issues/10815#issuecomment-2064849319 Thanks for the heads up, guys
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064823328 ## CI report: * 98f4d4d4b61df443ca8c46078d921919f83e8595 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23324) * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23347) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2064822612 ## CI report: * 72e09f67466fcfa61b0ec555fc2eecfa52fbb856 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23285) * 24be89663d1b95cf7db83dd39378a675a54b98fc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]
hudi-bot commented on PR #11043: URL: https://github.com/apache/hudi/pull/11043#issuecomment-2064800265 ## CI report: * 98f4d4d4b61df443ca8c46078d921919f83e8595 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23324) * 5f8a1f1f175c99c5f1fb36c46de04cee1eaab88e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7634] Rename HoodieStorage APIs [hudi]
nsivabalan commented on PR #11047: URL: https://github.com/apache/hudi/pull/11047#issuecomment-2064791962 #getHoodieStorage -> #getStorage Since this is in tests, I am OK with it. If it had been in source code, we should align the method name with the class name. OK with the patch.
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571170811 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -242,7 +252,44 @@ protected Pair, Schema> getRecordsIterator(HoodieDataBlock d } else { blockRecordsIterator = dataBlock.getEngineRecordIterator(readerContext); } -return Pair.of(blockRecordsIterator, dataBlock.getSchema()); +Option, Schema>> schemaEvolutionTransformerOpt = Review Comment: Just the buffers for now. It's possible. IDK what the perf hit would be to read the file footer twice.
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571167483 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -706,6 +709,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { } test("Test schema auto evolution") { +//This test will be flakey for mor until [HUDI-6798] is landed and we can set the merge mode Review Comment: Looks like this test was fixed when the default payload was changed. Now it should work fine with the fg reader
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571162253 ## hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/spark/execution/datasources/parquet/TestHoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -34,63 +32,17 @@ class TestHoodieFileGroupReaderBasedParquetFileFormat extends SparkClientFunctio IsNotNull("non_key_column"), EqualTo("non_key_column", 1) ) -val filtersWithoutKeyColumn = HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters( +val filtersWithoutKeyColumn = SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters( filters, "key_column"); assertEquals(0, filtersWithoutKeyColumn.size) val filtersWithKeys = Seq( EqualTo("key_column", 1), GreaterThan("non_key_column", 2) ) -val filtersWithKeyColumn = HoodieFileGroupReaderBasedParquetFileFormat.getRecordKeyRelatedFilters( +val filtersWithKeyColumn = SparkFileFormatInternalRowReaderContext.getRecordKeyRelatedFilters( filtersWithKeys, "key_column") assertEquals(1, filtersWithKeyColumn.size) assertEquals("key_column", filtersWithKeyColumn.head.references.head) } - - @Test - def testGetAppliedRequiredSchema(): Unit = { -val fields = Array( - StructField("column_a", LongType, nullable = false), - StructField("column_b", StringType, nullable = false)) -val requiredSchema = StructType(fields) - -val appliedSchema: StructType = HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema( - requiredSchema, shouldUseRecordPosition = true, "row_index") -if (HoodieSparkUtils.gteqSpark3_5) { - assertEquals(3, appliedSchema.fields.length) -} else { - assertEquals(2, appliedSchema.fields.length) -} - -val schemaWithoutRowIndexColumn = HoodieFileGroupReaderBasedParquetFileFormat.getAppliedRequiredSchema( - requiredSchema, shouldUseRecordPosition = false, "row_index") -assertEquals(2, schemaWithoutRowIndexColumn.fields.length) - } - - @Test - def testGetAppliedFilters(): Unit = { -val filters = Seq( Review Comment: 
No. They don't make sense anymore to have.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
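The behaviour under test above — keeping only the pushed-down filters that reference the record key column — can be sketched without a Spark dependency. The `Filter` model below is a simplified, hypothetical stand-in for Spark's `sources.Filter` hierarchy, not Hudi's actual implementation:

```scala
object FilterSketch {
  // Minimal stand-ins for Spark's sources.Filter classes (illustrative only).
  sealed trait Filter { def references: Seq[String] }
  case class EqualTo(attribute: String, value: Any) extends Filter {
    def references: Seq[String] = Seq(attribute)
  }
  case class IsNotNull(attribute: String) extends Filter {
    def references: Seq[String] = Seq(attribute)
  }
  case class GreaterThan(attribute: String, value: Any) extends Filter {
    def references: Seq[String] = Seq(attribute)
  }

  // Keep only the filters that mention the record key column, as the test expects.
  def recordKeyRelatedFilters(filters: Seq[Filter], keyColumn: String): Seq[Filter] =
    filters.filter(_.references.contains(keyColumn))

  def main(args: Array[String]): Unit = {
    val noKeyFilters = Seq(IsNotNull("non_key_column"), EqualTo("non_key_column", 1))
    assert(recordKeyRelatedFilters(noKeyFilters, "key_column").isEmpty)

    val mixed = Seq(EqualTo("key_column", 1), GreaterThan("non_key_column", 2))
    val kept = recordKeyRelatedFilters(mixed, "key_column")
    assert(kept.size == 1)
    assert(kept.head.references.head == "key_column")
  }
}
```

This mirrors the two assertions in the test: zero filters survive when none touch the key column, and exactly one survives from the mixed list.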
Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571163565 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -138,6 +138,7 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { spark.sessionState.catalog.dropTable(TableIdentifier(tableName), true, true) spark.sessionState.catalog.refreshTable(TableIdentifier(tableName)) spark.sessionState.conf.unsetConf(DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION.key) + spark.sessionState.conf.unsetConf("spark.sql.storeAssignmentPolicy") Review Comment: This config was leaking between tests.
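The fix follows a standard test-isolation pattern: any session-level conf a test sets must be unset before the next test runs. A minimal sketch of the leak, using a plain mutable map as a hypothetical stand-in for the Spark session conf (names are illustrative, not Hudi's real test code):

```scala
object ConfLeakSketch {
  // Session-scoped mutable conf shared by every test in the suite.
  val conf = scala.collection.mutable.Map[String, String]()

  def testWithPolicy(): Unit = {
    conf("spark.sql.storeAssignmentPolicy") = "LEGACY" // needed by this test only
    // ... test body ...
    conf.remove("spark.sql.storeAssignmentPolicy") // the fix: unset before leaving
  }

  def laterTest(): Unit =
    // Without the remove above, this test would silently run under LEGACY semantics.
    assert(!conf.contains("spark.sql.storeAssignmentPolicy"))

  def main(args: Array[String]): Unit = {
    testWithPolicy()
    laterTest()
  }
}
```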
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571161187 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -129,20 +138,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, file.partitionValues match { // Snapshot or incremental queries. case fileSliceMapping: HoodiePartitionFileSliceMapping => - val filePath = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file) - val filegroupName = if (FSUtils.isLogFile(filePath)) { -FSUtils.getFileId(filePath.getName).substring(1) - } else { -FSUtils.getFileId(filePath.getName) - } + val filegroupName = FSUtils.getFileIdFromFilePath(sparkAdapter +.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)) fileSliceMapping.getSlice(filegroupName) match { case Some(fileSlice) if !isCount => if (requiredSchema.isEmpty && !fileSlice.getLogFiles.findAny().isPresent) { val hoodieBaseFile = fileSlice.getBaseFile.get() baseFileReader(createPartitionedFile(fileSliceMapping.getPartitionValues, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)) } else { -val readerContext: HoodieReaderContext[InternalRow] = new SparkFileFormatInternalRowReaderContext( - readerMaps) +val readerContext = new SparkFileFormatInternalRowReaderContext(parquetFileReader.value, tableState.recordKeyField, filters) Review Comment: We also have not moved any filter logic into the filegroup reader; that is still handled by the Spark context for now. tableState.recordKeyField is used only for filtering. We cannot get rid of the parquetFileReader param: it contains only Spark-specific configs.
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571157403 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -129,20 +138,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, file.partitionValues match { // Snapshot or incremental queries. case fileSliceMapping: HoodiePartitionFileSliceMapping => - val filePath = sparkAdapter.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file) - val filegroupName = if (FSUtils.isLogFile(filePath)) { -FSUtils.getFileId(filePath.getName).substring(1) - } else { -FSUtils.getFileId(filePath.getName) - } + val filegroupName = FSUtils.getFileIdFromFilePath(sparkAdapter +.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file)) fileSliceMapping.getSlice(filegroupName) match { case Some(fileSlice) if !isCount => if (requiredSchema.isEmpty && !fileSlice.getLogFiles.findAny().isPresent) { val hoodieBaseFile = fileSlice.getBaseFile.get() baseFileReader(createPartitionedFile(fileSliceMapping.getPartitionValues, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)) } else { -val readerContext: HoodieReaderContext[InternalRow] = new SparkFileFormatInternalRowReaderContext( - readerMaps) +val readerContext = new SparkFileFormatInternalRowReaderContext(parquetFileReader.value, tableState.recordKeyField, filters) Review Comment: The intention is to avoid duplicating the logic for creating the readers and then passing around a map of already-created readers. Now we create the readers on the executor. Previously we called buildReaderWithPartitionValues() a few times and kept a map from schema hash to PartitionedFile => Iterator[InternalRow]. Now we have a single reader whose read method we call on the executor, passing in whatever schema and filters we want for each file. This removes the limitation that the schema and filters must be known on the driver.
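The before/after shape described in that comment can be sketched with hypothetical stand-in types (not Hudi's real API):

```scala
object ReaderSketch {
  // Before: the driver pre-built one reader per schema and shipped a map keyed by
  // schema hash, so executors were limited to schemas and filters fixed on the driver.
  val prebuilt: Map[Int, String => Iterator[Map[String, Any]]] = Map(
    Seq("a", "b").hashCode -> (file => Iterator(Map("a" -> 1, "b" -> 2))))

  // After: ship a single reader object and let the executor choose the schema
  // (and filters) per file at read time.
  class ExecutorSideReader {
    def read(file: String, schema: Seq[String]): Iterator[Map[String, Any]] =
      Iterator(schema.map(col => col -> 0).toMap) // stub: project requested columns
  }

  def main(args: Array[String]): Unit = {
    // The executor can now request any projection, not just the pre-built ones.
    val rows = new ExecutorSideReader().read("part-0.parquet", Seq("a", "b")).toList
    assert(rows.head.keySet == Set("a", "b"))
  }
}
```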
jonvex commented on code in PR #10957: URL: https://github.com/apache/hudi/pull/10957#discussion_r1571152424 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -107,19 +112,23 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, val dataSchema = StructType(tableSchema.structTypeSchema.fields.filterNot(f => partitionColumns.contains(f.name))) val outputSchema = StructType(requiredSchema.fields ++ partitionSchema.fields) spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult) -val requiredSchemaWithMandatory = generateRequiredSchemaWithMandatory(requiredSchema, dataSchema, partitionSchema) -val isCount = requiredSchemaWithMandatory.isEmpty -val requiredSchemaSplits = requiredSchemaWithMandatory.fields.partition(f => HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(f.name)) -val requiredMeta = StructType(requiredSchemaSplits._1) -val requiredWithoutMeta = StructType(requiredSchemaSplits._2) +val isCount = requiredSchema.isEmpty && !isMOR && !isIncremental val augmentedHadoopConf = FSUtils.buildInlineConf(hadoopConf) -val (baseFileReader, preMergeBaseFileReader, readerMaps, cdcFileReader) = buildFileReaders( - spark, dataSchema, partitionSchema, requiredSchema, filters, options, augmentedHadoopConf, - requiredSchemaWithMandatory, requiredWithoutMeta, requiredMeta) +setSchemaEvolutionConfigs(augmentedHadoopConf, options) +val baseFileReader = super.buildReaderWithPartitionValues(spark, dataSchema, partitionSchema, requiredSchema, + filters ++ requiredFilters, options, new Configuration(augmentedHadoopConf)) +val cdcFileReader = super.buildReaderWithPartitionValues( + spark, + tableSchema.structTypeSchema, + StructType(Nil), + tableSchema.structTypeSchema, + Nil, + options, + new Configuration(hadoopConf)) val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requiredSchema, sanitizedTableName) val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName) - +val parquetFileReader = spark.sparkContext.broadcast(sparkAdapter.createParquetFileReader(supportBatchResult, spark.sessionState.conf, options, augmentedHadoopConf)) Review Comment: No. The Spark confs don't make it to the executors, if you remember. Instantiating the reader on the driver just captures the values of the configs we need so that we can send them to the executor.