Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026285922 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -103,6 +104,15 @@ case class HoodieFileIndex(spark: SparkSession, endCompletionTime = options.get(DataSourceReadOptions.END_COMMIT.key)) with FileIndex { @transient protected var hasPushedDownPartitionPredicates: Boolean = false + private val isPartitionSimpleBucketIndex = PartitionBucketIndexUtils.isPartitionSimpleBucketIndex(spark.sparkContext.hadoopConfiguration, +metaClient.getBasePath.toString) + + @transient private lazy val bucketIndexSupport = if (isPartitionSimpleBucketIndex) { +val specifiedQueryInstant = options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant) +new PartitionBucketIndexSupport(spark, metadataConfig, metaClient, specifiedQueryInstant) Review Comment: Same as `TimeTravel` scenarios mentioned above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021979593 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); Review Comment: Can we assume there are always config files once the partition bucket index is enabled? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026498805 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1321,6 +1321,49 @@ void testBulkInsertWithPartitionBucketIndex(String operationType, String tableTy assertEquals(expected.stream().sorted().collect(Collectors.toList()), actual.stream().sorted().collect(Collectors.toList())); } + @Test + void tesQueryWithPartitionBucketIndexPruning() { +String operationType = "upsert"; +String tableType = "MERGE_ON_READ"; +TableEnvironment tableEnv = batchTableEnv; +// csv source +String csvSourceDDL = TestConfigurations.getCsvSourceDDL("csv_source", "test_source_5.data"); +tableEnv.executeSql(csvSourceDDL); +String catalogName = "hudi_" + operationType; +String hudiCatalogDDL = catalog(catalogName) +.catalogPath(tempFile.getAbsolutePath()) +.end(); + +tableEnv.executeSql(hudiCatalogDDL); +String dbName = "hudi"; +tableEnv.executeSql("create database " + catalogName + "." + dbName); +String basePath = tempFile.getAbsolutePath() + "/hudi/hoodie_sink"; + +String hoodieTableDDL = sql(catalogName + ".hudi.hoodie_sink") +.option(FlinkOptions.PATH, basePath) +.option(FlinkOptions.TABLE_TYPE, tableType) +.option(FlinkOptions.OPERATION, operationType) +.option(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT, true) +.option(FlinkOptions.INDEX_TYPE, "BUCKET") +.option(FlinkOptions.HIVE_STYLE_PARTITIONING, "true") +.option(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, "1") +.option(FlinkOptions.BUCKET_INDEX_PARTITION_RULE, "regex") +.option(FlinkOptions.BUCKET_INDEX_PARTITION_EXPRESSIONS, "partition=(par1|par2),2") +.end(); +tableEnv.executeSql(hoodieTableDDL); + +String insertInto = "insert into " + catalogName + ".hudi.hoodie_sink select * from csv_source"; +execInsertSql(tableEnv, insertInto); + +List result1 = CollectionUtil.iterableToList( Review Comment: use `execSelectSql(TableEnvironment tEnv, String select)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
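A minimal sketch of the suggested change, assuming the `execSelectSql(TableEnvironment tEnv, String select)` test helper named in the review returns the collected rows as a `List<Row>`; the `uuid` predicate is an illustrative guess at what the pruned query could look like, not code from the PR:

```java
// Hypothetical rewrite of the result collection in tesQueryWithPartitionBucketIndexPruning:
// use the existing test helper instead of CollectionUtil.iterableToList over the table iterator.
// The equality predicate on the record key is assumed so that bucket pruning is exercised.
List<Row> result1 = execSelectSql(tableEnv,
    "select * from " + catalogName + ".hudi.hoodie_sink where uuid = 'id1'");
```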
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026283652 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestPartitionBucketPruning.java: ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table; + +import org.apache.hudi.common.model.PartitionBucketIndexHashingConfig; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.index.bucket.BucketIdentifier; +import org.apache.hudi.source.prune.PrimaryKeyPruners; +import org.apache.hudi.storage.StoragePath; +import org.apache.hudi.storage.StoragePathInfo; +import org.apache.hudi.util.SerializableSchema; +import org.apache.hudi.util.StreamerUtil; +import org.apache.hudi.utils.TestConfigurations; +import org.apache.hudi.utils.TestData; + +import org.apache.flink.configuration.Configuration; +import org.apache.flink.table.api.DataTypes; +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.types.DataType; +import org.apache.hadoop.fs.Path; +import org.hamcrest.CoreMatchers; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.core.Is.is; + +public class TestPartitionBucketPruning { Review Comment: change also add a IT named `tesQueryWithPartitionBucketIndexPruning` to do query result validation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776205596 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603) * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 merged PR #13060: URL: https://github.com/apache/hudi/pull/13060 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780207733 > I have made another patch to fix the bucket pruning logging: [fix_the_bucket_pruning_logging.patch.zip](https://github.com/user-attachments/files/19612894/fix_the_bucket_pruning_logging.patch.zip) Done. Also CI passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780204472 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4672) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780163840 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670) * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4672) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780163078 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659) * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670) * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780135335 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659) * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780130591 I have made another patch to fix the bucket pruning logging: [13060_2.patch.zip](https://github.com/user-attachments/files/19612840/13060_2.patch.zip) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780128938 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659) * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2029646256 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -157,28 +157,34 @@ public List getFilesInPartitions() { if (partitions.length < 1) { return Collections.emptyList(); } -List allFiles = FSUtils.getFilesInPartitions( -new HoodieFlinkEngineContext(hadoopConf), -new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions) -.values().stream() -.flatMap(Collection::stream) -.collect(Collectors.toList()); +Map> partition2Files = FSUtils.getFilesInPartitions( +new HoodieFlinkEngineContext(hadoopConf), +new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions); + +List allFiles; +// bucket pruning +if (this.partitionBucketIdFunc != null) { + allFiles = partition2Files.entrySet().stream().flatMap(entry -> { +String partitionPath = entry.getKey(); +int bucketId = partitionBucketIdFunc.apply(partitionPath); +String bucketIdStr = BucketIdentifier.bucketIdStr(bucketId); +List filesInPartition = entry.getValue(); +return filesInPartition.stream().filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr)); + }).collect(Collectors.toList()); +} else { + allFiles = FSUtils.getFilesInPartitions( Review Comment: Opsss. changed this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2029632471 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -157,28 +157,34 @@ public List getFilesInPartitions() { if (partitions.length < 1) { return Collections.emptyList(); } -List allFiles = FSUtils.getFilesInPartitions( -new HoodieFlinkEngineContext(hadoopConf), -new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions) -.values().stream() -.flatMap(Collection::stream) -.collect(Collectors.toList()); +Map> partition2Files = FSUtils.getFilesInPartitions( +new HoodieFlinkEngineContext(hadoopConf), +new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions); + +List allFiles; +// bucket pruning +if (this.partitionBucketIdFunc != null) { + allFiles = partition2Files.entrySet().stream().flatMap(entry -> { +String partitionPath = entry.getKey(); +int bucketId = partitionBucketIdFunc.apply(partitionPath); +String bucketIdStr = BucketIdentifier.bucketIdStr(bucketId); +List filesInPartition = entry.getValue(); +return filesInPartition.stream().filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr)); + }).collect(Collectors.toList()); +} else { + allFiles = FSUtils.getFilesInPartitions( Review Comment: should use `partition2Files` to avoid redundant files fetching. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
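A rough sketch of the reviewer's point, reusing the already fetched `partition2Files` map in both branches so `FSUtils.getFilesInPartitions` is called only once; names are taken from the quoted diff, and this is not the final committed code:

```java
Map<String, List<StoragePathInfo>> partition2Files = FSUtils.getFilesInPartitions(
    new HoodieFlinkEngineContext(hadoopConf),
    new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)),
    metadataConfig, path.toString(), partitions);

List<StoragePathInfo> allFiles;
if (this.partitionBucketIdFunc != null) {
  // bucket pruning: keep only the files whose name carries the pruned bucket id of that partition
  allFiles = partition2Files.entrySet().stream()
      .flatMap(entry -> {
        String bucketIdStr = BucketIdentifier.bucketIdStr(partitionBucketIdFunc.apply(entry.getKey()));
        return entry.getValue().stream()
            .filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr));
      })
      .collect(Collectors.toList());
} else {
  // no pruning: flatten the listing that was already fetched instead of listing the storage again
  allFiles = partition2Files.values().stream()
      .flatMap(Collection::stream)
      .collect(Collectors.toList());
}
```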
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026506692 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -179,7 +180,7 @@ public HoodieTableSource( @Nullable List predicates, @Nullable ColumnStatsProbe columnStatsProbe, @Nullable PartitionPruners.PartitionPruner partitionPruner, - int dataBucketHashing, + Option> dataBucket, Review Comment: dataBucket -> dataBucketFunc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026507577 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -698,12 +699,7 @@ public ColumnStatsProbe getColumnStatsProbe() { } @VisibleForTesting - public int getDataBucket() { -return BucketIdentifier.getBucketId(this.dataBucketHashing, conf.getInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS)); - } - - @VisibleForTesting - public int getDataBucketHashing() { -return dataBucketHashing; + public Option> getDataBucket() { Review Comment: getDataBucketFunc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778649224 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651) * 2d79e40b46ab9af5b518314c9031bfc46434841d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2025864002 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -45,7 +45,7 @@ public class PrimaryKeyPruners { public static final int BUCKET_ID_NO_PRUNING = -1; - public static int getBucketId(List hashKeyFilters, Configuration conf) { + public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) { Review Comment: Maybe we evolve this dataBucket to a function `(num_buckets_per_partition) -> (int)bucketId` to make it somehow more flexible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
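A sketch of the suggested shape, with the pruning result carried as an optional function from the per-partition bucket number to the concrete bucket id. `getBucketFieldHashing`, `BUCKET_ID_NO_PRUNING` and `BucketIdentifier.getBucketId` appear in the quoted code; wrapping the result into an `Option<Function<Integer, Integer>>` is the evolution being proposed here, not necessarily the final API:

```java
// Hash of the bucket key literals extracted from the pushed-down equality filters.
int hashing = PrimaryKeyPruners.getBucketFieldHashing(indexKeyFilters, conf);

// Defer the bucket id computation until the bucket number of a concrete partition is known,
// since with the partition-level bucket index it can differ from partition to partition.
Option<Function<Integer, Integer>> dataBucketFunc =
    hashing == PrimaryKeyPruners.BUCKET_ID_NO_PRUNING
        ? Option.empty()
        : Option.of(numBucketsPerPartition -> BucketIdentifier.getBucketId(hashing, numBucketsPerPartition));
```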
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026506290 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -158,7 +158,7 @@ public class HoodieTableSource implements private List predicates; private ColumnStatsProbe columnStatsProbe; private PartitionPruners.PartitionPruner partitionPruner; - private int dataBucketHashing; + private Option> dataBucket; Review Comment: dataBucket -> dataBucketFunc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: Yes, Danny. After a bucket rescale is completed, the data layout changes. Therefore, for Spark's time travel, i.e. querying the snapshot view as of a specific time point rather than an incremental query (https://hudi.apache.org/docs/sql_queries#time-travel-query), such as a query like `SELECT * FROM TIMESTAMP AS OF WHERE `, Hudi initializes specifiedQueryTimestamp through HoodieBaseRelation: ``` protected lazy val specifiedQueryTimestamp: Option[String] = optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key) .map(HoodieSqlCommonUtils.formatQueryInstant) ``` It then resolves the `Schema` and builds the `fsView` based on `specifiedQueryTimestamp`. To construct the FsView, Hudi calls `getLatestMergedFileSlicesBeforeOrOn(String partitionStr, String maxInstantTime)`, which travels the file system view back to the specified version. At this point it is also necessary to load the hashing_config that was valid at that specific timestamp, so that the historical data layout is respected: ``` protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = { queryTimestamp match { case Some(ts) => specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t)) val partitionDirs = if (globPaths.isEmpty) { fileIndex.listFiles(partitionFilters, dataFilters) } else { val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths) inMemoryFileIndex.listFiles(partitionFilters, dataFilters) } val fsView = new HoodieTableFileSystemView( metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs) .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus)) .asJava) fsView.getPartitionPaths.asScala.flatMap { partitionPath => val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath) fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala }.toSeq case _ => Seq() } } ``` -- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779329892 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028871226 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: For example: DeltaCommit1 ==> writes `C1_File1, C1_File2`; Bucket-Rescale Commit2 ==> writes `C2_File1, C2_File2, C2_File3` (replacing C1_File1, C1_File2); DeltaCommit3 ==> writes `C3_File1_Log1`; Bucket-Rescale Commit4 ==> writes `C4_File1` (replacing C2_File1, C2_File2, C2_File3 and C3_File1_Log1). For a SQL query `SELECT * FROM hudi_table TIMESTAMP AS OF ` whose as-of instant falls between Commit2 and Commit4, we need to load the hashing config of Bucket-Rescale Commit2 instead of the latest hashing config from Bucket-Rescale Commit4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
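For illustration, a minimal sketch of `getHashingConfigInstantToLoadBeforeOrOn`, under the assumption that hashing config instants are instant-time strings that compare lexicographically: it returns the greatest hashing config instant at or before the queried instant, and empty when every committed config is newer (the caller then falls back to the archived configs, as in the quoted diff). With the timeline above, an as-of instant between Commit2 and Commit4 resolves to Commit2's config.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.common.util.Option;

class HashingConfigLookupSketch {
  // Sketch only: pick the latest hashing config instant that is <= the queried instant.
  static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {
    List<String> candidates = hashingConfigInstants.stream()
        .filter(configInstant -> configInstant.compareTo(instant) <= 0)
        .sorted()
        .collect(Collectors.toList());
    return candidates.isEmpty()
        ? Option.empty()
        : Option.of(candidates.get(candidates.size() - 1));
  }
}
```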
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778828484 > Thanks for the contribution, I have made some refactoring with the given patch: > > [13060.patch.zip](https://github.com/user-attachments/files/19602219/13060.patch.zip) Thanks for your help, Danny. All changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774624280 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552) Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553) * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602) * 95c0d03c1707ea4e7d119f5751d042ad6a742664 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2777995679 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619) * b53f457aa1ef63f665add37a75d4f8e9a614780c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779015664 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779161934 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656) * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779038403 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656) * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028875514 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) { + private Option> getDataBucketFunc(List dataFilters) { if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty(); } Set indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet()); List indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList()); if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty(); Review Comment: Added a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")`. Without this PR, it fails with: ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` With this PR but using the `Always load latest hashing config` logic, it fails with the same exception: ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: Yes, Danny. After Bucket Rescale is completed, the data layout will change. Therefore, for Spark's Time Travel,something like travel to specific time point snapshot view (not Incremental Query)(https://hudi.apache.org/docs/sql_queries#time-travel-query), Such as executing a query like `SELECT * FROM TIMESTAMP AS OF WHERE `, the Hudi would init specifiedQueryTimestamp through HoodieBaseRelation. ``` protected lazy val specifiedQueryTimestamp: Option[String] = optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key) .map(HoodieSqlCommonUtils.formatQueryInstant) ``` Then get `Schema` and build `fsView` based on `specifiedQueryTimestamp` For constructing the FsView, Hudi will call `getLatestMergedFileSlicesBeforeOrOn(String partitionStr, String maxInstantTime)`, travel fs view to the specified version. At this point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp to ensure the historical data layout ``` protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = { queryTimestamp match { case Some(ts) => specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t)) val partitionDirs = if (globPaths.isEmpty) { fileIndex.listFiles(partitionFilters, dataFilters) } else { val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths) inMemoryFileIndex.listFiles(partitionFilters, dataFilters) } val fsView = new HoodieTableFileSystemView( metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs) .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus)) .asJava) fsView.getPartitionPaths.asScala.flatMap { partitionPath => val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath) fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala }.toSeq case _ => Seq() } } ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: Yes, Danny. After Bucket Rescale is completed, the data layout will change. Therefore, for Spark's Time Travel (not Incremental Query)(https://hudi.apache.org/docs/sql_queries#time-travel-query), Something like travel to specific time point snapshot view Such as executing a query like `SELECT * FROM TIMESTAMP AS OF WHERE `, the implementation would involve specifying the specifiedQueryTimestamp through HoodieBaseRelation. ``` protected lazy val specifiedQueryTimestamp: Option[String] = optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key) .map(HoodieSqlCommonUtils.formatQueryInstant) ``` When constructing the FsView will call `getLatestMergedFileSlicesBeforeOrOn` ``` protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = { queryTimestamp match { case Some(ts) => specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t)) val partitionDirs = if (globPaths.isEmpty) { fileIndex.listFiles(partitionFilters, dataFilters) } else { val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths) inMemoryFileIndex.listFiles(partitionFilters, dataFilters) } val fsView = new HoodieTableFileSystemView( metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs) .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus)) .asJava) fsView.getPartitionPaths.asScala.flatMap { partitionPath => val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath) fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala }.toSeq case _ => Seq() } } ``` Travel the view to the specified version. At this point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp to ensure the historical data layout -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028875514 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) { + private Option> getDataBucketFunc(List dataFilters) { if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty(); } Set indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet()); List indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList()); if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty(); Review Comment: Add a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")` Without This PR will throw Exception ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` With PR in `Always load latest hashing config` logic, will throw exception ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? 
res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: Add a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")` Without This PR will throw Exception ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` With PR in `Always load latest hashing config` logic, will throw exception ``` Expected Array([,.0,,2021-01-05]), but got Array() ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135) org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028871226 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: For example DeltaCommit1 ==> `C1_File1, C1_File2` Bucket-Rescale Commit2 ==> `C2_File1, C2_File2, C2_File3(Replaced C1_File1, C1_File2)` DeltaCommit 3 ==> `C3_File1_Log1` Bucket-Rescale Commit4 ==> `C4_File1(Replaced C2_File1, C2_File2, C2_File3 and C3_File1_Log1)` For Sql `SELECT * FROM hudi_table TIMESTAMP AS OF `, we need to load Bucket-Rescale Commit2 instead of load latest hashing config Bucket-Rescale Commit4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) {
Review Comment:
Yes, Danny. After Bucket Rescale is completed, the data layout will change. Therefore, for Spark's Time Travel query (not Incremental Query, see https://hudi.apache.org/docs/sql_queries#time-travel-query), such as `SELECT * FROM <table> TIMESTAMP AS OF <timestamp> WHERE <predicates>`, the implementation specifies the `specifiedQueryTimestamp` through `HoodieBaseRelation`:
```
protected lazy val specifiedQueryTimestamp: Option[String] =
  optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
    .map(HoodieSqlCommonUtils.formatQueryInstant)
```
When constructing the FsView, it will call `getLatestMergedFileSlicesBeforeOrOn`:
```
protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = {
  queryTimestamp match {
    case Some(ts) =>
      specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t))

      val partitionDirs = if (globPaths.isEmpty) {
        fileIndex.listFiles(partitionFilters, dataFilters)
      } else {
        val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths)
        inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
      }

      val fsView = new HoodieTableFileSystemView(
        metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
          .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus))
          .asJava)

      fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
        val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
        fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala
      }.toSeq

    case _ => Seq()
  }
}
```
This travels the view to the specified version. At that point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp, so that the historical data layout is honored.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
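For reference, a hedged sketch of how such a time-travel read is typically issued from the Spark DataFrame API. The option key "as.of.instant" is assumed to be the string behind `DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT`, and the timestamp and base path values are placeholders, not taken from this PR:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelReadExample {

  // Read the table as of a given instant; the instant string follows Hudi's
  // commit timestamp format, e.g. "20250401000000000" (placeholder value).
  public static Dataset<Row> readAsOf(SparkSession spark, String basePath, String instant) {
    return spark.read()
        .format("hudi")
        .option("as.of.instant", instant)
        .load(basePath);
  }
}
```

Under this kind of query, the file index would need the hashing config that was in effect at `instant`, which is what the `specifiedQueryInstant` plumbing above provides.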
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778780676 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651) * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654) * b3a978b204ecd08df87cf4116647ce95088eedc0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778784721 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654) * b3a978b204ecd08df87cf4116647ce95088eedc0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778796309 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654) * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778652838 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651) * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778359133 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778250151 Thanks for the contribution, I have made some refactoring with the given patch: [13060.patch.zip](https://github.com/user-attachments/files/19602219/13060.patch.zip) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778119931 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619) * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027921234 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) { Review Comment: > Maybe we should load the hashing configuration associated with R3 (the latest bucket rescale operation before or equal to C4). We can not do that because the data file layouts had been changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027283054 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) { + private Option> getDataBucketFunc(List dataFilters) { if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty();
Review Comment:
> It looks like the return value is always non-empty, can we just return the function instead of the option?

`getDataBucketFunc` may return `Option.empty()` here.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027280057 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -69,7 +70,7 @@ public class FileIndex implements Serializable { private final org.apache.hadoop.conf.Configuration hadoopConf; private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning private final ColumnStatsProbe colStatsProbe; // for probing column stats - private final int dataBucketHashing; // for bucket pruning + private final Option> dataBucket; // for bucket pruning
Review Comment: changed
## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -158,7 +158,7 @@ public class HoodieTableSource implements private List predicates; private ColumnStatsProbe columnStatsProbe; private PartitionPruners.PartitionPruner partitionPruner; - private int dataBucketHashing; + private Option> dataBucket;
Review Comment: changed
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027283458 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) { + private Option> getDataBucketFunc(List dataFilters) { if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty(); } Set indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet()); List indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList()); if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) { - return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING; + return Option.empty();
Review Comment:
> It looks like the return value is always non-empty, can we just return the function instead of the option?

`getDataBucketFunc` may also return `Option.empty()` here.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776622457 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776364946 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603) * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027280488 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -179,7 +180,7 @@ public HoodieTableSource( @Nullable List predicates, @Nullable ColumnStatsProbe columnStatsProbe, @Nullable PartitionPruners.PartitionPruner partitionPruner, - int dataBucketHashing, + Option> dataBucket, Review Comment: all changed ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java: ## @@ -698,12 +699,7 @@ public ColumnStatsProbe getColumnStatsProbe() { } @VisibleForTesting - public int getDataBucket() { -return BucketIdentifier.getBucketId(this.dataBucketHashing, conf.getInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS)); - } - - @VisibleForTesting - public int getDataBucketHashing() { -return dataBucketHashing; + public Option> getDataBucket() { Review Comment: changed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2027284301 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1321,6 +1321,49 @@ void testBulkInsertWithPartitionBucketIndex(String operationType, String tableTy assertEquals(expected.stream().sorted().collect(Collectors.toList()), actual.stream().sorted().collect(Collectors.toList())); } + @Test + void tesQueryWithPartitionBucketIndexPruning() { +String operationType = "upsert"; +String tableType = "MERGE_ON_READ"; +TableEnvironment tableEnv = batchTableEnv; +// csv source +String csvSourceDDL = TestConfigurations.getCsvSourceDDL("csv_source", "test_source_5.data"); +tableEnv.executeSql(csvSourceDDL); +String catalogName = "hudi_" + operationType; +String hudiCatalogDDL = catalog(catalogName) +.catalogPath(tempFile.getAbsolutePath()) +.end(); + +tableEnv.executeSql(hudiCatalogDDL); +String dbName = "hudi"; +tableEnv.executeSql("create database " + catalogName + "." + dbName); +String basePath = tempFile.getAbsolutePath() + "/hudi/hoodie_sink"; + +String hoodieTableDDL = sql(catalogName + ".hudi.hoodie_sink") +.option(FlinkOptions.PATH, basePath) +.option(FlinkOptions.TABLE_TYPE, tableType) +.option(FlinkOptions.OPERATION, operationType) +.option(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT, true) +.option(FlinkOptions.INDEX_TYPE, "BUCKET") +.option(FlinkOptions.HIVE_STYLE_PARTITIONING, "true") +.option(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, "1") +.option(FlinkOptions.BUCKET_INDEX_PARTITION_RULE, "regex") +.option(FlinkOptions.BUCKET_INDEX_PARTITION_EXPRESSIONS, "partition=(par1|par2),2") +.end(); +tableEnv.executeSql(hoodieTableDDL); + +String insertInto = "insert into " + catalogName + ".hudi.hoodie_sink select * from csv_source"; +execInsertSql(tableEnv, insertInto); + +List result1 = CollectionUtil.iterableToList( Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776210253 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603) * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026501136 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -69,7 +70,7 @@ public class FileIndex implements Serializable { private final org.apache.hadoop.conf.Configuration hadoopConf; private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning private final ColumnStatsProbe colStatsProbe; // for probing column stats - private final int dataBucketHashing; // for bucket pruning + private final Option> dataBucket; // for bucket pruning Review Comment: dataBucket -> dataBucketFunc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026505878 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -43,9 +45,7 @@ public class PrimaryKeyPruners { private static final Logger LOG = LoggerFactory.getLogger(PrimaryKeyPruners.class); - public static final int BUCKET_ID_NO_PRUNING = -1; - - public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) { + public static Option> getBucketId(List hashKeyFilters, Configuration conf) {
Review Comment: It looks like the return value is always non-empty, can we just return the function instead of the option?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026504533 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -43,9 +45,7 @@ public class PrimaryKeyPruners { private static final Logger LOG = LoggerFactory.getLogger(PrimaryKeyPruners.class); - public static final int BUCKET_ID_NO_PRUNING = -1; - - public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) { + public static Option> getBucketId(List hashKeyFilters, Configuration conf) { Review Comment: getBucketId -> getBucketIdFunc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026500443 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -69,7 +70,7 @@ public class FileIndex implements Serializable { private final org.apache.hadoop.conf.Configuration hadoopConf; private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning private final ColumnStatsProbe colStatsProbe; // for probing column stats - private final int dataBucketHashing; // for bucket pruning + private final Option> dataBucket; // for bucket pruning
Review Comment: Can we use a Java function? And can we align the indentation of the comments:
```java
// for partition pruning
// for probing column stats
// for bucket pruning
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774889183 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774730016 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602) * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774619228 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552) Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553) * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774626980 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602) * 95c0d03c1707ea4e7d119f5751d042ad6a742664 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774605936 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552) Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553) * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
hudi-bot commented on PR #13060: URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774614184 ## CI report: * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552) Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553) * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN * cbb666d497e8b4ae98e342197ac951f01bbf0a2c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026283113 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -45,7 +45,7 @@ public class PrimaryKeyPruners { public static final int BUCKET_ID_NO_PRUNING = -1; - public static int getBucketId(List hashKeyFilters, Configuration conf) { + public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) { Review Comment: Sure, changed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2026282689 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -68,25 +69,30 @@ public class FileIndex implements Serializable { private final org.apache.hadoop.conf.Configuration hadoopConf; private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning private final ColumnStatsProbe colStatsProbe; // for probing column stats - private final int dataBucket; // for bucket pruning + private final int dataBucketHashing; // for bucket pruning private List partitionPaths;// cache of partition paths private final FileStatsIndex fileStatsIndex;// for data skipping + private final NumBucketsFunction numBucketsFunction; private FileIndex( StoragePath path, Configuration conf, RowType rowType, ColumnStatsProbe colStatsProbe, PartitionPruners.PartitionPruner partitionPruner, - int dataBucket) { + int dataBucketHashing, + NumBucketsFunction numBucketsFunction) { this.path = path; this.hadoopConf = HadoopConfigurations.getHadoopConf(conf); this.tableExists = StreamerUtil.tableExists(path.toString(), hadoopConf); this.metadataConfig = StreamerUtil.metadataConfig(conf); this.colStatsProbe = isDataSkippingFeasible(conf.get(FlinkOptions.READ_DATA_SKIPPING_ENABLED)) ? colStatsProbe : null; this.partitionPruner = partitionPruner; -this.dataBucket = dataBucket; Review Comment: evolve this dataBucket to a function -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2025050558 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
Review Comment: Yes, we can. Currently, we have two ways to use partition-level bucket indexing:
1. Enabling it during table creation via DDL: when enabled at table creation, this path initializes a 0.hashing_config file through the catalog.
2. Upgrading an existing table-level bucket index via the CALL command: for tables that already use table-level bucket indexing, invoking a CALL command triggers an upgrade process. This generates a replace-commit instant and initializes a corresponding .hashing_config file.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2025037864 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -45,7 +45,7 @@ public class PrimaryKeyPruners { public static final int BUCKET_ID_NO_PRUNING = -1; - public static int getBucketId(List hashKeyFilters, Configuration conf) { + public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) {
Review Comment: Sorry Danny, I didn't get this. Is it possible to get the full partition path during the original dataBucket computation?
```
@Override
public Result applyFilters(List<ResolvedExpression> filters) {
  List<ResolvedExpression> simpleFilters = filterSimpleCallExpression(filters);
  Tuple2<List<ResolvedExpression>, List<ResolvedExpression>> splitFilters = splitExprByPartitionCall(simpleFilters, this.partitionKeys, this.tableRowType);
  this.predicates = ExpressionPredicates.fromExpression(splitFilters.f0);
  this.columnStatsProbe = ColumnStatsProbe.newInstance(splitFilters.f0);
  this.partitionPruner = createPartitionPruner(splitFilters.f1, columnStatsProbe);
  this.dataBucket = getDataBucket(splitFilters.f0);
  // refuse all the filters now
  return SupportsFilterPushDown.Result.of(new ArrayList<>(splitFilters.f1), new ArrayList<>(filters));
}
```
What this PR does is get and pass the hashing value to `getFilesInPartitions`, then compute numBuckets, and finally compute the final bucket id as `hashing value % numBuckets`.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2025037864 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -45,7 +45,7 @@ public class PrimaryKeyPruners { public static final int BUCKET_ID_NO_PRUNING = -1; - public static int getBucketId(List hashKeyFilters, Configuration conf) { + public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) {
Review Comment: Sorry Danny, I didn't get this. Is it possible to get the full partition path and use it in this new bucketIdFunc at the original code's call site?
```
@Override
public Result applyFilters(List<ResolvedExpression> filters) {
  List<ResolvedExpression> simpleFilters = filterSimpleCallExpression(filters);
  Tuple2<List<ResolvedExpression>, List<ResolvedExpression>> splitFilters = splitExprByPartitionCall(simpleFilters, this.partitionKeys, this.tableRowType);
  this.predicates = ExpressionPredicates.fromExpression(splitFilters.f0);
  this.columnStatsProbe = ColumnStatsProbe.newInstance(splitFilters.f0);
  this.partitionPruner = createPartitionPruner(splitFilters.f1, columnStatsProbe);
  this.dataBucket = getDataBucket(splitFilters.f0);
  // refuse all the filters now
  return SupportsFilterPushDown.Result.of(new ArrayList<>(splitFilters.f1), new ArrayList<>(filters));
}
```
What this PR does is get and pass the hashing value to `getFilesInPartitions`, then compute numBuckets, and finally compute the final bucket id as `hashing value % numBuckets`.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
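A minimal sketch of the flow described in the last sentence above, under stated assumptions: `resolveNumBuckets` is a hypothetical stand-in for the PR's per-partition bucket-number resolution (NumBucketsFunction), the modulo rule mirrors the `hashing value % numBuckets` step, and `BucketIdentifier.bucketIdFromFileId` is assumed to extract the bucket id encoded in a bucket-index file id:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.apache.hudi.index.bucket.BucketIdentifier;

public class BucketPruningSketch {

  // Turn the hashing value derived from the pushed-down primary-key filters into a
  // partitionPath -> bucketId function; with a partition-level bucket index the bucket
  // number can differ per partition, so it is resolved per partition path.
  static Function<String, Integer> bucketIdFunc(int hashingValue, Function<String, Integer> resolveNumBuckets) {
    return partitionPath -> Math.floorMod(hashingValue, resolveNumBuckets.apply(partitionPath));
  }

  // Keep only the file groups of the target bucket within one partition.
  static List<String> pruneFileIds(List<String> fileIdsInPartition, int bucketId) {
    return fileIdsInPartition.stream()
        .filter(fileId -> BucketIdentifier.bucketIdFromFileId(fileId) == bucketId)
        .collect(Collectors.toList());
  }
}
```

This is only an illustration of the shape of the computation; the PR wires the same idea through `FileIndex` and `getFilesInPartitions`.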
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
zhangyue19921010 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2025053222 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) {
Review Comment: The method `getHashingConfigInstantToLoadBeforeOrOn` is designed to handle Time Travel scenarios. For example, suppose we have a sequence of commits and replace-commits: C1, C2, R3 (a bucket rescale operation), C4, R5 (another bucket rescale). If we perform a Time Travel query targeting commit C4 (i.e., specifiedQueryInstant = C4), maybe we should load the hashing configuration associated with R3 (the latest bucket rescale operation before or equal to C4).
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021975825 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java: ## @@ -68,25 +69,30 @@ public class FileIndex implements Serializable { private final org.apache.hadoop.conf.Configuration hadoopConf; private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning private final ColumnStatsProbe colStatsProbe; // for probing column stats - private final int dataBucket; // for bucket pruning + private final int dataBucketHashing; // for bucket pruning private List partitionPaths;// cache of partition paths private final FileStatsIndex fileStatsIndex;// for data skipping + private final NumBucketsFunction numBucketsFunction; private FileIndex( StoragePath path, Configuration conf, RowType rowType, ColumnStatsProbe colStatsProbe, PartitionPruners.PartitionPruner partitionPruner, - int dataBucket) { + int dataBucketHashing, + NumBucketsFunction numBucketsFunction) { this.path = path; this.hadoopConf = HadoopConfigurations.getHadoopConf(conf); this.tableExists = StreamerUtil.tableExists(path.toString(), hadoopConf); this.metadataConfig = StreamerUtil.metadataConfig(conf); this.colStatsProbe = isDataSkippingFeasible(conf.get(FlinkOptions.READ_DATA_SKIPPING_ENABLED)) ? colStatsProbe : null; this.partitionPruner = partitionPruner; -this.dataBucket = dataBucket; Review Comment: Can we still use the bucket id here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021975643 ## hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java: ## @@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie /** * Get Latest committed hashing config instant to load. + * If instant is empty, then return latest hashing config instant */ - public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) { + public static Option getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option instant) { try { List allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient); - return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1); + if (instant.isPresent()) { +Option res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get()); +// fall back to look up archived hashing config instant before return empty +return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get()); + } else { +return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1)); + } } catch (Exception e) { throw new HoodieException("Failed to get hashing config instant to load.", e); } } + private static Option getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, String instant) {
Review Comment: We should always load the latest bucket config to comply with the latest bucket id mappings. Otherwise the query would fail. We do not ensure snapshot isolation here, because reads can be affected by concurrent writes.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021978502 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala: ## @@ -103,6 +104,15 @@ case class HoodieFileIndex(spark: SparkSession, endCompletionTime = options.get(DataSourceReadOptions.END_COMMIT.key)) with FileIndex { @transient protected var hasPushedDownPartitionPredicates: Boolean = false + private val isPartitionSimpleBucketIndex = PartitionBucketIndexUtils.isPartitionSimpleBucketIndex(spark.sparkContext.hadoopConfiguration, +metaClient.getBasePath.toString) + + @transient private lazy val bucketIndexSupport = if (isPartitionSimpleBucketIndex) { +val specifiedQueryInstant = options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant) +new PartitionBucketIndexSupport(spark, metadataConfig, metaClient, specifiedQueryInstant) Review Comment: always query from the latest hash config because there is no SI for reader/writers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021978135 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestPartitionBucketPruning.java: ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.table; + +import org.apache.hudi.common.model.PartitionBucketIndexHashingConfig; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.index.bucket.BucketIdentifier; +import org.apache.hudi.source.prune.PrimaryKeyPruners; +import org.apache.hudi.storage.StoragePath; +import org.apache.hudi.storage.StoragePathInfo; +import org.apache.hudi.util.SerializableSchema; +import org.apache.hudi.util.StreamerUtil; +import org.apache.hudi.utils.TestConfigurations; +import org.apache.hudi.utils.TestData; + +import org.apache.flink.configuration.Configuration; +import org.apache.flink.table.api.DataTypes; +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.types.DataType; +import org.apache.hadoop.fs.Path; +import org.hamcrest.CoreMatchers; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.core.Is.is; + +public class TestPartitionBucketPruning { Review Comment: Should we move the tests into `TestHoodieTableSource`, can we at least add a IT test for the query result validation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]
danny0405 commented on code in PR #13060: URL: https://github.com/apache/hudi/pull/13060#discussion_r2021977009 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java: ## @@ -45,7 +45,7 @@ public class PrimaryKeyPruners { public static final int BUCKET_ID_NO_PRUNING = -1; - public static int getBucketId(List hashKeyFilters, Configuration conf) { + public static int getBucketFieldHashing(List hashKeyFilters, Configuration conf) {
Review Comment: Can we still return the bucket id? We can add a new param for the function:
```java
// the input of bucketIdFunc is the partition path
getBucketId(List<ResolvedExpression> hashKeyFilters, Configuration conf, Function<String, Integer> bucketIdFunc)
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org