Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-05 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026285922


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -103,6 +104,15 @@ case class HoodieFileIndex(spark: SparkSession,
   endCompletionTime = options.get(DataSourceReadOptions.END_COMMIT.key))
     with FileIndex {
 
   @transient protected var hasPushedDownPartitionPredicates: Boolean = false
+  private val isPartitionSimpleBucketIndex = PartitionBucketIndexUtils.isPartitionSimpleBucketIndex(spark.sparkContext.hadoopConfiguration,
+    metaClient.getBasePath.toString)
+
+  @transient private lazy val bucketIndexSupport = if (isPartitionSimpleBucketIndex) {
+    val specifiedQueryInstant = options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant)
+    new PartitionBucketIndexSupport(spark, metadataConfig, metaClient, specifiedQueryInstant)

Review Comment:
   Same as `TimeTravel` scenarios mentioned above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-05 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021979593


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);

Review Comment:
   Can we assume there are always config files once the partition bucket index is enabled?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-05 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026498805


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1321,6 +1321,49 @@ void testBulkInsertWithPartitionBucketIndex(String operationType, String tableTy
     assertEquals(expected.stream().sorted().collect(Collectors.toList()), actual.stream().sorted().collect(Collectors.toList()));
   }
 
+  @Test
+  void tesQueryWithPartitionBucketIndexPruning() {
+    String operationType = "upsert";
+    String tableType = "MERGE_ON_READ";
+    TableEnvironment tableEnv = batchTableEnv;
+    // csv source
+    String csvSourceDDL = TestConfigurations.getCsvSourceDDL("csv_source", "test_source_5.data");
+    tableEnv.executeSql(csvSourceDDL);
+    String catalogName = "hudi_" + operationType;
+    String hudiCatalogDDL = catalog(catalogName)
+        .catalogPath(tempFile.getAbsolutePath())
+        .end();
+
+    tableEnv.executeSql(hudiCatalogDDL);
+    String dbName = "hudi";
+    tableEnv.executeSql("create database " + catalogName + "." + dbName);
+    String basePath = tempFile.getAbsolutePath() + "/hudi/hoodie_sink";
+
+    String hoodieTableDDL = sql(catalogName + ".hudi.hoodie_sink")
+        .option(FlinkOptions.PATH, basePath)
+        .option(FlinkOptions.TABLE_TYPE, tableType)
+        .option(FlinkOptions.OPERATION, operationType)
+        .option(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT, true)
+        .option(FlinkOptions.INDEX_TYPE, "BUCKET")
+        .option(FlinkOptions.HIVE_STYLE_PARTITIONING, "true")
+        .option(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, "1")
+        .option(FlinkOptions.BUCKET_INDEX_PARTITION_RULE, "regex")
+        .option(FlinkOptions.BUCKET_INDEX_PARTITION_EXPRESSIONS, "partition=(par1|par2),2")
+        .end();
+    tableEnv.executeSql(hoodieTableDDL);
+
+    String insertInto = "insert into " + catalogName + ".hudi.hoodie_sink select * from csv_source";
+    execInsertSql(tableEnv, insertInto);
+
+    List result1 = CollectionUtil.iterableToList(

Review Comment:
   use `execSelectSql(TableEnvironment tEnv, String select)`
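   Not part of this PR; a minimal sketch of what such a helper typically does with the standard Flink `TableResult` API, so the intent of the suggestion is clear (the helper name `collectRows` is illustrative):
   ```java
   import java.util.ArrayList;
   import java.util.List;

   import org.apache.flink.table.api.TableEnvironment;
   import org.apache.flink.types.Row;
   import org.apache.flink.util.CloseableIterator;

   // Hedged sketch: run the query and collect the result rows, which is what a
   // helper like execSelectSql(TableEnvironment tEnv, String select) can wrap,
   // instead of CollectionUtil.iterableToList(...) around the raw iterator.
   static List<Row> collectRows(TableEnvironment tableEnv, String select) throws Exception {
     List<Row> rows = new ArrayList<>();
     try (CloseableIterator<Row> it = tableEnv.executeSql(select).collect()) {
       it.forEachRemaining(rows::add);
     }
     return rows;
   }
   ```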



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-05 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026283652


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestPartitionBucketPruning.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.model.PartitionBucketIndexHashingConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.source.prune.PrimaryKeyPruners;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.hudi.storage.StoragePathInfo;
+import org.apache.hudi.util.SerializableSchema;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.TestData;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.table.api.DataTypes;
+import org.apache.flink.table.expressions.CallExpression;
+import org.apache.flink.table.expressions.FieldReferenceExpression;
+import org.apache.flink.table.expressions.ResolvedExpression;
+import org.apache.flink.table.expressions.ValueLiteralExpression;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.types.DataType;
+import org.apache.hadoop.fs.Path;
+import org.hamcrest.CoreMatchers;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.core.Is.is;
+
+public class TestPartitionBucketPruning {

Review Comment:
   Changed. Also added an IT named `tesQueryWithPartitionBucketIndexPruning` to do query result validation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-05 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776205596

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603)
 
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 merged PR #13060:
URL: https://github.com/apache/hudi/pull/13060


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780207733

   > I have made another patch to fix the bucket pruning logging: 
[fix_the_bucket_pruning_logging.patch.zip](https://github.com/user-attachments/files/19612894/fix_the_bucket_pruning_logging.patch.zip)
   
   Done. Also CI passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780204472

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780163840

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: 
[CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670)
 
   * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780163078

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659)
 
   * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670)
 
   * ecd6bb61d253d969a26fd97b7c52d5d79c5c02aa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780135335

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659)
 
   * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4670)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780130591

   I have made another patch to fix the bucket pruning logging: 
   
[13060_2.patch.zip](https://github.com/user-attachments/files/19612840/13060_2.patch.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2780128938

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659)
 
   * 69bfd3369acf7ebabb3b58055ab583c1f2e2ff23 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2029646256


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -157,28 +157,34 @@ public List<StoragePathInfo> getFilesInPartitions() {
     if (partitions.length < 1) {
       return Collections.emptyList();
     }
-    List<StoragePathInfo> allFiles = FSUtils.getFilesInPartitions(
-        new HoodieFlinkEngineContext(hadoopConf),
-        new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions)
-        .values().stream()
-        .flatMap(Collection::stream)
-        .collect(Collectors.toList());
+    Map<String, List<StoragePathInfo>> partition2Files = FSUtils.getFilesInPartitions(
+        new HoodieFlinkEngineContext(hadoopConf),
+        new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions);
+
+    List<StoragePathInfo> allFiles;
+    // bucket pruning
+    if (this.partitionBucketIdFunc != null) {
+      allFiles = partition2Files.entrySet().stream().flatMap(entry -> {
+        String partitionPath = entry.getKey();
+        int bucketId = partitionBucketIdFunc.apply(partitionPath);
+        String bucketIdStr = BucketIdentifier.bucketIdStr(bucketId);
+        List<StoragePathInfo> filesInPartition = entry.getValue();
+        return filesInPartition.stream().filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr));
+      }).collect(Collectors.toList());
+    } else {
+      allFiles = FSUtils.getFilesInPartitions(

Review Comment:
   Oops, changed this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2029632471


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -157,28 +157,34 @@ public List<StoragePathInfo> getFilesInPartitions() {
     if (partitions.length < 1) {
       return Collections.emptyList();
     }
-    List<StoragePathInfo> allFiles = FSUtils.getFilesInPartitions(
-        new HoodieFlinkEngineContext(hadoopConf),
-        new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions)
-        .values().stream()
-        .flatMap(Collection::stream)
-        .collect(Collectors.toList());
+    Map<String, List<StoragePathInfo>> partition2Files = FSUtils.getFilesInPartitions(
+        new HoodieFlinkEngineContext(hadoopConf),
+        new HoodieHadoopStorage(path, HadoopFSUtils.getStorageConf(hadoopConf)), metadataConfig, path.toString(), partitions);
+
+    List<StoragePathInfo> allFiles;
+    // bucket pruning
+    if (this.partitionBucketIdFunc != null) {
+      allFiles = partition2Files.entrySet().stream().flatMap(entry -> {
+        String partitionPath = entry.getKey();
+        int bucketId = partitionBucketIdFunc.apply(partitionPath);
+        String bucketIdStr = BucketIdentifier.bucketIdStr(bucketId);
+        List<StoragePathInfo> filesInPartition = entry.getValue();
+        return filesInPartition.stream().filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr));
+      }).collect(Collectors.toList());
+    } else {
+      allFiles = FSUtils.getFilesInPartitions(

Review Comment:
   should use `partition2Files` to avoid redundant file fetching.
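   Not from the patch itself; a minimal sketch of the suggested fix, flattening the already-fetched map in the non-pruning branch (the `StoragePathInfo` element type follows the surrounding diff; the helper name is illustrative):
   ```java
   import java.util.Collection;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   import org.apache.hudi.storage.StoragePathInfo;

   // Hedged sketch: reuse partition2Files instead of calling
   // FSUtils.getFilesInPartitions a second time in the else branch.
   static List<StoragePathInfo> flattenAllFiles(Map<String, List<StoragePathInfo>> partition2Files) {
     return partition2Files.values().stream()
         .flatMap(Collection::stream)
         .collect(Collectors.toList());
   }
   ```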



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026506692


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -179,7 +180,7 @@ public HoodieTableSource(
   @Nullable List predicates,
   @Nullable ColumnStatsProbe columnStatsProbe,
   @Nullable PartitionPruners.PartitionPruner partitionPruner,
-  int dataBucketHashing,
+  Option> dataBucket,

Review Comment:
   dataBucket -> dataBucketFunc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026507577


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -698,12 +699,7 @@ public ColumnStatsProbe getColumnStatsProbe() {
   }
 
   @VisibleForTesting
-  public int getDataBucket() {
-    return BucketIdentifier.getBucketId(this.dataBucketHashing, conf.getInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS));
-  }
-
-  @VisibleForTesting
-  public int getDataBucketHashing() {
-    return dataBucketHashing;
+  public Option> getDataBucket() {

Review Comment:
   getDataBucketFunc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778649224

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651)
 
   * 2d79e40b46ab9af5b518314c9031bfc46434841d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2025864002


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {
 
   public static final int BUCKET_ID_NO_PRUNING = -1;
 
-  public static int getBucketId(List<ResolvedExpression> hashKeyFilters, Configuration conf) {
+  public static int getBucketFieldHashing(List<ResolvedExpression> hashKeyFilters, Configuration conf) {

Review Comment:
   Maybe we can evolve this dataBucket into a function `(num_buckets_per_partition) -> (int)bucketId` to make it somewhat more flexible.
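   Not the actual PR code; a rough sketch of the proposed shape, reusing the two-argument `BucketIdentifier.getBucketId(hashing, numBuckets)` call quoted elsewhere in this thread. The helper name and the use of `java.util.function.Function` are assumptions:
   ```java
   import java.util.function.Function;

   import org.apache.hudi.common.util.Option;
   import org.apache.hudi.index.bucket.BucketIdentifier;
   import org.apache.hudi.source.prune.PrimaryKeyPruners;

   // Hedged sketch: instead of a precomputed bucket id, keep a function from the
   // per-partition bucket number to the bucket id, so pruning can resolve the id
   // lazily once the partition-level bucket number is known.
   static Option<Function<Integer, Integer>> toDataBucketFunc(int bucketFieldHashing) {
     if (bucketFieldHashing == PrimaryKeyPruners.BUCKET_ID_NO_PRUNING) {
       return Option.empty();
     }
     return Option.of(numBucketsPerPartition ->
         BucketIdentifier.getBucketId(bucketFieldHashing, numBucketsPerPartition));
   }
   ```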



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026506290


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -158,7 +158,7 @@ public class HoodieTableSource implements
   private List predicates;
   private ColumnStatsProbe columnStatsProbe;
   private PartitionPruners.PartitionPruner partitionPruner;
-  private int dataBucketHashing;
+  private Option> dataBucket;

Review Comment:
   dataBucket -> dataBucketFunc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   Yes, Danny. After a bucket rescale is completed, the data layout changes. Therefore, this matters for Spark's Time Travel, i.e. traveling to the snapshot view at a specific point in time (https://hudi.apache.org/docs/sql_queries#time-travel-query).
   
   When executing a query like `SELECT * FROM  TIMESTAMP AS OF  WHERE `, Hudi initializes `specifiedQueryTimestamp` through HoodieBaseRelation:
   ```
   protected lazy val specifiedQueryTimestamp: Option[String] =
     optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
       .map(HoodieSqlCommonUtils.formatQueryInstant)
   ```
   It then gets the `Schema` and builds the `fsView` based on `specifiedQueryTimestamp`.
   
   To construct the FsView, Hudi calls `getLatestMergedFileSlicesBeforeOrOn(String partitionStr, String maxInstantTime)`, traveling the file system view to the specified version. At this point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp to ensure the historical data layout:
   ```
   protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = {
     queryTimestamp match {
       case Some(ts) =>
         specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t))

         val partitionDirs = if (globPaths.isEmpty) {
           fileIndex.listFiles(partitionFilters, dataFilters)
         } else {
           val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths)
           inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
         }

         val fsView = new HoodieTableFileSystemView(
           metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
             .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus))
             .asJava)

         fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
           val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
           fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala
         }.toSeq

       case _ => Seq()
     }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779329892

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028871226


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   For example:
   
   DeltaCommit1 ==> writes `C1_File1, C1_File2`
   Bucket-Rescale Commit2 ==> writes `C2_File1, C2_File2, C2_File3` (replacing `C1_File1, C1_File2`)
   DeltaCommit3 ==> writes `C3_File1_Log1`
   Bucket-Rescale Commit4 ==> writes `C4_File1` (replacing `C2_File1, C2_File2, C2_File3` and `C3_File1_Log1`)
   
   For the SQL `SELECT * FROM hudi_table TIMESTAMP AS OF `, we need to load the hashing config of Bucket-Rescale Commit2 instead of the latest one from Bucket-Rescale Commit4.
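   Not the PR's exact code; a small sketch of the "before or on" lookup described above, assuming committed hashing config instants are sorted in ascending order and comparable as timestamp strings:
   ```java
   import java.util.List;

   import org.apache.hudi.common.util.Option;

   // Hedged sketch: pick the greatest hashing config instant that is not after the
   // time-travel instant, e.g. Commit2 (not Commit4) for a query AS OF Commit2.
   static Option<String> latestInstantBeforeOrOn(List<String> sortedInstants, String queryInstant) {
     Option<String> result = Option.empty();
     for (String instant : sortedInstants) {
       if (instant.compareTo(queryInstant) <= 0) {
         result = Option.of(instant); // keep the latest qualifying instant seen so far
       } else {
         break; // instants are ascending, so no later one can qualify
       }
     }
     return result;
   }
   ```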



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778828484

   > Thanks for the contribution, I have made some refactoring with the given 
patch:
   > 
   > 
[13060.patch.zip](https://github.com/user-attachments/files/19602219/13060.patch.zip)
   
   Thanks for your help, Danny. All changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774624280

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552)
 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553)
 
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602)
 
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2777995679

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619)
 
   * b53f457aa1ef63f665add37a75d4f8e9a614780c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779015664

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779161934

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656)
 
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4659)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2779038403

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656)
 
   * de0276c24c4b560b6564e5320ea4fb6ab9f8cdbe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028875514


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) {
+  private Option> getDataBucketFunc(List<ResolvedExpression> dataFilters) {
     if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
-      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+      return Option.empty();
     }
     Set<String> indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet());
     List<ResolvedExpression> indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList());
     if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) {
-      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+      return Option.empty();

Review Comment:
   Added a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")`.
   Without this PR it throws an exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```
   
   With this PR but using the `always load latest hashing config` logic, it throws the same exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   Yes, Danny. After a bucket rescale is completed, the data layout changes. Therefore, this matters for Spark's Time Travel (not Incremental Query), i.e. traveling to the snapshot view at a specific point in time (https://hudi.apache.org/docs/sql_queries#time-travel-query).
   
   When executing a query like `SELECT * FROM  TIMESTAMP AS OF  WHERE `, Hudi initializes `specifiedQueryTimestamp` through HoodieBaseRelation:
   ```
   protected lazy val specifiedQueryTimestamp: Option[String] =
     optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
       .map(HoodieSqlCommonUtils.formatQueryInstant)
   ```
   It then gets the `Schema` and builds the `fsView` based on `specifiedQueryTimestamp`.
   
   To construct the FsView, Hudi calls `getLatestMergedFileSlicesBeforeOrOn(String partitionStr, String maxInstantTime)`, traveling the file system view to the specified version. At this point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp to ensure the historical data layout:
   ```
   protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = {
     queryTimestamp match {
       case Some(ts) =>
         specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t))

         val partitionDirs = if (globPaths.isEmpty) {
           fileIndex.listFiles(partitionFilters, dataFilters)
         } else {
           val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths)
           inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
         }

         val fsView = new HoodieTableFileSystemView(
           metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
             .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus))
             .asJava)

         fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
           val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
           fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala
         }.toSeq

       case _ => Seq()
     }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   Yes, Danny. After a bucket rescale is completed, the data layout changes. Therefore, this matters for Spark's Time Travel (not Incremental Query), i.e. traveling to the snapshot view at a specific point in time (https://hudi.apache.org/docs/sql_queries#time-travel-query).
   
   When executing a query like `SELECT * FROM  TIMESTAMP AS OF  WHERE `, the implementation specifies `specifiedQueryTimestamp` through HoodieBaseRelation:
   ```
   protected lazy val specifiedQueryTimestamp: Option[String] =
     optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
       .map(HoodieSqlCommonUtils.formatQueryInstant)
   ```
   
   When constructing the FsView, Hudi calls `getLatestMergedFileSlicesBeforeOrOn`:
   ```
   protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = {
     queryTimestamp match {
       case Some(ts) =>
         specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t))

         val partitionDirs = if (globPaths.isEmpty) {
           fileIndex.listFiles(partitionFilters, dataFilters)
         } else {
           val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths)
           inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
         }

         val fsView = new HoodieTableFileSystemView(
           metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
             .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus))
             .asJava)

         fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
           val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
           fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala
         }.toSeq

       case _ => Seq()
     }
   }
   ```
   This travels the view to the specified version. At this point, it is also necessary to load the corresponding hashing_config that was valid at that specific timestamp to ensure the historical data layout.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028875514


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner createPartitionPruner(List dataFilters) {
+  private Option> getDataBucketFunc(List<ResolvedExpression> dataFilters) {
     if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
-      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+      return Option.empty();
     }
     Set<String> indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet());
     List<ResolvedExpression> indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList());
     if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) {
-      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+      return Option.empty();

Review Comment:
   Added a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")`.
   Without this PR it throws an exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```
   
   With this PR but using the `always load latest hashing config` logic, it throws the same exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```



##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   Added a new Spark UT `test("Test BucketID Pruning With Partition Bucket Index")`.
   Without this PR it throws an exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```
   
   With this PR but using the `always load latest hashing config` logic, it throws the same exception:
   ```
   Expected Array([,.0,,2021-01-05]), but got Array()
   ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
   org.scalatest.exceptions.TestFailedException: Expected Array([,.0,,2021-01-05]), but got Array()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028871226


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option loadHashingConfig(Hoodie
 
   /**
    * Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
    */
-  public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
     try {
       List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
-      return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1);
+      if (instant.isPresent()) {
+        Option<String> res = getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, instant.get());
+        // fall back to look up archived hashing config instant before return empty
+        return res.isPresent() ? res : getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient), instant.get());
+      } else {
+        return Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+      }
     } catch (Exception e) {
       throw new HoodieException("Failed to get hashing config instant to load.", e);
     }
   }
 
+  private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {

Review Comment:
   For example:
   
   DeltaCommit1 ==> `C1_File1, C1_File2`
   Bucket-Rescale Commit2 ==> `C2_File1, C2_File2, C2_File3` (replacing `C1_File1, C1_File2`)
   DeltaCommit3 ==> `C3_File1_Log1`
   Bucket-Rescale Commit4 ==> `C4_File1` (replacing `C2_File1, C2_File2, C2_File3` and `C3_File1_Log1`)
   
   For the SQL `SELECT * FROM hudi_table TIMESTAMP AS OF `, we need to load the hashing config of Bucket-Rescale Commit2 instead of the latest one from Bucket-Rescale Commit4.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2028869779


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option 
loadHashingConfig(Hoodie
 
   /**
* Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
*/
-  public static String 
getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option 
getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option 
instant) {
 try {
   List allCommittedHashingConfig = 
getCommittedHashingConfigInstants(metaClient);
-  return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 
1);
+  if (instant.isPresent()) {
+Option res = 
getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, 
instant.get());
+// fall back to look up archived hashing config instant before return 
empty
+return res.isPresent() ? res : 
getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient),
 instant.get());
+  } else {
+return 
Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+  }
 } catch (Exception e) {
   throw new HoodieException("Failed to get hashing config instant to 
load.", e);
 }
   }
 
+  private static Option 
getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, 
String instant) {

Review Comment:
   Yes, Danny. After a Bucket Rescale completes, the data layout changes. Therefore, for Spark's Time Travel (not Incremental Query, see https://hudi.apache.org/docs/sql_queries#time-travel-query), e.g. a query like `SELECT * FROM  TIMESTAMP AS OF  WHERE `, the implementation specifies the specifiedQueryTimestamp through HoodieBaseRelation:
   ```
 protected lazy val specifiedQueryTimestamp: Option[String] =
   optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
 .map(HoodieSqlCommonUtils.formatQueryInstant)
   ```
   
   When constructing the FsView, `getLatestMergedFileSlicesBeforeOrOn` will be called:
   ```
 protected def listLatestFileSlices(globPaths: Seq[StoragePath], 
partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): 
Seq[FileSlice] = {
   queryTimestamp match {
 case Some(ts) =>
   specifiedQueryTimestamp.foreach(t => 
validateTimestampAsOf(metaClient, t))
   
   val partitionDirs = if (globPaths.isEmpty) {
 fileIndex.listFiles(partitionFilters, dataFilters)
   } else {
 val inMemoryFileIndex = 
HoodieInMemoryFileIndex.create(sparkSession, globPaths)
 inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
   }
   
   val fsView = new HoodieTableFileSystemView(
 metaClient, timeline, 
sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
   .map(fileStatus => 
HadoopFSUtils.convertToStoragePathInfo(fileStatus))
   .asJava)
   
   fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
 val relativePath = 
getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
 fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, 
ts).iterator().asScala
   }.toSeq
   
 case _ => Seq()
   }
 }
   ```
   This travels the view to the specified version. At this point, it is also necessary to load the hashing_config that was valid at that specific timestamp, so the historical data layout is honored.
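
   As a rough usage sketch (assumptions: a local `SparkSession`, an illustrative base path and instant value, and the `as.of.instant` read option key behind `DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT`), this is the kind of read that populates `specifiedQueryTimestamp`:
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class TimeTravelReadSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .master("local[1]")
           .appName("time-travel-read-sketch")
           .getOrCreate();
   
       // The instant value is a placeholder; it should be a completed commit time of the table.
       Dataset<Row> asOfInstant = spark.read()
           .format("hudi")
           .option("as.of.instant", "20250101000000000") // TIMESTAMP AS OF <instant>
           .load("/tmp/hudi_table");                     // illustrative base path
   
       asOfInstant.show();
       spark.stop();
     }
   }
   ```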



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778780676

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651)
 
   * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654)
 
   * b3a978b204ecd08df87cf4116647ce95088eedc0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778784721

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: 
[CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654)
 
   * b3a978b204ecd08df87cf4116647ce95088eedc0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778796309

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: 
[CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654)
 
   * b3a978b204ecd08df87cf4116647ce95088eedc0 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4656)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778652838

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651)
 
   * 2d79e40b46ab9af5b518314c9031bfc46434841d Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4654)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778359133

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


danny0405 commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778250151

   Thanks for the contribution, I have made some refactoring; see the attached patch:
   
   
[13060.patch.zip](https://github.com/user-attachments/files/19602219/13060.patch.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-04 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2778119931

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619)
 
   * b53f457aa1ef63f665add37a75d4f8e9a614780c Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4651)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027921234


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option 
loadHashingConfig(Hoodie
 
   /**
* Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
*/
-  public static String 
getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option 
getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option 
instant) {
 try {
   List allCommittedHashingConfig = 
getCommittedHashingConfigInstants(metaClient);
-  return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 
1);
+  if (instant.isPresent()) {
+Option res = 
getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, 
instant.get());
+// fall back to look up archived hashing config instant before return 
empty
+return res.isPresent() ? res : 
getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient),
 instant.get());
+  } else {
+return 
Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+  }
 } catch (Exception e) {
   throw new HoodieException("Failed to get hashing config instant to 
load.", e);
 }
   }
 
+  private static Option 
getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, 
String instant) {

Review Comment:
   > Maybe we should load the hashing configuration associated with R3 (the 
latest bucket rescale operation before or equal to C4).
   
   We cannot do that because the data file layouts have already been changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027283054


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner 
createPartitionPruner(List dataFilters) {
+  private Option> 
getDataBucketFunc(List dataFilters) {
 if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
-  return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+  return Option.empty();

Review Comment:
   > It looks like the return value is always non-empty, can we just return the function instead of the option?
   
   `getDataBucketFunc` may return `Option.empty()` here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027280057


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -69,7 +70,7 @@ public class FileIndex implements Serializable {
   private final org.apache.hadoop.conf.Configuration hadoopConf;
   private final PartitionPruners.PartitionPruner partitionPruner; // for 
partition pruning
   private final ColumnStatsProbe colStatsProbe;   // for 
probing column stats
-  private final int dataBucketHashing;   // 
for bucket pruning
+  private final Option> dataBucket;  
 // for bucket pruning

Review Comment:
   changed



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -158,7 +158,7 @@ public class HoodieTableSource implements
   private List predicates;
   private ColumnStatsProbe columnStatsProbe;
   private PartitionPruners.PartitionPruner partitionPruner;
-  private int dataBucketHashing;
+  private Option> dataBucket;

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027283458


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -365,16 +368,16 @@ private PartitionPruners.PartitionPruner 
createPartitionPruner(List dataFilters) {
+  private Option> 
getDataBucketFunc(List dataFilters) {
 if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
-  return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+  return Option.empty();
 }
 Set indexKeyFields = 
Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet());
 List indexKeyFilters = 
dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, 
indexKeyFields)).collect(Collectors.toList());
 if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, 
indexKeyFields)) {
-  return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+  return Option.empty();

Review Comment:
   > It looks like the return value is always non-empty, can we just return the function instead of the option?
   
   `getDataBucketFunc` may also return `Option.empty()` here.
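
   To illustrate the Option-returning shape being discussed, here is a simplified sketch with hypothetical names (not the PR's actual `getDataBucketFunc`): a bucket function is produced only when the bucket index is enabled and every index key field is covered by an equality filter; otherwise an empty result is returned:
   ```java
   import java.util.Optional;
   import java.util.Set;
   import java.util.function.IntUnaryOperator;
   
   class DataBucketFuncSketch {
     // hashingOfKeyFilters is the hash derived from the equality filters on the index keys.
     static Optional<IntUnaryOperator> getDataBucketFunc(boolean bucketIndexEnabled,
                                                         Set<String> indexKeyFields,
                                                         Set<String> equalityFilterFields,
                                                         int hashingOfKeyFilters) {
       if (!bucketIndexEnabled || equalityFilterFields.isEmpty()
           || !equalityFilterFields.containsAll(indexKeyFields)) {
         return Optional.empty(); // no bucket pruning possible
       }
       // Given the bucket number of a concrete partition, map the hash to a bucket id.
       return Optional.of(numBuckets -> Math.floorMod(hashingOfKeyFilters, numBuckets));
     }
   }
   ```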



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776622457

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776364946

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603)
 
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4619)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027280488


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -179,7 +180,7 @@ public HoodieTableSource(
   @Nullable List predicates,
   @Nullable ColumnStatsProbe columnStatsProbe,
   @Nullable PartitionPruners.PartitionPruner partitionPruner,
-  int dataBucketHashing,
+  Option> dataBucket,

Review Comment:
   all changed



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##
@@ -698,12 +699,7 @@ public ColumnStatsProbe getColumnStatsProbe() {
   }
 
   @VisibleForTesting
-  public int getDataBucket() {
-return BucketIdentifier.getBucketId(this.dataBucketHashing, 
conf.getInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS));
-  }
-
-  @VisibleForTesting
-  public int getDataBucketHashing() {
-return dataBucketHashing;
+  public Option> getDataBucket() {

Review Comment:
   changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2027284301


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1321,6 +1321,49 @@ void testBulkInsertWithPartitionBucketIndex(String 
operationType, String tableTy
 assertEquals(expected.stream().sorted().collect(Collectors.toList()), 
actual.stream().sorted().collect(Collectors.toList()));
   }
 
+  @Test
+  void tesQueryWithPartitionBucketIndexPruning() {
+String operationType = "upsert";
+String tableType = "MERGE_ON_READ";
+TableEnvironment tableEnv = batchTableEnv;
+// csv source
+String csvSourceDDL = TestConfigurations.getCsvSourceDDL("csv_source", 
"test_source_5.data");
+tableEnv.executeSql(csvSourceDDL);
+String catalogName = "hudi_" + operationType;
+String hudiCatalogDDL = catalog(catalogName)
+.catalogPath(tempFile.getAbsolutePath())
+.end();
+
+tableEnv.executeSql(hudiCatalogDDL);
+String dbName = "hudi";
+tableEnv.executeSql("create database " + catalogName + "." + dbName);
+String basePath = tempFile.getAbsolutePath() + "/hudi/hoodie_sink";
+
+String hoodieTableDDL = sql(catalogName + ".hudi.hoodie_sink")
+.option(FlinkOptions.PATH, basePath)
+.option(FlinkOptions.TABLE_TYPE, tableType)
+.option(FlinkOptions.OPERATION, operationType)
+.option(FlinkOptions.WRITE_BULK_INSERT_SHUFFLE_INPUT, true)
+.option(FlinkOptions.INDEX_TYPE, "BUCKET")
+.option(FlinkOptions.HIVE_STYLE_PARTITIONING, "true")
+.option(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, "1")
+.option(FlinkOptions.BUCKET_INDEX_PARTITION_RULE, "regex")
+.option(FlinkOptions.BUCKET_INDEX_PARTITION_EXPRESSIONS, 
"partition=(par1|par2),2")
+.end();
+tableEnv.executeSql(hoodieTableDDL);
+
+String insertInto = "insert into " + catalogName + ".hudi.hoodie_sink 
select * from csv_source";
+execInsertSql(tableEnv, insertInto);
+
+List result1 = CollectionUtil.iterableToList(

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2776210253

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603)
 
   * 6b38a5b0831e624603ada818aa19f7368336dee1 UNKNOWN
   * 1610e8ca29bc433cda1a20d7d31d98e2d1bd7bd5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026501136


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -69,7 +70,7 @@ public class FileIndex implements Serializable {
   private final org.apache.hadoop.conf.Configuration hadoopConf;
   private final PartitionPruners.PartitionPruner partitionPruner; // for 
partition pruning
   private final ColumnStatsProbe colStatsProbe;   // for 
probing column stats
-  private final int dataBucketHashing;   // 
for bucket pruning
+  private final Option> dataBucket;  
 // for bucket pruning

Review Comment:
   dataBucket -> dataBucketFunc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026505878


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -43,9 +45,7 @@
 public class PrimaryKeyPruners {
   private static final Logger LOG = 
LoggerFactory.getLogger(PrimaryKeyPruners.class);
 
-  public static final int BUCKET_ID_NO_PRUNING = -1;
-
-  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {
+  public static Option> 
getBucketId(List hashKeyFilters, Configuration conf) {

Review Comment:
   It looks like the return value is always non-empty, can we just return the function instead of the option?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026504533


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -43,9 +45,7 @@
 public class PrimaryKeyPruners {
   private static final Logger LOG = 
LoggerFactory.getLogger(PrimaryKeyPruners.class);
 
-  public static final int BUCKET_ID_NO_PRUNING = -1;
-
-  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {
+  public static Option> 
getBucketId(List hashKeyFilters, Configuration conf) {

Review Comment:
   getBucketId -> getBucketIdFunc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026500443


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -69,7 +70,7 @@ public class FileIndex implements Serializable {
   private final org.apache.hadoop.conf.Configuration hadoopConf;
   private final PartitionPruners.PartitionPruner partitionPruner; // for 
partition pruning
   private final ColumnStatsProbe colStatsProbe;   // for 
probing column stats
-  private final int dataBucketHashing;   // 
for bucket pruning
+  private final Option> dataBucket;  
 // for bucket pruning

Review Comment:
   Can we use a Java function? Also, can we align the indentation of the comments:
   
   ```java
 // for partition pruning
 // for probing column stats
 // for bucket pruning
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774889183

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-03 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774730016

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: 
[CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602)
 
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4603)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774619228

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552)
 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553)
 
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774626980

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * cbb666d497e8b4ae98e342197ac951f01bbf0a2c Azure: 
[CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4602)
 
   * 95c0d03c1707ea4e7d119f5751d042ad6a742664 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774605936

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552)
 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553)
 
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


hudi-bot commented on PR #13060:
URL: https://github.com/apache/hudi/pull/13060#issuecomment-2774614184

   
   ## CI report:
   
   * cb8f72d1483384de0f72270bdb3bba4001380f8f UNKNOWN
   * 6f0939dfbbcddd07213c638d01933c25fe486941 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4552)
 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=4553)
 
   * 2bafa0c28b3f912a5cdac376387eb1e8f6edaa83 UNKNOWN
   * cbb666d497e8b4ae98e342197ac951f01bbf0a2c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026283113


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {
 
   public static final int BUCKET_ID_NO_PRUNING = -1;
 
-  public static int getBucketId(List hashKeyFilters, 
Configuration conf) {
+  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {

Review Comment:
   Sure, changed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2026282689


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -68,25 +69,30 @@ public class FileIndex implements Serializable {
   private final org.apache.hadoop.conf.Configuration hadoopConf;
   private final PartitionPruners.PartitionPruner partitionPruner; // for 
partition pruning
   private final ColumnStatsProbe colStatsProbe;   // for 
probing column stats
-  private final int dataBucket;   // for 
bucket pruning
+  private final int dataBucketHashing;   // 
for bucket pruning
   private List partitionPaths;// cache of 
partition paths
   private final FileStatsIndex fileStatsIndex;// for data 
skipping
+  private final NumBucketsFunction numBucketsFunction;
 
   private FileIndex(
   StoragePath path,
   Configuration conf,
   RowType rowType,
   ColumnStatsProbe colStatsProbe,
   PartitionPruners.PartitionPruner partitionPruner,
-  int dataBucket) {
+  int dataBucketHashing,
+  NumBucketsFunction numBucketsFunction) {
 this.path = path;
 this.hadoopConf = HadoopConfigurations.getHadoopConf(conf);
 this.tableExists = StreamerUtil.tableExists(path.toString(), hadoopConf);
 this.metadataConfig = StreamerUtil.metadataConfig(conf);
 this.colStatsProbe = 
isDataSkippingFeasible(conf.get(FlinkOptions.READ_DATA_SKIPPING_ENABLED)) ? 
colStatsProbe : null;
 this.partitionPruner = partitionPruner;
-this.dataBucket = dataBucket;

Review Comment:
   Evolved this dataBucket into a function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2025050558


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option 
loadHashingConfig(Hoodie
 
   /**
* Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
*/
-  public static String 
getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option 
getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option 
instant) {
 try {
   List allCommittedHashingConfig = 
getCommittedHashingConfigInstants(metaClient);

Review Comment:
   Yes we can.
   Currently, we have two methods to use partition-level bucket indexing:
   
   1. Enabling via DDL at table creation: this initializes a 0.hashing_config file through the catalog.
   
   2. Upgrading an existing table-level bucket index via the CALL command: for tables that already use table-level bucket indexing, invoking a CALL command triggers an upgrade process, which generates a replace-commit instant and initializes a corresponding .hashing_config file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2025037864


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {
 
   public static final int BUCKET_ID_NO_PRUNING = -1;
 
-  public static int getBucketId(List hashKeyFilters, 
Configuration conf) {
+  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {

Review Comment:
   Sorry Danny, I didn't get this. Is it possible to get the full partition path during the original dataBucket computation?
   ```
 @Override
 public Result applyFilters(List filters) {
   List simpleFilters = 
filterSimpleCallExpression(filters);
   Tuple2, List> splitFilters 
= splitExprByPartitionCall(simpleFilters, this.partitionKeys, 
this.tableRowType);
   this.predicates = ExpressionPredicates.fromExpression(splitFilters.f0);
   this.columnStatsProbe = ColumnStatsProbe.newInstance(splitFilters.f0);
   this.partitionPruner = createPartitionPruner(splitFilters.f1, 
columnStatsProbe);
   this.dataBucket = getDataBucket(splitFilters.f0);
   // refuse all the filters now
   return SupportsFilterPushDown.Result.of(new 
ArrayList<>(splitFilters.f1), new ArrayList<>(filters));
 }
   ```
   
   What this PR does is compute the hashing value and pass it to `getFilesInPartitions`, then compute numBuckets, and finally compute the final bucket id as `hashing value % numBuckets`.
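
   A minimal sketch (hypothetical names, not the PR's code) of that last step, where the bucket number may differ per partition under a partition-level bucket index and the final bucket id is `hashing value % numBuckets`:
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   class PartitionBucketIdSketch {
     // numBucketsPerPartition would come from the hashing config expressions,
     // e.g. "partition=(par1|par2),2" assigns 2 buckets to par1/par2.
     static int bucketIdFor(String partitionPath, int hashing,
                            Map<String, Integer> numBucketsPerPartition, int defaultNumBuckets) {
       int numBuckets = numBucketsPerPartition.getOrDefault(partitionPath, defaultNumBuckets);
       return Math.floorMod(hashing, numBuckets); // final bucket id used for file pruning
     }
   
     public static void main(String[] args) {
       Map<String, Integer> buckets = new HashMap<>();
       buckets.put("partition=par1", 2);
       buckets.put("partition=par2", 2);
       System.out.println(bucketIdFor("partition=par1", 7, buckets, 4)); // 1
       System.out.println(bucketIdFor("partition=par3", 7, buckets, 4)); // 3
     }
   }
   ```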
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2025037864


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {
 
   public static final int BUCKET_ID_NO_PRUNING = -1;
 
-  public static int getBucketId(List hashKeyFilters, 
Configuration conf) {
+  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {

Review Comment:
   Sorry Danny, I didn't get this. Is it possible to get the full partition path and use it in this new bucketIdFunc at the original code's call site?
   ```
 @Override
 public Result applyFilters(List filters) {
   List simpleFilters = 
filterSimpleCallExpression(filters);
   Tuple2, List> splitFilters 
= splitExprByPartitionCall(simpleFilters, this.partitionKeys, 
this.tableRowType);
   this.predicates = ExpressionPredicates.fromExpression(splitFilters.f0);
   this.columnStatsProbe = ColumnStatsProbe.newInstance(splitFilters.f0);
   this.partitionPruner = createPartitionPruner(splitFilters.f1, 
columnStatsProbe);
   this.dataBucket = getDataBucket(splitFilters.f0);
   // refuse all the filters now
   return SupportsFilterPushDown.Result.of(new 
ArrayList<>(splitFilters.f1), new ArrayList<>(filters));
 }
   ```
   
   What this PR does is compute the hashing value and pass it to `getFilesInPartitions`, then compute numBuckets, and finally compute the final bucket id as `hashing value % numBuckets`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-02 Thread via GitHub


zhangyue19921010 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2025053222


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option 
loadHashingConfig(Hoodie
 
   /**
* Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
*/
-  public static String 
getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option 
getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option 
instant) {
 try {
   List allCommittedHashingConfig = 
getCommittedHashingConfigInstants(metaClient);
-  return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 
1);
+  if (instant.isPresent()) {
+Option res = 
getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, 
instant.get());
+// fall back to look up archived hashing config instant before return 
empty
+return res.isPresent() ? res : 
getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient),
 instant.get());
+  } else {
+return 
Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+  }
 } catch (Exception e) {
   throw new HoodieException("Failed to get hashing config instant to 
load.", e);
 }
   }
 
+  private static Option 
getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, 
String instant) {

Review Comment:
   The method `getHashingConfigInstantToLoadBeforeOrOn` is designed to handle 
Time Travel scenarios. For example, suppose we have a sequence of commits and 
replace-commits:
   
   C1, C2, R3 (a bucket rescale operation), C4, R5 (another bucket rescale).
   
   If we perform a Time Travel query targeting commit C4 (i.e., specifiedQueryInstant = C4), we should probably load the hashing configuration associated with R3 (the latest bucket rescale operation before or equal to C4).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-04-01 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021975825


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java:
##
@@ -68,25 +69,30 @@ public class FileIndex implements Serializable {
   private final org.apache.hadoop.conf.Configuration hadoopConf;
   private final PartitionPruners.PartitionPruner partitionPruner; // for 
partition pruning
   private final ColumnStatsProbe colStatsProbe;   // for 
probing column stats
-  private final int dataBucket;   // for 
bucket pruning
+  private final int dataBucketHashing;   // 
for bucket pruning
   private List partitionPaths;// cache of 
partition paths
   private final FileStatsIndex fileStatsIndex;// for data 
skipping
+  private final NumBucketsFunction numBucketsFunction;
 
   private FileIndex(
   StoragePath path,
   Configuration conf,
   RowType rowType,
   ColumnStatsProbe colStatsProbe,
   PartitionPruners.PartitionPruner partitionPruner,
-  int dataBucket) {
+  int dataBucketHashing,
+  NumBucketsFunction numBucketsFunction) {
 this.path = path;
 this.hadoopConf = HadoopConfigurations.getHadoopConf(conf);
 this.tableExists = StreamerUtil.tableExists(path.toString(), hadoopConf);
 this.metadataConfig = StreamerUtil.metadataConfig(conf);
 this.colStatsProbe = 
isDataSkippingFeasible(conf.get(FlinkOptions.READ_DATA_SKIPPING_ENABLED)) ? 
colStatsProbe : null;
 this.partitionPruner = partitionPruner;
-this.dataBucket = dataBucket;

Review Comment:
   Can we still use the bucket id here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-03-31 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021975643


##
hudi-common/src/main/java/org/apache/hudi/common/model/PartitionBucketIndexHashingConfig.java:
##
@@ -196,24 +196,51 @@ public static Option 
loadHashingConfig(Hoodie
 
   /**
* Get Latest committed hashing config instant to load.
+   * If instant is empty, then return latest hashing config instant
*/
-  public static String 
getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+  public static Option 
getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option 
instant) {
 try {
   List allCommittedHashingConfig = 
getCommittedHashingConfigInstants(metaClient);
-  return allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 
1);
+  if (instant.isPresent()) {
+Option res = 
getHashingConfigInstantToLoadBeforeOrOn(allCommittedHashingConfig, 
instant.get());
+// fall back to look up archived hashing config instant before return 
empty
+return res.isPresent() ? res : 
getHashingConfigInstantToLoadBeforeOrOn(getArchiveHashingConfigInstants(metaClient),
 instant.get());
+  } else {
+return 
Option.of(allCommittedHashingConfig.get(allCommittedHashingConfig.size() - 1));
+  }
 } catch (Exception e) {
   throw new HoodieException("Failed to get hashing config instant to 
load.", e);
 }
   }
 
+  private static Option 
getHashingConfigInstantToLoadBeforeOrOn(List hashingConfigInstants, 
String instant) {

Review Comment:
   We should always load the latest bucket config to comply with the latest bucket id mappings; otherwise the query would fail. We do not ensure snapshot isolation here because reads can be affected by the writer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-03-31 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021978502


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -103,6 +104,15 @@ case class HoodieFileIndex(spark: SparkSession,
 endCompletionTime = options.get(DataSourceReadOptions.END_COMMIT.key)) 
with FileIndex {
 
   @transient protected var hasPushedDownPartitionPredicates: Boolean = false
+  private val isPartitionSimpleBucketIndex = 
PartitionBucketIndexUtils.isPartitionSimpleBucketIndex(spark.sparkContext.hadoopConfiguration,
+metaClient.getBasePath.toString)
+
+  @transient private lazy val bucketIndexSupport = if 
(isPartitionSimpleBucketIndex) {
+val specifiedQueryInstant = 
options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant)
+new PartitionBucketIndexSupport(spark, metadataConfig, metaClient, 
specifiedQueryInstant)

Review Comment:
   Always query from the latest hash config because there is no snapshot isolation (SI) between readers and writers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-03-31 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021978135


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestPartitionBucketPruning.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.model.PartitionBucketIndexHashingConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.index.bucket.BucketIdentifier;
+import org.apache.hudi.source.prune.PrimaryKeyPruners;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.hudi.storage.StoragePathInfo;
+import org.apache.hudi.util.SerializableSchema;
+import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.TestConfigurations;
+import org.apache.hudi.utils.TestData;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.table.api.DataTypes;
+import org.apache.flink.table.expressions.CallExpression;
+import org.apache.flink.table.expressions.FieldReferenceExpression;
+import org.apache.flink.table.expressions.ResolvedExpression;
+import org.apache.flink.table.expressions.ValueLiteralExpression;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.types.DataType;
+import org.apache.hadoop.fs.Path;
+import org.hamcrest.CoreMatchers;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.core.Is.is;
+
+public class TestPartitionBucketPruning {

Review Comment:
   Should we move the tests into `TestHoodieTableSource`? Can we at least add an IT test for query result validation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-8990] Partition bucket index supports query pruning based on bucket id [hudi]

2025-03-31 Thread via GitHub


danny0405 commented on code in PR #13060:
URL: https://github.com/apache/hudi/pull/13060#discussion_r2021977009


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/prune/PrimaryKeyPruners.java:
##
@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {
 
   public static final int BUCKET_ID_NO_PRUNING = -1;
 
-  public static int getBucketId(List hashKeyFilters, 
Configuration conf) {
+  public static int getBucketFieldHashing(List 
hashKeyFilters, Configuration conf) {

Review Comment:
   Can we still return the bucket id? We can add a new param for the function:
   
   ```java
   
 // the input of bucketIdFunc is the partition path
 getBucketId(List hashKeyFilters, Configuration conf, 
Function bucketIdFunc)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org