Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-26 Thread via GitHub


voonhous merged PR #17632:
URL: https://github.com/apache/hudi/pull/17632


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-26 Thread via GitHub


voonhous commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3693101279

   (screenshot attached: https://github.com/user-attachments/assets/1c17fc2f-bd65-4d08-bdea-a6d399c62452)
   
   Azure and GitHub CI are both green, merging it in.





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-26 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3693060031

   
   ## CI report:
   
   * e9773596e3b258e1003d8b10b485aa7837a3999d Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10581) Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10530)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-26 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3693058532

   
   ## CI report:
   
   * e9773596e3b258e1003d8b10b485aa7837a3999d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690519866

   
   ## CI report:
   
   * e9773596e3b258e1003d8b10b485aa7837a3999d Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10530)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690444054

   
   ## CI report:
   
   * b1803e9855bac95a60cd2bea120be408308caa6f Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10525)
   * e9773596e3b258e1003d8b10b485aa7837a3999d Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10530)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690440289

   
   ## CI report:
   
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   * b1803e9855bac95a60cd2bea120be408308caa6f Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10525)
   * e9773596e3b258e1003d8b10b485aa7837a3999d Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10530)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690438443

   
   ## CI report:
   
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   * b1803e9855bac95a60cd2bea120be408308caa6f Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10525)
   * e9773596e3b258e1003d8b10b485aa7837a3999d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2646152643


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,490 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.SafeProjection;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadWithNulls() throws Exception {
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("value", DataTypes.DoubleType, true);
+
+List expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 100.0));
+expectedRows.add(createRow(2, null, 200.0));  // null name
+expectedRows.add(createRow(3, "Charlie", null));  // null value
+expectedRows.add(createRow(4, null, null));  // multiple nulls
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_nulls.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  List actualRows = readAllRows(reader, schema);
+
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify nulls are preserved
+  assertTrue(actualRows.get(1).isNullAt(1), "Second row name should be 
null");
+  assertFalse(actualRows.get(1).isNullAt(2), "Second row value should not 
be null");
+
+  assertFalse(actualRows.get(2).isNullAt(1), "Third row name should not be 
null");
+  assertTrue(actualRows.get(2).isNullAt(2), "Third row value should be 
null");
+
+  assertTrue(actualRows.get(3).isNullAt(1), "Fourth row name should be 
null");
+  assertTrue(actualRows.get(3).isNullAt(2), "Fourth row

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690349222

   
   ## CI report:
   
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [CANCELED](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   * b1803e9855bac95a60cd2bea120be408308caa6f Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10525)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690343263

   
   ## CI report:
   
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   * b1803e9855bac95a60cd2bea120be408308caa6f Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10525)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690341220

   
   ## CI report:
   
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   * b1803e9855bac95a60cd2bea120be408308caa6f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2646063578


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,534 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadWithNulls() throws Exception {
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("value", DataTypes.DoubleType, true);
+
+List expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 100.0));
+expectedRows.add(createRow(2, null, 200.0));  // null name
+expectedRows.add(createRow(3, "Charlie", null));  // null value
+expectedRows.add(createRow(4, null, null));  // multiple nulls
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_nulls.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  List actualRows = readAllRows(reader, schema);
+
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify nulls are preserved
+  assertTrue(actualRows.get(1).isNullAt(1), "Second row name should be 
null");
+  assertFalse(actualRows.get(1).isNullAt(2), "Second row value should not 
be null");
+
+  assertFalse(actualRows.get(2).isNullAt(1), "Third row name should not be 
null");
+  assertTrue(actualRows.get(2).isNullAt(2), "Third row value should be 
null");
+
+  assertTrue(actualRows.get(3).isNullAt(1), "Fourth row name should be 
null");
+  assertTrue(actualRows.get(3).isNullAt(2), "Fourth row value should be 
nul

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690246217

   
   ## CI report:
   
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10522)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690207245

   
   ## CI report:
   
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2646006325


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,534 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadWithNulls() throws Exception {
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("value", DataTypes.DoubleType, true);
+
+List expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 100.0));
+expectedRows.add(createRow(2, null, 200.0));  // null name
+expectedRows.add(createRow(3, "Charlie", null));  // null value
+expectedRows.add(createRow(4, null, null));  // multiple nulls
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_nulls.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  List actualRows = readAllRows(reader, schema);
+
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify nulls are preserved
+  assertTrue(actualRows.get(1).isNullAt(1), "Second row name should be 
null");
+  assertFalse(actualRows.get(1).isNullAt(2), "Second row value should not 
be null");
+
+  assertFalse(actualRows.get(2).isNullAt(1), "Third row name should not be 
null");
+  assertTrue(actualRows.get(2).isNullAt(2), "Third row value should be 
null");
+
+  assertTrue(actualRows.get(3).isNullAt(1), "Fourth row name should be 
null");
+  assertTrue(actualRows.get(3).isNullAt(2), "Fourth row value should be 
nul

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2645997851


##
hudi-common/pom.xml:
##
@@ -178,6 +178,16 @@
       <artifactId>parquet-avro</artifactId>
     </dependency>
 
+
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-vector</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-memory-netty</artifactId>
+    </dependency>
+

Review Comment:
   Can this be removed now?






Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690190743

   
   ## CI report:
   
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   * 26c87a550cc3866de4a1f5474e0ce6e403bff587 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2645995893


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -417,15 +404,55 @@ public void testReadWithMetadataPopulation() throws 
Exception {
 
   /**
* Helper method to read all rows from a reader into a list.
-   * Makes copies of InternalRows to avoid reuse issues.
+   * Converts UnsafeRow to GenericInternalRow by extracting field values using 
type-specific getters.
*/
-  private List readAllRows(HoodieSparkLanceReader reader) throws 
IOException {
+  private List readAllRows(HoodieSparkLanceReader reader, 
StructType schema) throws IOException {
 List rows = new ArrayList<>();
-HoodieSchema schema = reader.getSchema();
-try (ClosableIterator> iterator = 
reader.getRecordIterator(schema)) {
+HoodieSchema hoodieSchema = reader.getSchema();
+
+try (ClosableIterator> iterator = 
reader.getRecordIterator(hoodieSchema)) {
   while (iterator.hasNext()) {
 HoodieRecord record = iterator.next();
-rows.add(record.getData().copy());
+InternalRow unsafeRow = record.getData();
+
+// Extract all field values from UnsafeRow using type-specific getters
+Object[] values = new Object[schema.fields().length];
+for (int i = 0; i < schema.fields().length; i++) {
+  if (unsafeRow.isNullAt(i)) {
+values[i] = null;
+  } else {
+org.apache.spark.sql.types.DataType dataType = 
schema.fields()[i].dataType();

Review Comment:
   to fix this



##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -417,15 +404,55 @@ public void testReadWithMetadataPopulation() throws 
Exception {
 
   /**
* Helper method to read all rows from a reader into a list.
-   * Makes copies of InternalRows to avoid reuse issues.
+   * Converts UnsafeRow to GenericInternalRow by extracting field values using 
type-specific getters.
*/
-  private List readAllRows(HoodieSparkLanceReader reader) throws 
IOException {
+  private List readAllRows(HoodieSparkLanceReader reader, 
StructType schema) throws IOException {
 List rows = new ArrayList<>();
-HoodieSchema schema = reader.getSchema();
-try (ClosableIterator> iterator = 
reader.getRecordIterator(schema)) {
+HoodieSchema hoodieSchema = reader.getSchema();
+
+try (ClosableIterator> iterator = 
reader.getRecordIterator(hoodieSchema)) {
   while (iterator.hasNext()) {
 HoodieRecord record = iterator.next();
-rows.add(record.getData().copy());
+InternalRow unsafeRow = record.getData();
+
+// Extract all field values from UnsafeRow using type-specific getters
+Object[] values = new Object[schema.fields().length];
+for (int i = 0; i < schema.fields().length; i++) {
+  if (unsafeRow.isNullAt(i)) {
+values[i] = null;
+  } else {
+org.apache.spark.sql.types.DataType dataType = 
schema.fields()[i].dataType();

Review Comment:
   to fix this to not use full path
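
   For context on the note above, here is a minimal sketch (not the PR's exact code) of how the copy loop could look once `DataType` is imported rather than written with its full package path, using the generic `InternalRow.get(ordinal, dataType)` accessor; the `copyRow` helper name is illustrative only. Copying into a `GenericInternalRow` matters because the reader may reuse the underlying row between iterations.

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// Hedged sketch: copy one InternalRow into an independent GenericInternalRow
// using the schema's DataTypes, so the loop body needs no fully qualified name.
private static InternalRow copyRow(InternalRow row, StructType schema) {
  Object[] values = new Object[schema.fields().length];
  for (int i = 0; i < schema.fields().length; i++) {
    if (row.isNullAt(i)) {
      values[i] = null;
    } else {
      DataType dataType = schema.fields()[i].dataType();  // imported type, no full path
      values[i] = row.get(i, dataType);                   // generic accessor by DataType
    }
  }
  return new GenericInternalRow(values);
}
```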






Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690152414

   
   ## CI report:
   
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   * df85fb2395a9beb597d0f4f23d7604ace05306ac Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10520)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3690149857

   
   ## CI report:
   
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   * df85fb2395a9beb597d0f4f23d7604ace05306ac UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2645968190


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,513 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadPrimitiveTypes() throws Exception {
+// Create schema with primitive types
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("age", DataTypes.LongType, true)
+.add("score", DataTypes.DoubleType, true)
+.add("active", DataTypes.BooleanType, true);
+
+// Create test data
+List expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 30L, 95.5, true));
+expectedRows.add(createRow(2, "Bob", 25L, 87.3, false));
+expectedRows.add(createRow(3, "Charlie", 35L, 92.1, true));
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_primitives.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  // Verify record count
+  assertEquals(expectedRows.size(), reader.getTotalRecords(), "Record 
count should match");
+
+  // Verify schema
+  assertNotNull(reader.getSchema(), "Schema should not be null");
+
+  // Read all records
+  List actualRows = readAllRows(reader);
+
+  // Verify record count
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify each record
+  for (int i = 0; i < expectedRows.size(); i++) {
+InternalRow expected = expectedRows.get(i);
+InternalRow actual = actualRows.get(i);
+
+assertEquals(expected.getInt(0), actual.getInt(0), "id field should 
match");

Review Comment:
   addressed in recent commit.




Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-24 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644780106


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> uns

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688518247

   
   ## CI report:
   
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644658587


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator,

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688402597

   
   ## CI report:
   
   * ac8bb909fec7b71d16d6c8e0ffd0b1ec7ac0db44 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10499)
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688357884

   
   ## CI report:
   
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
   * ac8bb909fec7b71d16d6c8e0ffd0b1ec7ac0db44 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10499)
   * f69f6c533a4104c104f2edbf855464c50fe5b17b Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10502)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688307472

   
   ## CI report:
   
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   * ac8bb909fec7b71d16d6c8e0ffd0b1ec7ac0db44 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10499)
 
   * f69f6c533a4104c104f2edbf855464c50fe5b17b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644523188


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+Set<Pair<String, Long>> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {

Review Comment:
   added a test



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agr

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688281041

   
   ## CI report:
   
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   * ac8bb909fec7b71d16d6c8e0ffd0b1ec7ac0db44 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10499)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688278466

   
   ## CI report:
   
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   * ac8bb909fec7b71d16d6c8e0ffd0b1ec7ac0db44 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r268194


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+Set<Pair<String, Long>> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> uns

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644434837


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+Set<Pair<String, Long>> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> uns

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688173874

   
   ## CI report:
   
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3688149470

   
   ## CI report:
   
   * 9b7dfb1c712b120065844e3497a72d4be51e15d0 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10485)
 
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644376225


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,513 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadPrimitiveTypes() throws Exception {
+// Create schema with primitive types
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("age", DataTypes.LongType, true)
+.add("score", DataTypes.DoubleType, true)
+.add("active", DataTypes.BooleanType, true);
+
+// Create test data
+List<InternalRow> expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 30L, 95.5, true));
+expectedRows.add(createRow(2, "Bob", 25L, 87.3, false));
+expectedRows.add(createRow(3, "Charlie", 35L, 92.1, true));
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_primitives.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  // Verify record count
+  assertEquals(expectedRows.size(), reader.getTotalRecords(), "Record 
count should match");
+
+  // Verify schema
+  assertNotNull(reader.getSchema(), "Schema should not be null");
+
+  // Read all records
+  List<InternalRow> actualRows = readAllRows(reader);
+
+  // Verify record count
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify each record
+  for (int i = 0; i < expectedRows.size(); i++) {
+InternalRow expected = expectedRows.get(i);
+InternalRow actual = actualRows.get(i);
+
+assertEquals(expected.getInt(0), actual.getInt(0), "id field should 
match");

Review Comment:
   If we can convert the UnsafeRow to a GenericInternalRow, I think that will 
make it a lot easier as we add more types or extend t
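   
   A minimal sketch of a more type-agnostic assertion along those lines (the helper 
name `assertRowsEqual` and the schema-driven loop are illustrative assumptions, not 
code from this PR); it compares fields through the schema instead of relying on row 
equality:
   
   ```java
   // Illustrative helper: compare two InternalRows field by field using the schema,
   // so adding new column types does not require new hand-written asserts.
   private static void assertRowsEqual(InternalRow expected, InternalRow actual, StructType schema) {
     for (int i = 0; i < schema.fields().length; i++) {
       String name = schema.fields()[i].name();
       if (expected.isNullAt(i)) {
         assertTrue(actual.isNullAt(i), "Field " + name + " should be null");
       } else {
         Object expectedValue = expected.get(i, schema.fields()[i].dataType());
         Object actualValue = actual.get(i, schema.fields()[i].dataType());
         assertEquals(expectedValue, actualValue, "Field " + name + " should match");
       }
     }
   }
   ```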

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644362952


##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceReader.java:
##
@@ -0,0 +1,513 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.testutils.HoodieTestUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.io.storage.LanceTestUtils.createRow;
+import static 
org.apache.hudi.io.storage.LanceTestUtils.createRowWithMetaFields;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Tests for {@link HoodieSparkLanceReader}.
+ */
+public class TestHoodieSparkLanceReader {
+
+  @TempDir
+  File tempDir;
+
+  private HoodieStorage storage;
+  private SparkTaskContextSupplier taskContextSupplier;
+  private String instantTime;
+
+  @BeforeEach
+  public void setUp() throws IOException {
+storage = HoodieTestUtils.getStorage(tempDir.getAbsolutePath());
+taskContextSupplier = new SparkTaskContextSupplier();
+instantTime = "2025120112000";
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+if (storage != null) {
+  storage.close();
+}
+  }
+
+  @Test
+  public void testReadPrimitiveTypes() throws Exception {
+// Create schema with primitive types
+StructType schema = new StructType()
+.add("id", DataTypes.IntegerType, false)
+.add("name", DataTypes.StringType, true)
+.add("age", DataTypes.LongType, true)
+.add("score", DataTypes.DoubleType, true)
+.add("active", DataTypes.BooleanType, true);
+
+// Create test data
+List<InternalRow> expectedRows = new ArrayList<>();
+expectedRows.add(createRow(1, "Alice", 30L, 95.5, true));
+expectedRows.add(createRow(2, "Bob", 25L, 87.3, false));
+expectedRows.add(createRow(3, "Charlie", 35L, 92.1, true));
+
+// Write and read back
+StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_primitives.lance");
+try (HoodieSparkLanceReader reader = writeAndCreateReader(path, schema, 
expectedRows)) {
+  // Verify record count
+  assertEquals(expectedRows.size(), reader.getTotalRecords(), "Record 
count should match");
+
+  // Verify schema
+  assertNotNull(reader.getSchema(), "Schema should not be null");
+
+  // Read all records
+  List<InternalRow> actualRows = readAllRows(reader);
+
+  // Verify record count
+  assertEquals(expectedRows.size(), actualRows.size(), "Should read same 
number of records");
+
+  // Verify each record
+  for (int i = 0; i < expectedRows.size(); i++) {
+InternalRow expected = expectedRows.get(i);
+InternalRow actual = actualRows.get(i);
+
+assertEquals(expected.getInt(0), actual.getInt(0), "id field should 
match");

Review Comment:
   @the-other-tim-brown 
   
   When doing an assertion on just the rows such as  ` assertEquals(expected, 
actual);` this fails for 
   ```
   or

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644243219


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+/**
+ * Manages Arrow BufferAllocator lifecycle for Arrow-based file format 
operations.
+ *
+ * Following Arrow best practices:
+ * 
+ *   Single RootAllocator per application
+ *   Named child allocators for debugging and isolation
+ *   Caller-specified memory limits per child allocator
+ * 
+ *
+ * The root allocator is hardcoded to Long.MAX_VALUE and acts as a
+ * bookkeeper. Memory limits are enforced at the child allocator level, with 
each
+ * caller specifying an appropriate limit for their use case.
+ *
+ * @see <a href="https://arrow.apache.org/docs/java/memory.html">Arrow Memory 
Management</a>
+ */
+public class HoodieArrowAllocator {
+
+  private HoodieArrowAllocator() {
+// Utility class
+  }
+
+  /**
+   * Initialization-on-demand holder idiom for thread-safe lazy initialization.
+   * The root allocator is created when first accessed and has a maximum size 
of Long.MAX_VALUE.
+   */
+  private static class RootAllocatorHolder {
+static final BufferAllocator INSTANCE = new RootAllocator(Long.MAX_VALUE);
+
+static {
+  Runtime.getRuntime().addShutdownHook(new Thread(
+  INSTANCE::close, "hudi-arrow-root-allocator-shutdown"));
+}
+  }
+
+  /**
+   * Get the shared root allocator.
+   * Thread-safe lazy initialization using the holder idiom.
+   *
+   * @return The singleton root allocator instance
+   */
+  private static BufferAllocator getRootAllocator() {
+return RootAllocatorHolder.INSTANCE;
+  }
+
+  /**
+   * Create a named child allocator for Arrow operations.
+   * Caller is responsible for closing the returned allocator.
+   *
+   * @param name Descriptive name for debugging (e.g., 
"HoodieSparkLanceReader-data-file.lance")
+   * @param childSizeBytes Maximum memory size in bytes for this child 
allocator
+   * @return A new child allocator with the specified size limit
+   */
+  public static BufferAllocator newChildAllocator(String name, long 
childSizeBytes) {

Review Comment:
   Yes, it was mainly the file name adding to the length, but if there is no limit 
then it will be good to keep it for better debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644199798


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+/**
+ * Manages Arrow BufferAllocator lifecycle for Arrow-based file format 
operations.
+ *
+ * Following Arrow best practices:
+ * 
+ *   Single RootAllocator per application
+ *   Named child allocators for debugging and isolation
+ *   Caller-specified memory limits per child allocator
+ * 
+ *
+ * The root allocator is hardcoded to Long.MAX_VALUE and acts as a
+ * bookkeeper. Memory limits are enforced at the child allocator level, with 
each
+ * caller specifying an appropriate limit for their use case.
+ *
+ * @see <a href="https://arrow.apache.org/docs/java/memory.html">Arrow Memory 
Management</a>
+ */
+public class HoodieArrowAllocator {
+
+  private HoodieArrowAllocator() {
+// Utility class
+  }
+
+  /**
+   * Initialization-on-demand holder idiom for thread-safe lazy initialization.
+   * The root allocator is created when first accessed and has a maximum size 
of Long.MAX_VALUE.
+   */
+  private static class RootAllocatorHolder {
+static final BufferAllocator INSTANCE = new RootAllocator(Long.MAX_VALUE);
+
+static {
+  Runtime.getRuntime().addShutdownHook(new Thread(
+  INSTANCE::close, "hudi-arrow-root-allocator-shutdown"));
+}
+  }
+
+  /**
+   * Get the shared root allocator.
+   * Thread-safe lazy initialization using the holder idiom.
+   *
+   * @return The singleton root allocator instance
+   */
+  private static BufferAllocator getRootAllocator() {
+return RootAllocatorHolder.INSTANCE;
+  }
+
+  /**
+   * Create a named child allocator for Arrow operations.
+   * Caller is responsible for closing the returned allocator.
+   *
+   * @param name Descriptive name for debugging (e.g., 
"HoodieSparkLanceReader-data-file.lance")
+   * @param childSizeBytes Maximum memory size in bytes for this child 
allocator
+   * @return A new child allocator with the specified size limit
+   */
+  public static BufferAllocator newChildAllocator(String name, long 
childSizeBytes) {

Review Comment:
   @the-other-tim-brown When checking the arrow docs 
https://arrow.apache.org/docs/java/memory.html did not see anything around a 
size limit for the name.
   
   When checking the arrow java impl, I also dont see any logic around size 
naming restraints atleast when checking the super classes of ChildAllocator: 
https://github.com/apache/arrow-java/blob/main/memory/memory-core/src/main/java/org/apache/arrow/memory/ChildAllocator.java#L36
   
   
https://github.com/apache/arrow-java/blob/main/memory/memory-core/src/main/java/org/apache/arrow/memory/BaseAllocator.java#L121
   
https://github.com/apache/arrow-java/blob/main/memory/memory-core/src/main/java/org/apache/arrow/memory/Accountant.java#L65
   
   Just curious if the concern was due to me appending a file path (for some of 
the naming) for these child allocators like so 
https://github.com/apache/hudi/pull/17632/files#diff-ae5484bd256dc2ac54f998bd0a2952a2ef3789058fc87331343d539dff12fdb7R77.
 I did this for debuggability but it's not essential and I can remove it if 
needed.
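
   For reference, a minimal usage sketch of the child-allocator pattern being 
discussed (illustrative only; `newChildAllocator` and the 120MB constant mirror 
what the PR's diff shows, while the class and method names here are assumptions):

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.hudi.common.util.HoodieArrowAllocator;

public class LanceAllocatorUsageSketch {
  // 120 MB cap for data reads, mirroring LANCE_DATA_ALLOCATOR_SIZE in the PR
  private static final long DATA_ALLOCATOR_SIZE = 120L * 1024 * 1024;

  public static void readLanceFile(String filePath) {
    // The file path in the name is purely for debuggability; a leak report then
    // points at a specific reader instead of the anonymous root allocator.
    try (BufferAllocator allocator = HoodieArrowAllocator.newChildAllocator(
        "HoodieSparkLanceReader-" + filePath, DATA_ALLOCATOR_SIZE)) {
      // ... open the LanceFileReader with this allocator and iterate batches ...
    }
  }
}
```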



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687851257

   
   ## CI report:
   
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   * 9b7dfb1c712b120065844e3497a72d4be51e15d0 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10485)
 
   * 912b385b4f57de367898ba71085efccf5d164959 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10490)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644160887


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;

Review Comment:
   Will change the package name to better reflect that.
   
   In terms of where this class should live, I put it in `hudi-common` since I 
wanted it to be leveraged by other engines. If hudi-client-common also provides 
the same guarantees of reuse, then I am ok with moving this class there.
   cc @the-other-tim-brown @yihua 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644120478


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/lance/HoodieBaseLanceWriter.java:
##
@@ -123,6 +127,9 @@ public long getWrittenRecordCount() {
*/
   @Override
   public void close() throws IOException {
+Exception primaryException = null;

Review Comment:
   I pushed a recent commit following a pattern that @voonhous requested around 
exception handling. See voon's original comment here 
https://github.com/apache/hudi/pull/17632#discussion_r2639980943
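
   For readers following along, one common shape of that close-ordering pattern 
is sketched below (not necessarily the exact code in the commit; the two fields 
stand in for the Lance writer and its Arrow allocator): keep the first failure 
as the primary exception and attach later failures as suppressed ones.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of a close() that preserves the first failure and suppresses later ones.
public class SuppressedCloseSketch implements Closeable {
  private final Closeable lanceWriter;   // placeholder for the underlying Lance writer
  private final AutoCloseable allocator; // placeholder for the Arrow child allocator

  public SuppressedCloseSketch(Closeable lanceWriter, AutoCloseable allocator) {
    this.lanceWriter = lanceWriter;
    this.allocator = allocator;
  }

  @Override
  public void close() throws IOException {
    IOException primary = null;
    try {
      lanceWriter.close();          // close the writer first so data is flushed
    } catch (IOException e) {
      primary = e;
    }
    try {
      allocator.close();            // then release Arrow memory
    } catch (Exception e) {
      if (primary != null) {
        primary.addSuppressed(e);   // keep the original failure as the primary cause
      } else {
        primary = new IOException("Failed to close Arrow allocator", e);
      }
    }
    if (primary != null) {
      throw primary;
    }
  }
}
```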



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644160887


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;

Review Comment:
   Will change the package name to better reflect that.
   
   In terms of where this class should live, I put it in `hudi-common` since I 
wanted it to be leveraged by other engines. If hudi-client-common also provides 
the same guarantees of reuse, then I am ok with moving this class there.
   cc @the-other-tim-brown @yihua 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


yihua commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644156909


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;

Review Comment:
   I think `org.apache.hudi.io.memory` is more suitable given that 
`HoodieArrowAllocator` is I/O and memory management related.
   
   If the class is intended to be used by both reader and writer and does not 
contain Hudi-specific logic, it would be good to add the class to the `hudi-io` 
module. The `hudi-client-common` module is mainly for writer classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644120478


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/lance/HoodieBaseLanceWriter.java:
##
@@ -123,6 +127,9 @@ public long getWrittenRecordCount() {
*/
   @Override
   public void close() throws IOException {
+Exception primaryException = null;

Review Comment:
   I pushed a recent commit following a pattern that @voonhous requested



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687784070

   
   ## CI report:
   
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   * 9b7dfb1c712b120065844e3497a72d4be51e15d0 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10485)
 
   * 912b385b4f57de367898ba71085efccf5d164959 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2644067177


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/lance/HoodieBaseLanceWriter.java:
##
@@ -123,6 +127,9 @@ public long getWrittenRecordCount() {
*/
   @Override
   public void close() throws IOException {
+Exception primaryException = null;

Review Comment:
   This variable is unused; let's remove it.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,331 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.HoodieArrowAllocator;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // Memory size for data read operations: 120MB
+  public static final long LANCE_DATA_ALLOCATOR_SIZE = 120 * 1024 * 1024;
+
+  // Memory size for metadata operations: 8MB
+  private static final long LANCE_METADATA_ALLOCATOR_SIZE = 8 * 1024 * 1024;
+
+  // number of rows to read
+  private static final int DEFAULT_BATCH_SIZE = 512;
+  private final StoragePath path;
+
+  public HoodieSparkLanceReader(StoragePath path) {
+this.path = path;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+Set<Pair<String, Long>> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchem

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687671900

   
   ## CI report:
   
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   * 9b7dfb1c712b120065844e3497a72d4be51e15d0 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10485)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687666746

   
   ## CI report:
   
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   * 9b7dfb1c712b120065844e3497a72d4be51e15d0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687494432

   
   ## CI report:
   
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687117082

   
   ## CI report:
   
   * f081df10721e3071bba1676b6aff6c9e466d11ae Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10451)
 
   * 98a2fe6a7650bf01a411253720c01eea7a207941 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10478)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-23 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3687109161

   
   ## CI report:
   
   * f081df10721e3071bba1676b6aff6c9e466d11ae Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10451)
 
   * 98a2fe6a7650bf01a411253720c01eea7a207941 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2641654876


##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+/**
+ * Manages Arrow BufferAllocator lifecycle for Arrow-based file format 
operations.
+ *
+ * Following Arrow best practices:
+ * 
+ *   Single RootAllocator per application
+ *   Named child allocators for debugging and isolation
+ *   Caller-specified memory limits per child allocator
+ * 
+ *
+ * The root allocator is hardcoded to Long.MAX_VALUE and acts as a
+ * bookkeeper. Memory limits are enforced at the child allocator level, with 
each
+ * caller specifying an appropriate limit for their use case.
+ *
+ * @see <a href="https://arrow.apache.org/docs/java/memory.html">Arrow Memory Management</a>
+ */
+public class HoodieArrowAllocator {
+
+  private static volatile BufferAllocator rootAllocator;
+
+  private HoodieArrowAllocator() {
+// Utility class
+  }
+
+  /**
+   * Get or create the shared root allocator using double-checked locking.

Review Comment:
   You can avoid this locking by using the intialization-on-demand pattern 
https://en.wikipedia.org/wiki/Initialization-on-demand_holder_idiom
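
   For context, a generic sketch of that idiom is shown below (class names here 
are illustrative); the later revision of `HoodieArrowAllocator` quoted earlier 
in this thread follows the same shape. Class initialization is guaranteed by the 
JVM to run at most once and to be thread-safe, so no volatile field or explicit 
locking is needed.

```java
// Initialization-on-demand holder idiom: the JVM initializes Holder (and thus
// INSTANCE) only on the first call to getInstance().
public final class LazySingletonSketch {
  private LazySingletonSketch() {
  }

  private static class Holder {
    static final LazySingletonSketch INSTANCE = new LazySingletonSketch();
  }

  public static LazySingletonSketch getInstance() {
    return Holder.INSTANCE;
  }
}
```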



##
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieArrowAllocator.java:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+
+/**
+ * Manages Arrow BufferAllocator lifecycle for Arrow-based file format 
operations.
+ *
+ * Following Arrow best practices:
+ * 
+ *   Single RootAllocator per application
+ *   Named child allocators for debugging and isolation
+ *   Caller-specified memory limits per child allocator
+ * 
+ *
+ * The root allocator is hardcoded to Long.MAX_VALUE and acts as a
+ * bookkeeper. Memory limits are enforced at the child allocator level, with 
each
+ * caller specifying an appropriate limit for their use case.
+ *
+ * @see <a href="https://arrow.apache.org/docs/java/memory.html">Arrow Memory Management</a>
+ */
+public class HoodieArrowAllocator {
+
+  private static volatile BufferAllocator rootAllocator;
+
+  private HoodieArrowAllocator() {
+// Utility class
+  }
+
+  /**
+   * Get or create the shared root allocator using double-checked locking.
+   * The root allocator is (Long.MAX_VALUE).
+   */
+  private static BufferAllocator getRootAllocator() {
+if (rootAllocator == null) {
+  synchronized (HoodieArrowAllocator.class) {
+if (rootAllocator == null) {
+  rootAllocator = new RootAllocator(Long.MAX_VALUE);
+}
+  }
+}
+return rootAllocator;
+  }
+
+  /**
+   * Create a named child allocator for Arrow operations.
+   * Caller is responsible for closing the returned allocator.
+   *
+   * @param name Descriptive name for debugging (e.g., 
"HoodieSparkLanceReader-data-file.lance")
+   * @param childSizeBytes Maximum memory size in bytes for this child 
allocator
+   * @return A new child allocator with the specified size limit
+   */
+  public static BufferAllocator newChildAllocator(String name, long 
childSizeByt

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3684634853

   
   ## CI report:
   
   * f081df10721e3071bba1676b6aff6c9e466d11ae Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2641544023


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new 
HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator getRecordKeyIterator() throws IOException {
+//TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2641544023


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new 
HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator getRecordKeyIterator() throws IOException {
+//TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3684550588

   
   ## CI report:
   
   * cd2f4f0accde969b51e65de106d519f1fc3668dc Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10438)
 
   * f081df10721e3071bba1676b6aff6c9e466d11ae Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3684548160

   
   ## CI report:
   
   * cd2f4f0accde969b51e65de106d519f1fc3668dc Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10438)
 
   * f081df10721e3071bba1676b6aff6c9e466d11ae UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640912528


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new 
HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator getRecordKeyIterator() throws IOException {
+//TODO to revisit adding su

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


voonhous commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640827786


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new 
HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator getRecordKeyIterator() throws IOException {
+//TODO to revisit adding support for w

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3683395962

   
   ## CI report:
   
   * cd2f4f0accde969b51e65de106d519f1fc3668dc Azure: 
[SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10438)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640791764


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set> filterRowKeys(Set candidateRowKeys) {
+Set> result = new HashSet<>();
+long position = 0;
+
+try (ClosableIterator keyIterator = getRecordKeyIterator()) {
+  while (keyIterator.hasNext()) {
+String recordKey = keyIterator.next();
+// If filter is empty/null, then all keys will be added.
+// if filter has specific keys, then ensure only those are added
+if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+|| candidateRowKeys.contains(recordKey)) {
+  result.add(Pair.of(recordKey, position));
+}
+position++;
+  }
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to filter row keys from Lance file: 
" + path, e);
+}
+
+return result;
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) 
throws IOException {
+return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator> 
getRecordIterator(HoodieSchema schema) throws IOException {
+ClosableIterator iterator = getUnsafeRowIterator();
+return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new 
HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator getRecordKeyIterator() throws IOException {
+//TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640615875


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding su
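
For context, a minimal usage sketch of the reader API quoted above, written in Java; it is illustrative only and not code from this PR, and the storage, path, and schema wiring shown here is assumed:

import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.schema.HoodieSchema;
import org.apache.hudi.common.util.collection.ClosableIterator;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.io.storage.HoodieSparkLanceReader;
import org.apache.hudi.storage.HoodieStorage;
import org.apache.hudi.storage.StoragePath;
import org.apache.spark.sql.catalyst.InternalRow;

import java.io.IOException;
import java.util.Collections;
import java.util.Set;

public class LanceReaderUsageSketch {
  // 'storage', 'path' and 'schema' are assumed to come from the table setup.
  static void scan(HoodieStorage storage, StoragePath path, HoodieSchema schema) throws IOException {
    HoodieSparkLanceReader reader = new HoodieSparkLanceReader(storage, path);
    // Full scan: each row comes back wrapped as a HoodieRecord<InternalRow>.
    try (ClosableIterator<HoodieRecord<InternalRow>> it = reader.getRecordIterator(schema)) {
      while (it.hasNext()) {
        HoodieRecord<InternalRow> record = it.next();
        // process record ...
      }
    }
    // Key lookup: returns (recordKey, position) pairs for matching keys.
    Set<Pair<String, Long>> hits = reader.filterRowKeys(Collections.singleton("key-1"));
  }
}

As the quoted filterRowKeys shows, matches are returned as (recordKey, position) pairs, so callers can map matched keys back to row positions within the file.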

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640550221


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding su

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640542317


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


voonhous commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640512516


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding support for w

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3682874049

   
   ## CI report:
   
   * 3f835c210b9523627cb2e629cc8ce30dd9b7fa20 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10402)
 
   * cd2f4f0accde969b51e65de106d519f1fc3668dc Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10438)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3682868296

   
   ## CI report:
   
   * 3f835c210b9523627cb2e629cc8ce30dd9b7fa20 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10402)
 
   * cd2f4f0accde969b51e65de106d519f1fc3668dc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640391279


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


voonhous commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2639980943


##
hudi-hadoop-common/src/main/java/org/apache/hudi/io/lance/HoodieBaseLanceWriter.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io.lance;
+
+import com.lancedb.lance.file.LanceFileWriter;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+
+import javax.annotation.concurrent.NotThreadSafe;
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Base class for Hudi Lance file writers supporting different record types.
+ *
+ * This class handles common Lance file operations including:
+ * - LanceFileWriter lifecycle management
+ * - BufferAllocator management
+ * - Record buffering and batch flushing
+ * - File size checks
+ *
+ * Subclasses must implement type-specific conversion to Arrow format.
+ *
+ * @param <R> The record type (e.g., GenericRecord, InternalRow)
+ */
+@NotThreadSafe
+public abstract class HoodieBaseLanceWriter<R> implements Closeable {
+  protected static final int DEFAULT_BATCH_SIZE = 1000;
+  protected final HoodieStorage storage;
+  protected final StoragePath path;
+  protected final BufferAllocator allocator;
+  protected final List<R> bufferedRecords;
+  protected final int batchSize;
+  protected final long maxFileSize;
+  protected long writtenRecordCount = 0;
+  protected VectorSchemaRoot root;
+
+  private LanceFileWriter writer;
+
+  /**
+   * Constructor for base Lance writer.
+   *
+   * @param storage HoodieStorage instance
+   * @param path Path where Lance file will be written
+   * @param batchSize Number of records to buffer before flushing to Lance
+   * @param maxFileSize Maximum file size in bytes before rolling over to new file
+   */
+  protected HoodieBaseLanceWriter(HoodieStorage storage, StoragePath path, int batchSize, long maxFileSize) {
+    this.storage = storage;
+    this.path = path;
+    this.allocator = new RootAllocator(Long.MAX_VALUE);
+    this.bufferedRecords = new ArrayList<>(batchSize);
+    this.batchSize = batchSize;
+    this.maxFileSize = maxFileSize;
+  }
+
+  /**
+   * Populate the VectorSchemaRoot with buffered records.
+   * Subclasses must implement type-specific conversion logic.
+   * The VectorSchemaRoot field is reused across batches and managed by this base class.
+   *
+   * @param records List of records to convert
+   */
+  protected abstract void populateVectorSchemaRoot(List<R> records);
+
+  /**
+   * Get the Arrow schema for this writer.
+   * Subclasses must provide the Arrow schema corresponding to their record type.
+   *
+   * @return Arrow schema
+   */
+  protected abstract Schema getArrowSchema();
+
+  /**
+   * Write a single record. Records are buffered and flushed in batches.
+   *
+   * @param record Record to write
+   * @throws IOException if write fails
+   */
+  public void write(R record) throws IOException {
+    bufferedRecords.add(record);
+    writtenRecordCount++;
+
+    if (bufferedRecords.size() >= batchSize) {
+      flushBatch();
+    }
+  }
+
+  /**
+   * Check if writer can accept more records based on file size.
+   * Uses filesystem-based size checking (similar to ORC/HFile approach).
+   *
+   * @return true if writer can accept more records, false if file size limit reached
+   */
+  public boolean canWrite() {
+    //TODO will need to implement proper way to compute this
+    return true;
+  }
+
+  /**
+   * Get the total number of records written so far.
+   *
+   * @return Number of records written
+   */
+  public long getWrittenRecordCount() {
+    return writtenRecordCount;
+  }
+
+  /**
+   * Close the writer, flushing any remaining buffered records.
+   *
+   * @throws IOException if close fails
+   */
+  @Override
+  public void close() throws IOException {
+    try {
+      // Flush any remaining bu
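
For context, a minimal hypothetical subclass sketch in Java illustrating the conversion contract described above (getArrowSchema plus populateVectorSchemaRoot). The single utf8 "value" column is invented for illustration, and the sketch assumes the base class allocates the protected root field from getArrowSchema() before each flush, which is not shown in the quoted code:

import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.hudi.io.lance.HoodieBaseLanceWriter;
import org.apache.hudi.storage.HoodieStorage;
import org.apache.hudi.storage.StoragePath;

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

// Hypothetical subclass, for illustration only.
public class HoodieStringLanceWriter extends HoodieBaseLanceWriter<String> {

  public HoodieStringLanceWriter(HoodieStorage storage, StoragePath path, int batchSize, long maxFileSize) {
    super(storage, path, batchSize, maxFileSize);
  }

  @Override
  protected Schema getArrowSchema() {
    // Single nullable utf8 column named "value" (invented for this sketch).
    Field value = new Field("value", FieldType.nullable(new ArrowType.Utf8()), null);
    return new Schema(Collections.singletonList(value));
  }

  @Override
  protected void populateVectorSchemaRoot(List<String> records) {
    // Assumption: the base class has allocated 'root' from getArrowSchema() before this call.
    VarCharVector vector = (VarCharVector) root.getVector("value");
    vector.allocateNew();
    for (int i = 0; i < records.size(); i++) {
      vector.setSafe(i, records.get(i).getBytes(StandardCharsets.UTF_8));
    }
    vector.setValueCount(records.size());
    root.setRowCount(records.size());
  }
}

With such a subclass, repeated calls to write(record) buffer records and trigger a flush to the Lance file every batchSize records, following the base class quoted above.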

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640200665


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding su

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2639958124


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-22 Thread via GitHub


rahil-c commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2640021666


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+    this.path = path;
+    this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+    throw new UnsupportedOperationException("Min/max record key tracking is not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+    throw new UnsupportedOperationException("Bloom filter is not yet supported for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If filter is empty/null, then all keys will be added.
+        // if filter has specific keys, then ensure only those are added
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding support for wh

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-21 Thread via GitHub


the-other-tim-brown commented on code in PR #17632:
URL: https://github.com/apache/hudi/pull/17632#discussion_r2638165667


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java:
##
@@ -0,0 +1,326 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.HoodieSchemaConversionUtils;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieSparkRecord;
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.storage.HoodieStorage;
+import org.apache.hudi.storage.StoragePath;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.util.LanceArrowUtils;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import static org.apache.hudi.common.util.TypeUtils.unsafeCast;
+
+/**
+ * {@link HoodieSparkFileReader} implementation for Lance file format.
+ */
+public class HoodieSparkLanceReader implements HoodieSparkFileReader {
+  // number of rows to read 
+  private static final int DEFAULT_BATCH_SIZE = 512; 
+  private final StoragePath path;
+  private final HoodieStorage storage;
+
+  public HoodieSparkLanceReader(HoodieStorage storage, StoragePath path) {
+this.path = path;
+this.storage = storage;
+  }
+
+  @Override
+  public String[] readMinMaxRecordKeys() {
+throw new UnsupportedOperationException("Min/max record key tracking is 
not yet supported for Lance file format");
+  }
+
+  @Override
+  public BloomFilter readBloomFilter() {
+throw new UnsupportedOperationException("Bloom filter is not yet supported 
for Lance file format");
+  }
+
+  @Override
+  public Set<Pair<String, Long>> filterRowKeys(Set<String> candidateRowKeys) {
+    Set<Pair<String, Long>> result = new HashSet<>();
+    long position = 0;
+
+    try (ClosableIterator<String> keyIterator = getRecordKeyIterator()) {
+      while (keyIterator.hasNext()) {
+        String recordKey = keyIterator.next();
+        // If the filter is empty/null, then all keys will be added.
+        // If the filter has specific keys, then ensure only those are added.
+        if (candidateRowKeys == null || candidateRowKeys.isEmpty()
+            || candidateRowKeys.contains(recordKey)) {
+          result.add(Pair.of(recordKey, position));
+        }
+        position++;
+      }
+    } catch (IOException e) {
+      throw new HoodieIOException("Failed to filter row keys from Lance file: " + path, e);
+    }
+
+    return result;
+  }
+
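+  // The reader schema is not used here; records are read directly with the requested schema.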
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema readerSchema, HoodieSchema requestedSchema) throws IOException {
+    return getRecordIterator(requestedSchema);
+  }
+
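+  // Wraps the UnsafeRow iterator so each row is exposed as a HoodieSparkRecord.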
+  @Override
+  public ClosableIterator<HoodieRecord<InternalRow>> getRecordIterator(HoodieSchema schema) throws IOException {
+    ClosableIterator<UnsafeRow> iterator = getUnsafeRowIterator();
+    return new CloseableMappingIterator<>(iterator, data -> unsafeCast(new HoodieSparkRecord(data)));
+  }
+
+  @Override
+  public ClosableIterator<String> getRecordKeyIterator() throws IOException {
+    //TODO to revisit adding su

Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-20 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3677871951

   
   ## CI report:
   
   * 3f835c210b9523627cb2e629cc8ce30dd9b7fa20 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10402)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-20 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3677849374

   
   ## CI report:
   
   * 6546ecca89fc7cbafc8ad861866d5dc553885b46 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10363)
 
   * 3f835c210b9523627cb2e629cc8ce30dd9b7fa20 Azure: 
[PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10402)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: Add HoodieSparkLanceReader for reading lance files to internal row [hudi]

2025-12-20 Thread via GitHub


hudi-bot commented on PR #17632:
URL: https://github.com/apache/hudi/pull/17632#issuecomment-3677848138

   
   ## CI report:
   
   * 6546ecca89fc7cbafc8ad861866d5dc553885b46 Azure: 
[FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10363)
 
   * 3f835c210b9523627cb2e629cc8ce30dd9b7fa20 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]