Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown merged PR #17660:
URL: https://github.com/apache/hudi/pull/17660
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2653136969
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##
@@ -59,7 +58,7 @@ protected HoodieFileWriter newParquetFileWriter(
      compressionCodecName = null;
    }
    // TODO: boundary to revisit in follow-up to use HoodieSchema directly
Review Comment:
@rahil-c make sure to remove this in your next PR
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698492808
## CI report:
* bf13d4ab6396b66a465e5682a47f9fc7f4c1c5bd Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10668)
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698356709
## CI report:
* 522ccfa35121e2f097a1dbe8cc8debd1cfcfd25a Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10664)
* bf13d4ab6396b66a465e5682a47f9fc7f4c1c5bd Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10668)
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698327614
## CI report:
* 522ccfa35121e2f097a1dbe8cc8debd1cfcfd25a Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10664)
* bf13d4ab6396b66a465e5682a47f9fc7f4c1c5bd UNKNOWN
yihua commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652204230
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##
@@ -424,8 +430,14 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
  override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
Review Comment:
As long as the Lance read works without the change in the `inferSchema` method, we can go without this change.
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698295564
## CI report:
* 522ccfa35121e2f097a1dbe8cc8debd1cfcfd25a Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10664)
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698290616
## CI report:
* c05368a3f64f1469c6032112a18cb6d2200255e3 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10650)
* 522ccfa35121e2f097a1dbe8cc8debd1cfcfd25a Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10664)
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3698288335
## CI report:
* c05368a3f64f1469c6032112a18cb6d2200255e3 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10650)
* 522ccfa35121e2f097a1dbe8cc8debd1cfcfd25a UNKNOWN
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652108515
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##
@@ -424,8 +430,14 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
  override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
Review Comment:
@yihua Discussed with Tim; for now I will opt to remove the changes in this code path and follow up in a separate PR for the `inferSchema` piece.
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652098172
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##
@@ -424,8 +430,14 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
  override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
    if (isMultipleBaseFileFormatsEnabled || hoodieFileFormat == HoodieFileFormat.PARQUET) {
      ParquetUtils.inferSchema(sparkSession, options, files)
-   } else {
+   } else if (hoodieFileFormat == HoodieFileFormat.ORC) {
      OrcUtils.inferSchema(sparkSession, files, options)
+   } else if (hoodieFileFormat == HoodieFileFormat.LANCE) {
+     // TODO: Implement Lance schema inference
Review Comment:
Will link the TODO here.
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-369870
## CI report:
* c05368a3f64f1469c6032112a18cb6d2200255e3 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10650)
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652068430
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##
@@ -424,8 +430,14 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
  override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
Review Comment:
Currently in this code path we leverage `ParquetUtils` from `org.apache.spark.sql.execution.datasources.parquet` and `OrcUtils` from `org.apache.spark.sql.execution.datasources.orc` to do the inference.
@yihua so to confirm: for Lance, should we throw an exception or return `None`?
yihua commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652050195
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##
@@ -424,8 +430,14 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath: String,
  override def inferSchema(sparkSession: SparkSession, options: Map[String, String], files: Seq[FileStatus]): Option[StructType] = {
Review Comment:
This method, `inferSchema`, should only be invoked if the schema is not
provided at the data source or relation level for Hudi. I think we should
consider throwing an exception or returning `None` for all file formats.
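The suggested behavior can be sketched as follows. This is a hypothetical, simplified Java stand-in (the real method is Scala in `HoodieFileGroupReaderBasedFileFormat` and returns `Option[StructType]`): since Hudi supplies the schema at the datasource/relation level, `inferSchema` can uniformly report "no schema" rather than dispatching per file format.

```java
import java.util.Optional;

// Hypothetical stand-ins for the real Hudi/Spark types, for illustration only.
enum FileFormat { PARQUET, ORC, LANCE }

class SchemaInference {
    // Sketch of the suggestion above: return empty for every format instead
    // of per-format inference, because the schema is already provided upstream.
    static Optional<String> inferSchema(FileFormat format) {
        return Optional.empty();
    }
}
```

An alternative mentioned in the same comment is throwing an exception here instead, which would surface misuse of the code path more loudly.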
yihua commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652015072
##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_4Adapter.scala:
##
@@ -154,6 +154,13 @@ class Spark3_4Adapter extends BaseSpark3Adapter {
    Spark34OrcReader.build(vectorized, sqlConf, options, hadoopConf, dataSchema)
  }
+ override def createLanceFileReader(vectorized: Boolean,
+                                    sqlConf: SQLConf,
+                                    options: Map[String, String],
+                                    hadoopConf: Configuration): SparkColumnarFileReader = {
+   throw new UnsupportedOperationException("Lance format is not supported in Spark 3.4")
Review Comment:
I couldn't recall when the decision was made. If it's easy to add Lance
support on Spark 3.4, we should do that.
yihua commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652007981
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieLanceRecordIterator.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Shared iterator implementation for reading Lance files and converting Arrow batches to Spark rows.
+ * This iterator is used by both Hudi's internal Lance reader and Spark datasource integration.
+ *
+ * The iterator manages the lifecycle of:
+ * - BufferAllocator - Arrow memory management
+ * - LanceFileReader - Lance file handle
+ * - ArrowReader - Arrow batch reader
+ * - ColumnarBatch - Current batch being iterated
+ *
+ * Records are converted to {@link UnsafeRow} using {@link UnsafeProjection} for efficient
+ * serialization and memory management.
+ */
+public class HoodieLanceRecordIterator implements ClosableIterator {
Review Comment:
For new classes under `org.apache.hudi`, we should avoid prefix `Hoodie` in
the class names for readability, e.g., `org.apache.hudi.storage.StoragePath`,
`org.apache.hudi.storage.StorageConfiguration`, unless `Hoodie` prefix in the
class name provides better clarity, e.g., `HoodieKeyIterator`.
For this class, the naming `LanceRecordIterator` is better.
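The lifecycle described in the class Javadoc above follows a common pattern: pull batches from an upstream reader, hand out rows one at a time, and release resources when done. A minimal, hypothetical Java sketch of that pattern (the batch and row types here are simple stand-ins, not the real Arrow/Spark classes):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of a batch-to-record iterator; int[] stands in for an
// Arrow batch, Integer for a converted row.
class BatchRecordIterator implements Iterator<Integer>, AutoCloseable {
    private final Iterator<int[]> batches; // stands in for the ArrowReader
    private int[] currentBatch;
    private int pos;

    BatchRecordIterator(Iterator<int[]> batches) {
        this.batches = batches;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next non-empty batch once the current one is drained.
        while (currentBatch == null || pos >= currentBatch.length) {
            if (!batches.hasNext()) {
                return false;
            }
            currentBatch = batches.next();
            pos = 0;
        }
        return true;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // In the real class, per-row conversion to UnsafeRow happens here.
        return currentBatch[pos++];
    }

    @Override
    public void close() {
        // The real class closes the ColumnarBatch, ArrowReader,
        // LanceFileReader, and BufferAllocator here.
        currentBatch = null;
    }
}
```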
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2652003692
##
pom.xml:
##
@@ -2562,7 +2564,7 @@
    3.3.2
    lance-spark-3.4_${scala.binary.version}
-   false
Review Comment:
I think that we should support 3.4 since it is supported by Lance and seems like minimal overhead at this point.
##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##
@@ -49,7 +49,12 @@ public enum HoodieFileFormat {
      + "way to store Hive data. It was designed to overcome limitations of the other Hive file "
      + "formats. Using ORC files improves performance when Hive is reading, writing, and "
      + "processing data.")
-  ORC(".orc");
+  ORC(".orc"),
+
+  @EnumFieldDescription("Lance is a modern columnar data format optimized for random access patterns, "
+      + "and designed for ML and AI workloads"
+      + "")
Review Comment:
```suggestion
  @EnumFieldDescription("Lance is a modern columnar data format optimized for random access patterns "
      + "and designed for ML and AI workloads")
```
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3697787616
## CI report:
* 40a558d9bb9a2a87d30dd0b91940436a0229976c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10634)
* c05368a3f64f1469c6032112a18cb6d2200255e3 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10650)
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3697785399
## CI report:
* 40a558d9bb9a2a87d30dd0b91940436a0229976c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10634)
* c05368a3f64f1469c6032112a18cb6d2200255e3 UNKNOWN
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651822072
##
pom.xml:
##
@@ -2562,7 +2564,7 @@
    3.3.2
    lance-spark-3.4_${scala.binary.version}
-   false
Review Comment:
If we are aiming to support Spark 3.4, then we want to ensure we run the tests, so it should be `false`.
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651619098
##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##
@@ -49,7 +49,11 @@ public enum HoodieFileFormat {
      + "way to store Hive data. It was designed to overcome limitations of the other Hive file "
      + "formats. Using ORC files improves performance when Hive is reading, writing, and "
      + "processing data.")
-  ORC(".orc");
+  ORC(".orc"),
+
+  @EnumFieldDescription("Lance is a modern columnar data format optimized for ML and AI workloads. "
+      + "It provides efficient random access, and integration with Apache Arrow.")
Review Comment:
I don't think it is important to call out Arrow. You can integrate with Arrow for Parquet as well.
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3697060666
## CI report:
* 40a558d9bb9a2a87d30dd0b91940436a0229976c Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10634)
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651366106
##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala:
##
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.DefaultSparkRecordMerger
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieSparkClientTestBase
+
+import org.apache.spark.sql.{SaveMode, SparkSession}
+import org.junit.jupiter.api.{AfterEach, BeforeEach, Test}
+import org.junit.jupiter.api.Assertions.assertEquals
+import org.junit.jupiter.api.condition.DisabledIfSystemProperty
+
+/**
+ * Basic functional tests for Lance file format with Hudi Spark datasource.
+ */
+@DisabledIfSystemProperty(named = "lance.skip.tests", matches = "true")
+class TestLanceDataSource extends HoodieSparkClientTestBase {
+
+ var spark: SparkSession = _
+
+  @BeforeEach
+  override def setUp(): Unit = {
+    super.setUp()
+    spark = sqlContext.sparkSession
+  }
+
+  @AfterEach
+  override def tearDown(): Unit = {
+    super.tearDown()
+    spark = null
+  }
+
+  @Test
+  def testBasicWriteAndRead(): Unit = {
+    val tableName = "test_lance_table"
+    val tablePath = s"$basePath/$tableName"
+
+    // Create test data
+    val records = Seq(
+      (1, "Alice", 30, 95.5),
+      (2, "Bob", 25, 87.3),
+      (3, "Charlie", 35, 92.1)
+    )
+    val df = spark.createDataFrame(records).toDF("id", "name", "age", "score")
+
+    // Write to Hudi table with Lance base file format
+    df.write
+      .format("hudi")
+      .option(HoodieTableConfig.BASE_FILE_FORMAT.key(), "LANCE")
+      .option(RECORDKEY_FIELD.key(), "id")
+      .option(PRECOMBINE_FIELD.key(), "age")
+      .option(TABLE_NAME.key(), tableName)
+      .option(HoodieWriteConfig.TBL_NAME.key(), tableName)
+      .option(HoodieWriteConfig.RECORD_MERGE_IMPL_CLASSES.key(), classOf[DefaultSparkRecordMerger].getName)
+      .mode(SaveMode.Overwrite)
+      .save(tablePath)
+
+    // Read back and verify
+    val readDf = spark.read
+      .format("hudi")
+      .load(tablePath)
+
+    val result = readDf.select("id", "name", "age", "score")
+      .orderBy("id")
+      .collect()
+
+    assertEquals(3, result.length, "Should read 3 records")
Review Comment:
The assertions are quite verbose. Is there a way to use Rows or Sequences to
make this leaner? I see the next PR for Lance also adds more tests so it will
be best to establish any required utilities now.
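One possible shape for such a utility is a single bulk comparison of collected rows against a sequence of expected values. The helper below is a hypothetical Java sketch of that idea (Spark's `Row` is stood in for by `Object[]`; the real utility would live in the Scala test support code):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical test utility: compare all rows in one assertion rather than
// one assertEquals per field per row.
class RowAssertions {
    static void assertRowsEqual(List<Object[]> expected, List<Object[]> actual) {
        if (expected.size() != actual.size()) {
            throw new AssertionError("Row count mismatch: expected "
                    + expected.size() + " but got " + actual.size());
        }
        for (int i = 0; i < expected.size(); i++) {
            if (!Arrays.equals(expected.get(i), actual.get(i))) {
                throw new AssertionError("Row " + i + " mismatch: expected "
                        + Arrays.toString(expected.get(i)) + " but got "
                        + Arrays.toString(actual.get(i)));
            }
        }
    }
}
```

With a helper like this, each test shrinks to one call with the expected tuples, which would also serve the additional Lance tests planned in the follow-up PR.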
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651353965
##
pom.xml:
##
@@ -2562,7 +2564,7 @@
    3.3.2
    lance-spark-3.4_${scala.binary.version}
-   false
Review Comment:
Should this be `false`?
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3696873549
## CI report:
* 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10614)
* 40a558d9bb9a2a87d30dd0b91940436a0229976c Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10634)
hudi-bot commented on PR #17660:
URL: https://github.com/apache/hudi/pull/17660#issuecomment-3696869215
## CI report:
* 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10614)
* 40a558d9bb9a2a87d30dd0b91940436a0229976c UNKNOWN
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651144217
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.lance
+
+import org.apache.hudi.common.util
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.io.memory.HoodieArrowAllocator
+import org.apache.hudi.io.storage.{HoodieLanceRecordIterator, HoodieSparkLanceReader}
+import org.apache.hudi.storage.StorageConfiguration
+
+import com.lancedb.lance.file.LanceFileReader
+import org.apache.hadoop.conf.Configuration
+import org.apache.parquet.schema.MessageType
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.{PartitionedFile, SparkColumnarFileReader}
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.LanceArrowUtils
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+
+/**
+ * Reader for Lance files in Spark datasource.
+ * Implements vectorized reading using LanceArrowColumnVector.
+ *
+ * @param enableVectorizedReader whether to use vectorized reading (currently always true for Lance)
+ */
+class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumnarFileReader {
+
+ // Batch size for reading Lance files (number of rows per batch)
+ private val DEFAULT_BATCH_SIZE = 512
+
+  /**
+   * Read a Lance file with schema projection and partition column support.
+   *
+   * @param file              Lance file to read
+   * @param requiredSchema    desired output schema of the data (columns to read)
+   * @param partitionSchema   schema of the partition columns. Partition values will be appended to the end of every row
+   * @param internalSchemaOpt option of internal schema for schema.on.read (not currently used for Lance)
+   * @param filters           filters for data skipping. Not guaranteed to be used; the spark plan will also apply the filters.
+   * @param storageConf       the hadoop conf
+   * @return iterator of rows read from the file; output type says [[InternalRow]] but could be [[ColumnarBatch]]
+   */
+  override def read(file: PartitionedFile,
+                    requiredSchema: StructType,
+                    partitionSchema: StructType,
+                    internalSchemaOpt: util.Option[InternalSchema],
+                    filters: Seq[Filter],
+                    storageConf: StorageConfiguration[Configuration],
+                    tableSchemaOpt: util.Option[MessageType] = util.Option.empty()): Iterator[InternalRow] = {
+
+val filePath = file.filePath.toString
+
+if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
+ // No columns requested - return empty iterator
+ Iterator.empty
+} else {
+ // Create child allocator for reading
+ val allocator =
HoodieArrowAllocator.newChildAllocator(getClass.getSimpleName + "-data-" +
filePath,
+HoodieSparkLanceReader.LANCE_DATA_ALLOCATOR_SIZE);
+
+ try {
+// Open Lance file reader
+val lanceReader = LanceFileReader.open(filePath, allocator)
+
+// Extract column names from required schema for projection
+val columnNames: java.util.List[String] = if (requiredSchema.nonEmpty)
{
+ requiredSchema.fields.map(_.name).toList.asJava
+} else {
+ // If only partition columns requested, read minimal data
+ null
+}
+
+// Get schema from Lance file for HoodieLanceRecordIterator
+val arrowSchema = lanceReader.schema()
+val sparkSchema = LanceArrowUtils.fromArrowSchema(arrowSchema)
+
+// Read data with column projection (filters not supported yet)
+val arrowReader = lanceReader.readAll(columnNames, null,
DEFAULT_BATCH_SIZE)
+
+// Create iterator using shared HoodieLanceRecordIterator
+val lanceIterator
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651096559
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##
@@ -0,0 +1,136 @@
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651147479
##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##
@@ -49,7 +49,11 @@ public enum HoodieFileFormat {
+ "way to store Hive data. It was designed to overcome limitations of
the other Hive file "
+ "formats. Using ORC files improves performance when Hive is reading,
writing, and "
+ "processing data.")
- ORC(".orc");
+ ORC(".orc"),
+
+ @EnumFieldDescription("Lance is a modern columnar data format optimized for
ML and AI workloads. "
+ + "It provides efficient random access, versioning, and integration with
Apache Arrow.")
Review Comment:
@the-other-tim-brown Actually you are correct, it is a property of the Lance table format, so it
probably does not make sense to include this since we are only integrating with Lance at the
file format level. Let me remove this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651144849
##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_4Adapter.scala:
##
@@ -154,6 +154,13 @@ class Spark3_4Adapter extends BaseSpark3Adapter {
Spark34OrcReader.build(vectorized, sqlConf, options, hadoopConf,
dataSchema)
}
+ override def createLanceFileReader(vectorized: Boolean,
+ sqlConf: SQLConf,
+ options: Map[String, String],
+ hadoopConf: Configuration):
SparkColumnarFileReader = {
+throw new UnsupportedOperationException("Lance format is not supported in
Spark 3.4")
Review Comment:
Will add Spark 3.4 as well.
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651144217
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##
@@ -0,0 +1,136 @@
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651104039
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##
@@ -0,0 +1,136 @@
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651098927
##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_4Adapter.scala:
##
@@ -154,6 +154,13 @@ class Spark3_4Adapter extends BaseSpark3Adapter {
Spark34OrcReader.build(vectorized, sqlConf, options, hadoopConf,
dataSchema)
}
+ override def createLanceFileReader(vectorized: Boolean,
+ sqlConf: SQLConf,
+ options: Map[String, String],
+ hadoopConf: Configuration):
SparkColumnarFileReader = {
+throw new UnsupportedOperationException("Lance format is not supported in
Spark 3.4")
Review Comment:
If it is not a lot of extra work to add 3.4 support, then I don't see why we
would skip it.
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2651097334
##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##
@@ -49,7 +49,11 @@ public enum HoodieFileFormat {
+ "way to store Hive data. It was designed to overcome limitations of
the other Hive file "
+ "formats. Using ORC files improves performance when Hive is reading,
writing, and "
+ "processing data.")
- ORC(".orc");
+ ORC(".orc"),
+
+ @EnumFieldDescription("Lance is a modern columnar data format optimized for
ML and AI workloads. "
+ + "It provides efficient random access, versioning, and integration with
Apache Arrow.")
Review Comment:
This looks like it is a property of the table format, not the file format
though. Maybe I am missing something? Where will versioning get added to Hudi?
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2650957828
##
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/adapter/Spark3_4Adapter.scala:
##
@@ -154,6 +154,13 @@ class Spark3_4Adapter extends BaseSpark3Adapter {
Spark34OrcReader.build(vectorized, sqlConf, options, hadoopConf,
dataSchema)
}
+ override def createLanceFileReader(vectorized: Boolean,
+ sqlConf: SQLConf,
+ options: Map[String, String],
+ hadoopConf: Configuration):
SparkColumnarFileReader = {
+throw new UnsupportedOperationException("Lance format is not supported in
Spark 3.4")
Review Comment:
When I discussed with @yihua, I believe we only wanted to support Lance for the latest two
Hudi Spark versions (3.5, 4.0).
In terms of what Lance supports, the docs show support for 3.4 and beyond:
https://lance.org/integrations/spark/install/#scala, so if we feel that we should include
Spark 3.4.0 then I will add this.
(screenshot: https://github.com/user-attachments/assets/aa23910e-3f9f-4afe-9709-1e4ab8bbf04b)
cc @the-other-tim-brown @nsivabalan
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2650951391
##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##
@@ -49,7 +49,11 @@ public enum HoodieFileFormat {
+ "way to store Hive data. It was designed to overcome limitations of
the other Hive file "
+ "formats. Using ORC files improves performance when Hive is reading,
writing, and "
+ "processing data.")
- ORC(".orc");
+ ORC(".orc"),
+
+ @EnumFieldDescription("Lance is a modern columnar data format optimized for
ML and AI workloads. "
+ + "It provides efficient random access, versioning, and integration with
Apache Arrow.")
Review Comment:
Versioning in this context means that Lance keeps track of all table changes (inserts,
updates, deletes) and provides the ability to go back to a specific state of the table.
https://docs.lancedb.com/tables/versioning
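As a toy illustration of the snapshot-style versioning described above (this sketch does not use the Lance API; `VersionedTable`, `commit`, and `checkout` are invented names), a version-tracked table can be modeled as an append-only list of immutable snapshots, where "time travel" is simply reading an older snapshot:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model of snapshot-based versioning (illustrative only; NOT the Lance API):
// each commit appends an immutable snapshot of the table state, and "time
// travel" reads an older snapshot by its 1-based version number.
public class VersionedTable {
    private final List<Map<String, String>> snapshots = new ArrayList<>();

    // Commit a new table state; returns the new version number (1-based).
    public int commit(Map<String, String> state) {
        snapshots.add(Collections.unmodifiableMap(state));
        return snapshots.size();
    }

    // Read the latest committed state.
    public Map<String, String> latest() {
        return snapshots.get(snapshots.size() - 1);
    }

    // Time travel: read the state as of a past version.
    public Map<String, String> checkout(int version) {
        return snapshots.get(version - 1);
    }

    public static void main(String[] args) {
        VersionedTable table = new VersionedTable();
        int v1 = table.commit(Map.of("name", "alice"));
        table.commit(Map.of("name", "bob")); // later update replaces latest state
        System.out.println(table.checkout(v1).get("name")); // alice
        System.out.println(table.latest().get("name"));     // bob
    }
}
```

The point of the model is the one Tim raises above: version history is a property of the table format's commit log, not of an individual data file, which is why it does not belong in `HoodieFileFormat`.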
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2650944296
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieLanceRecordIterator.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Shared iterator implementation for reading Lance files and converting Arrow batches to Spark rows.
+ * This iterator is used by both Hudi's internal Lance reader and Spark datasource integration.
+ *
+ * <p>The iterator manages the lifecycle of:
+ * <ul>
+ *   <li>BufferAllocator - Arrow memory management</li>
+ *   <li>LanceFileReader - Lance file handle</li>
+ *   <li>ArrowReader - Arrow batch reader</li>
+ *   <li>ColumnarBatch - Current batch being iterated</li>
+ * </ul>
+ *
+ * <p>Records are converted to {@link UnsafeRow} using {@link UnsafeProjection} for efficient
+ * serialization and memory management.
+ */
+public class HoodieLanceRecordIterator implements ClosableIterator<UnsafeRow> {
Review Comment:
@the-other-tim-brown @yihua It seems the existing pattern is to prefix classes with `Hoodie`:
https://github.com/user-attachments/assets/affbe261-0f64-47fa-9758-7c4d4dad8251
I am not opinionated on whether we have the prefix or not.
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieLanceRecordIterator.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Shared iterator implementation for reading Lance files and converting Arrow batches to Spark rows.
+ * This iterator is used by both Hudi's internal Lance reader and Spark datasource integration.
+ *
+ * The iterator manages the lifecycle of:
+ *
+ * BufferAlloca
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2649890213
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieLanceRecordIterator.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Shared iterator implementation for reading Lance files and converting Arrow batches to Spark rows.
+ * This iterator is used by both Hudi's internal Lance reader and Spark datasource integration.
+ *
+ * <p>The iterator manages the lifecycle of:
+ * <ul>
+ *   <li>BufferAllocator - Arrow memory management</li>
+ *   <li>LanceFileReader - Lance file handle</li>
+ *   <li>ArrowReader - Arrow batch reader</li>
+ *   <li>ColumnarBatch - Current batch being iterated</li>
+ * </ul>
+ *
+ * <p>Records are converted to {@link UnsafeRow} using {@link UnsafeProjection} for efficient
+ * serialization and memory management.
+ */
+public class HoodieLanceRecordIterator implements ClosableIterator<UnsafeRow> {
+  private final BufferAllocator allocator;
+  private final LanceFileReader lanceReader;
+  private final ArrowReader arrowReader;
+  private final UnsafeProjection projection;
+  private final String path;
+
+  private ColumnarBatch currentBatch;
+  private Iterator<InternalRow> rowIterator;
+  private ColumnVector[] columnVectors;
+  private boolean closed = false;
+
+  /**
+   * Creates a new Lance record iterator.
+   *
+   * @param allocator   Arrow buffer allocator for memory management
+   * @param lanceReader Lance file reader
+   * @param arrowReader Arrow reader for batch reading
+   * @param schema      Spark schema for the records
+   * @param path        File path (for error messages)
+   */
+  public HoodieLanceRecordIterator(BufferAllocator allocator,
+                                   LanceFileReader lanceReader,
+                                   ArrowReader arrowReader,
+                                   StructType schema,
+                                   String path) {
+    this.allocator = allocator;
+    this.lanceReader = lanceReader;
+    this.arrowReader = arrowReader;
+    this.projection = UnsafeProjection.create(schema);
+    this.path = path;
+  }
+
+  @Override
+  public boolean hasNext() {
+    // If we have records in current batch, return true
+    if (rowIterator != null && rowIterator.hasNext()) {
+      return true;
+    }
+
+    // Close previous batch before loading next
+    if (currentBatch != null) {
+      currentBatch.close();
+      currentBatch = null;
+    }
+
+    // Try to load next batch
+    try {
+      if (arrowReader.loadNextBatch()) {
+        VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
+
+        // Wrap each Arrow FieldVector in LanceArrowColumnVector for type-safe access
+        // Cache the column wrappers on first batch and reuse for all subsequent batches
+        if (columnVectors == null) {
+          columnVectors = root.getFieldVectors().stream()
+              .map(LanceArrowColumnVector::new)
+              .toArray(ColumnVector[]::new);
+        }
+
+        // Create ColumnarBatch and keep it alive while iterating
+        currentBatch = new ColumnarBatch(columnVectors, root.getRowCount());
+        rowIterator = currentBatch.rowIterator();
+        return rowIterator.hasNext();
+      }
+    } catch (IOException e) {
+      throw new HoodieException("Failed to read ne
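The `hasNext()` logic quoted above follows a common batch-draining pattern: serve rows from the current batch, release it once exhausted, then try to load the next batch. A minimal, self-contained sketch of that pattern in plain Java (no Arrow or Spark types; `BatchRecordIterator` and its in-memory batch source are illustrative stand-ins, not part of the PR):

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Simplified sketch of the batch-draining pattern used by hasNext() above:
 * drain the current batch, release it, then load the next one.
 */
class BatchRecordIterator<T> implements Iterator<T>, AutoCloseable {
  private final Iterator<List<T>> batchSource; // stands in for ArrowReader.loadNextBatch()
  private Iterator<T> rowIterator;             // rows of the current batch
  private List<T> currentBatch;

  BatchRecordIterator(Iterator<List<T>> batchSource) {
    this.batchSource = batchSource;
  }

  @Override
  public boolean hasNext() {
    // Serve remaining rows from the current batch first
    if (rowIterator != null && rowIterator.hasNext()) {
      return true;
    }
    // "Close" the exhausted batch before loading the next
    // (the real iterator calls ColumnarBatch#close here)
    currentBatch = null;
    while (batchSource.hasNext()) {
      currentBatch = batchSource.next();
      rowIterator = currentBatch.iterator();
      if (rowIterator.hasNext()) { // skip empty batches
        return true;
      }
    }
    return false;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return rowIterator.next();
  }

  @Override
  public void close() {
    // The real iterator also closes the ArrowReader, LanceFileReader, and allocator here
    currentBatch = null;
  }
}
```

Note the empty-batch loop: the real code returns `rowIterator.hasNext()` after loading one batch, which relies on Lance never producing empty non-final batches; the sketch loops defensively instead.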
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
the-other-tim-brown commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2649886085
##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieLanceRecordIterator.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import com.lancedb.lance.file.LanceFileReader;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.apache.spark.sql.vectorized.LanceArrowColumnVector;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Shared iterator implementation for reading Lance files and converting Arrow batches to Spark rows.
+ * This iterator is used by both Hudi's internal Lance reader and Spark datasource integration.
+ *
+ * <p>The iterator manages the lifecycle of:
+ * <ul>
+ *   <li>BufferAllocator - Arrow memory management</li>
+ *   <li>LanceFileReader - Lance file handle</li>
+ *   <li>ArrowReader - Arrow batch reader</li>
+ *   <li>ColumnarBatch - Current batch being iterated</li>
+ * </ul>
+ *
+ * <p>Records are converted to {@link UnsafeRow} using {@link UnsafeProjection} for efficient
+ * serialization and memory management.
+ */
+public class HoodieLanceRecordIterator implements ClosableIterator<UnsafeRow> {
Review Comment:
@yihua general question on naming, should we be prefixing all our classes
with `Hoodie`? Seems like we can just leverage the package to know it
belongs to this project
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-369507 ## CI report: * 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10614) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694434923 ## CI report: * 6c2f6a18ac714fa78a56c4606267db60c84c47f9 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10613) * 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10614) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694426638 ## CI report: * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) * 6c2f6a18ac714fa78a56c4606267db60c84c47f9 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10613) * 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10614) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694423578 ## CI report: * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) * 6c2f6a18ac714fa78a56c4606267db60c84c47f9 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10613) * 5189042e4b9b27b0a23c7bfc1ae5ac2b123edadf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694396900 ## CI report: * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) * 6c2f6a18ac714fa78a56c4606267db60c84c47f9 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10613) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694396208 ## CI report: * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) * 6c2f6a18ac714fa78a56c4606267db60c84c47f9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694318733 ## CI report: * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694271829 ## CI report: * f2fae6dcf117638d8a786d519a4fbd82fa134f32 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10609) * b9e3de02de56ddf044be8368cdc79e54ba2ad56b Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10611) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694271169 ## CI report: * f2fae6dcf117638d8a786d519a4fbd82fa134f32 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10609) * b9e3de02de56ddf044be8368cdc79e54ba2ad56b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694237245 ## CI report: * f2fae6dcf117638d8a786d519a4fbd82fa134f32 Azure: [SUCCESS](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10609) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694185293 ## CI report: * ad801e71d618ec2a2aa391e86ce7397ff7c34555 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10607) * f2fae6dcf117638d8a786d519a4fbd82fa134f32 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10609) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694184601 ## CI report: * ad801e71d618ec2a2aa391e86ce7397ff7c34555 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10607) * f2fae6dcf117638d8a786d519a4fbd82fa134f32 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660: URL: https://github.com/apache/hudi/pull/17660#discussion_r2649290633 ## pom.xml: ## @@ -2516,9 +2516,6 @@ hudi-spark3-common 2.8.1 - Review Comment: not sure why this is removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
rahil-c commented on code in PR #17660:
URL: https://github.com/apache/hudi/pull/17660#discussion_r2649278871
##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.lance
+
+import org.apache.hudi.common.util
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.io.storage.HoodieLanceRecordIterator
+import org.apache.hudi.storage.StorageConfiguration
+
+import com.lancedb.lance.file.LanceFileReader
+import org.apache.arrow.memory.RootAllocator
+import org.apache.hadoop.conf.Configuration
+import org.apache.parquet.schema.MessageType
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.{PartitionedFile,
SparkColumnarFileReader}
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.LanceArrowUtils
+
+import java.io.IOException
+
+import scala.collection.JavaConverters._
+
+/**
+ * Reader for Lance files in Spark datasource.
+ * Implements vectorized reading using LanceArrowColumnVector.
+ *
+ * @param enableVectorizedReader whether to use vectorized reading (currently always true for Lance)
+ */
+class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumnarFileReader {
+
+ // Batch size for reading Lance files (number of rows per batch)
+ private val DEFAULT_BATCH_SIZE = 512
+
+  /**
+   * Read a Lance file with schema projection and partition column support.
+   *
+   * @param file              Lance file to read
+   * @param requiredSchema    desired output schema of the data (columns to read)
+   * @param partitionSchema   schema of the partition columns. Partition values will be appended to the end of every row
+   * @param internalSchemaOpt option of internal schema for schema.on.read (not currently used for Lance)
+   * @param filters           filters for data skipping. Not guaranteed to be used; the spark plan will also apply the filters.
+   * @param storageConf       the hadoop conf
+   * @return iterator of rows read from the file; output type says [[InternalRow]] but could be [[ColumnarBatch]]
+   */
+  override def read(file: PartitionedFile,
+                    requiredSchema: StructType,
+                    partitionSchema: StructType,
+                    internalSchemaOpt: util.Option[InternalSchema],
+                    filters: Seq[Filter],
+                    storageConf: StorageConfiguration[Configuration],
+                    tableSchemaOpt: util.Option[MessageType] = util.Option.empty()): Iterator[InternalRow] = {
+
+    val filePath = file.filePath.toString
+
+    if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
+      // No columns requested - return empty iterator
+      Iterator.empty
+    } else {
+      // Create Arrow allocator for reading
+      val allocator = new RootAllocator(Long.MaxValue)
Review Comment:
use hoodie arrow allocator
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
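The `read()` contract quoted above says partition values are appended to the end of every row (the `JoinedRow` import hints at how Spark does this). A minimal sketch of that appending step in plain Java, with arrays standing in for Spark's `InternalRow` (`PartitionAppendingIterator` is an illustrative name, not part of the PR):

```java
import java.util.Arrays;
import java.util.Iterator;

/**
 * Simplified sketch of what read() does with partitionSchema: every data row
 * coming out of the file gets the partition values appended at the end
 * (Spark implements this with a JoinedRow over the data row and the
 * partition InternalRow).
 */
class PartitionAppendingIterator implements Iterator<Object[]> {
  private final Iterator<Object[]> dataRows;
  private final Object[] partitionValues;

  PartitionAppendingIterator(Iterator<Object[]> dataRows, Object[] partitionValues) {
    this.dataRows = dataRows;
    this.partitionValues = partitionValues;
  }

  @Override
  public boolean hasNext() {
    return dataRows.hasNext();
  }

  @Override
  public Object[] next() {
    Object[] row = dataRows.next();
    // Append partition columns after the projected data columns
    Object[] joined = Arrays.copyOf(row, row.length + partitionValues.length);
    System.arraycopy(partitionValues, 0, joined, row.length, partitionValues.length);
    return joined;
  }
}
```

Spark's `JoinedRow` avoids the copy by presenting the two rows as one view, which is why the real reader prefers it over materializing a wider row per record.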
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694155353 ## CI report: * ad801e71d618ec2a2aa391e86ce7397ff7c34555 Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10607) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694153943 ## CI report: * 8007a690738625e6e1a38f577270694a094350bb Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10418) * ad801e71d618ec2a2aa391e86ce7397ff7c34555 Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10607) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3694153219 ## CI report: * 8007a690738625e6e1a38f577270694a094350bb Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10418) * ad801e71d618ec2a2aa391e86ce7397ff7c34555 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3678838648 ## CI report: * 8007a690738625e6e1a38f577270694a094350bb Azure: [FAILURE](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10418) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3678771544 ## CI report: * 8007a690738625e6e1a38f577270694a094350bb Azure: [PENDING](https://dev.azure.com/apachehudi/a1a51da7-8592-47d4-88dc-fd67bed336bb/_build/results?buildId=10418) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] feat: Implement SparkColumnarFileReader for Datasource integration with Lance [hudi]
hudi-bot commented on PR #17660: URL: https://github.com/apache/hudi/pull/17660#issuecomment-3678770337 ## CI report: * 8007a690738625e6e1a38f577270694a094350bb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
