[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472718669

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala

```diff
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
     name: String,
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
```

Review comment: Fixed.

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala

```diff
@@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister {
     paths ++ Option(map.get("path")).toSeq
   }

+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): CaseInsensitiveStringMap = {
```

Review comment: Thanks!

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
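For illustration, here is a self-contained sketch of what a helper like `getOptionsWithoutPaths` boils down to, using a plain Scala `Map` instead of `CaseInsensitiveStringMap` (the object name and the lowercasing approach are simplifications for this sketch, not Spark's actual implementation):

```scala
// Simplified stand-in for a path-stripping helper: drop the "path" and
// "paths" keys case-insensitively, keeping all other options intact.
object OptionsWithoutPathsSketch {
  def withoutPaths(options: Map[String, String]): Map[String, String] =
    options.filterNot { case (key, _) =>
      val k = key.toLowerCase
      k == "path" || k == "paths"
    }
}
```

For example, `withoutPaths(Map("Path" -> "/tmp/in.csv", "header" -> "true"))` keeps only the `header` entry. The real helper additionally has to round-trip through `CaseInsensitiveMap` so that the remaining entries keep their original key casing.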
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472443571

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

```diff
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 abstract class FileTable(
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
     paths: Seq[String],
     userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {

   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
+    val caseInsensitiveMap = CaseInsensitiveMap(originalOptions.asCaseSensitiveMap.asScala.toMap)
+    val caseInsensitiveMapWithoutPaths = caseInsensitiveMap - "paths" - "path"
+    new CaseInsensitiveStringMap(
+      caseInsensitiveMapWithoutPaths.asInstanceOf[CaseInsensitiveMap[String]].originalMap.asJava)
+  }
```

Review comment: I tried to move `getPaths` to `FileTable` here: https://github.com/apache/spark/pull/29437/commits/15fd313044649031d4919108509e3827301565b7. That required moving `getName` (which depends on paths) as well, and it gave `FileTable` a dependency on `DataSourceRegister` just to get `shortName`. Also, with that approach it is still unclear whether the original options or the options without paths should be used. So I followed @viirya's [comment](https://github.com/apache/spark/pull/29437#discussion_r471933928), where options without paths are passed to `FileTable`, which seems more straightforward. The latest commit uses this approach. Please let me know which approach seems better. Thanks!
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r472426984

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

```diff
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 abstract class FileTable(
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
```

Review comment: This seems like a cleaner approach. I updated the PR accordingly.
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471831707

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

```diff
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 abstract class FileTable(
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
     paths: Seq[String],
     userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {

   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
```

Review comment: Note that this can still be masked:

```scala
class DummyFileTable(
    sparkSession: SparkSession,
    options: CaseInsensitiveStringMap, // could be unintentional
    paths: Seq[String],
    expectedDataSchema: StructType,
    userSpecifiedSchema: Option[StructType])
  extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
```
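The masking above is ordinary Scala name shadowing. A minimal, self-contained sketch (the class names here are hypothetical stand-ins, not Spark's): a non-`val` constructor parameter named `options` shadows the inherited `options` member throughout the subclass body, so the subclass silently works with the unfiltered map:

```scala
// Hypothetical stand-ins for the FileTable pattern, using plain Maps.
abstract class BaseTable(originalOptions: Map[String, String]) {
  // Filtered view, analogous to FileTable's `options` lazy val.
  protected final lazy val options: Map[String, String] =
    originalOptions - "path" - "paths"
}

// The constructor parameter `options` shadows BaseTable.options here.
class DummyTable(options: Map[String, String]) extends BaseTable(options) {
  // Resolves to the constructor parameter, NOT the filtered member.
  def visibleOptions: Map[String, String] = options
}
```

`new DummyTable(Map("path" -> "/tmp/data", "header" -> "true")).visibleOptions` still contains `"path"`: renaming the superclass parameter to `originalOptions` only helps if subclasses avoid reusing the name `options` for their own parameter.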
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471830558

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala

```diff
@@ -120,7 +120,7 @@ object TextInputJsonDataSource extends JsonDataSource {
       sparkSession,
       paths = inputPaths.map(_.getPath.toString),
       className = classOf[TextFileFormat].getName,
-      options = parsedOptions.parameters
+      options = parsedOptions.parameters.originalMap
```

Review comment: ditto

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

```diff
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
       sparkSession,
       paths = paths,
       className = classOf[TextFileFormat].getName,
-      options = options.parameters
+      options = options.parameters.originalMap
```

Review comment: This is not related to this PR, but I think this should be case-sensitive for Hadoop configs, so I am fixing it here.

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

```diff
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 abstract class FileTable(
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
     paths: Seq[String],
     userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {

   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
+    val caseInsensitiveMap = CaseInsensitiveMap(originalOptions.asCaseSensitiveMap.asScala.toMap)
+    val caseInsensitiveMapWithoutPaths = caseInsensitiveMap - "paths" - "path"
+    new CaseInsensitiveStringMap(
+      caseInsensitiveMapWithoutPaths.asInstanceOf[CaseInsensitiveMap[String]].originalMap.asJava)
+  }
```

Review comment: @cloud-fan Should I move `FileDataSourceV2.getPaths` to this class so that the operations with path-related options are in one place? (I wanted to make sure before making a bigger change.)

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala

```diff
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 abstract class FileTable(
     sparkSession: SparkSession,
-    options: CaseInsensitiveStringMap,
+    originalOptions: CaseInsensitiveStringMap,
     paths: Seq[String],
     userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {

   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
```

Review comment: Note that this can still be masked:

```scala
class DummyFileTable(
    sparkSession: SparkSession,
    options: CaseInsensitiveStringMap, // could be unintentional
    paths: Seq[String],
    expectedDataSchema: StructType,
    userSpecifiedSchema: Option[StructType])
  extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
```
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471253976

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

```diff
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.
```

Review comment: For v1, I can fix it at the caller side of `FileFormat.inferSchema` to cover all datasources. And for v2, I can fix it in `CSVTable.inferSchema` and `JSONTable.inferSchema`; the other built-in datasources don't seem to be affected. WDYT?
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471246355

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

```diff
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.
```

Review comment: And `FileTable.inferSchema` doesn't take options.
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471243651

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala

```diff
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
      inputPaths: Seq[FileStatus],
      parsedOptions: CSVOptions): Option[StructType] = {
    if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.
```

Review comment: But it's not going to fix v2 datasources, since `inferSchema` is called from `FileTable`.
imback82 commented on a change in pull request #29437: URL: https://github.com/apache/spark/pull/29437#discussion_r471026242

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

```diff
@@ -1450,10 +1450,20 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
     val ds = sampledTestData.coalesce(1)
     ds.write.text(path.getAbsolutePath)

-    val readback = spark.read
+    val readback1 = spark.read
       .option("inferSchema", true).option("samplingRatio", 0.1)
       .csv(path.getCanonicalPath)
-    assert(readback.schema == new StructType().add("_c0", IntegerType))
+    assert(readback1.schema == new StructType().add("_c0", IntegerType))
+
+    // SPARK-32621: During infer, "path" option gets added again to the paths that have already
```

Review comment: Thanks!
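The double-counting this test guards against can be reproduced with plain collections. A hypothetical sketch (mirroring the shape of `FileDataSourceV2.getPaths`, not Spark's actual code): when the `path` option is appended to a path list that was itself produced by the same step, the file ends up listed twice, which skews sampling-based schema inference:

```scala
object PathDuplicationSketch {
  // Mirrors the shape of FileDataSourceV2.getPaths: explicit paths plus
  // whatever the "path" option points at.
  def getPaths(paths: Seq[String], options: Map[String, String]): Seq[String] =
    paths ++ options.get("path")

  val options = Map("path" -> "/tmp/data.csv")

  // First resolution: the "path" option is appended once, as intended.
  val resolved: Seq[String] = getPaths(Seq.empty, options)

  // Re-resolving during inference with the same options duplicates the file.
  val reResolved: Seq[String] = getPaths(resolved, options)
}
```

Here `reResolved` contains `/tmp/data.csv` twice, so with `samplingRatio` the inferred schema is computed over the same rows counted twice; dropping the `path` option before inference avoids the second append.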