[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472696179



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
##
@@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with 
DataSourceRegister {
 paths ++ Option(map.get("path")).toSeq
   }
 
+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): 
CaseInsensitiveStringMap = {

Review comment:
   nit:
   ```
   val withoutPath = options.asCaseSensitiveMap().asScala.filterKeys {
 k => !k.equalsIgnoreCase("path") && !k.equalsIgnoreCase("paths")
   }
   new CaseInsensitiveStringMap(withoutPath.asJava)
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472694528



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
##
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
 name: String,
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,

Review comment:
   do we still need to rename it?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471936744



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,
 paths: Seq[String],
 userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {
 
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
+val caseInsensitiveMap = 
CaseInsensitiveMap(originalOptions.asCaseSensitiveMap.asScala.toMap)
+val caseInsensitiveMapWithoutPaths = caseInsensitiveMap - "paths" - "path"
+new CaseInsensitiveStringMap(
+  
caseInsensitiveMapWithoutPaths.asInstanceOf[CaseInsensitiveMap[String]].originalMap.asJava)
+  }

Review comment:
   I can't remember why we don't get the `paths` in this parent class. 
Seems clearer. One thing we need to take care is getting `paths` should be 
lazy. Please take a try.
   
   also cc @gengliangwang 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471934868



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters.originalMap

Review comment:
   This is consistent with `DataFrameReader.loadV1Source`, +1





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-17 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471930084



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
##
@@ -191,9 +191,11 @@ case class DataSource(
 val dataSchema = userSpecifiedSchema.map { schema =>
   StructType(schema.filterNot(f => partitionSchema.exists(p => 
equality(p.name, f.name
 }.orElse {
+  // Remove "path" option so that it is not added to the paths returned by
+  // `tempFileIndex.allFiles()`.
   format.inferSchema(
 sparkSession,
-caseInsensitiveOptions,
+caseInsensitiveOptions - "path",

Review comment:
   `FileFormat` is an internal API, I think it's natural to assume `paths` 
parameter will be used, and all builtin implementations do.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-17 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471312730



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 if (inputPaths.nonEmpty) {
-  Some(infer(sparkSession, inputPaths, parsedOptions))
+  // Remove "path" option so that it is not added to the given 
`inputPaths`.

Review comment:
   Good point on v2.
   
   I think in v2, the `FileTable.options` should not contain `path/paths` 
anymore, as we already extracted the paths from the options. Can we do some 
refactor to also clear the path options in a higher level for v2 as well?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471239203



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 if (inputPaths.nonEmpty) {
-  Some(infer(sparkSession, inputPaths, parsedOptions))
+  // Remove "path" option so that it is not added to the given 
`inputPaths`.

Review comment:
   Can we do it at the caller side of `FileFormat.inferSchema`? Then we 
don't need to change various file source implementations.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org