[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472718669



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/text/TextTable.scala
##
@@ -28,11 +28,11 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
 case class TextTable(
 name: String,
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,

Review comment:
   Fixed.

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
##
@@ -56,6 +57,13 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister {
 paths ++ Option(map.get("path")).toSeq
   }
 
+  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): CaseInsensitiveStringMap = {

Review comment:
   Thanks!
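
The hunk above shows only the new helper's signature. A minimal sketch of what its body could look like, assuming it simply drops the `path` and `paths` keys case-insensitively; the enclosing trait name below is made up so the snippet stands on its own, and the body is an assumption rather than the PR's exact code:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical stand-in for the FileDataSourceV2 mixin, only to host the sketch.
trait PathOptionStripping {
  // Drop "path" and "paths" (case-insensitively) and rebuild the map, so the
  // table never sees path-related entries among its read options.
  protected def getOptionsWithoutPaths(map: CaseInsensitiveStringMap): CaseInsensitiveStringMap = {
    val withoutPaths = map.asCaseSensitiveMap.asScala.filter { case (k, _) =>
      !k.equalsIgnoreCase("path") && !k.equalsIgnoreCase("paths")
    }
    new CaseInsensitiveStringMap(withoutPaths.toMap.asJava)
  }
}
```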








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472443571



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,
 paths: Seq[String],
 userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {
 
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
+    val caseInsensitiveMap = CaseInsensitiveMap(originalOptions.asCaseSensitiveMap.asScala.toMap)
+    val caseInsensitiveMapWithoutPaths = caseInsensitiveMap - "paths" - "path"
+    new CaseInsensitiveStringMap(
+      caseInsensitiveMapWithoutPaths.asInstanceOf[CaseInsensitiveMap[String]].originalMap.asJava)
+  }

Review comment:
   I tried moving `getPaths` to `FileTable` here: https://github.com/apache/spark/pull/29437/commits/15fd313044649031d4919108509e3827301565b7. That required moving `getName` (which depends on the paths) as well, and `FileTable` would then need a dependency on `DataSourceRegister` to get `shortName`. Also, with that approach it is still unclear whether to use the original options or the options without paths.
   
   So I followed @viirya's [comment](https://github.com/apache/spark/pull/29437#discussion_r471933928), where the options without paths are passed into `FileTable`, which seems more straightforward. The latest commit takes this approach; a sketch of the resulting call site is below.
   
   Please let me know which approach seems better. Thanks!
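
For reference, a sketch of the adopted shape on the provider side, roughly how a `CSVDataSourceV2`-style call site would look. The helper names follow the PR (`getPaths`, `getOptionsWithoutPaths`), but this fragment is illustrative and simplified (for instance, the table name is hard-coded), not the PR's literal code:

```scala
// Inside a FileDataSourceV2 implementation: resolve the paths from the original
// options, then hand the table an options map with "path"/"paths" already
// stripped, so schema inference cannot append the path value to the file list
// a second time.
override def getTable(options: CaseInsensitiveStringMap): Table = {
  val paths = getPaths(options)                              // reads "path" and "paths"
  val optionsWithoutPaths = getOptionsWithoutPaths(options)  // same options minus path keys
  CSVTable("csv", sparkSession, optionsWithoutPaths, paths, None, fallbackFileFormat)
}
```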








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-18 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r472426984



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,

Review comment:
   This seems like a cleaner approach; I updated the PR accordingly.








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-17 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471831707



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,
 paths: Seq[String],
 userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {
 
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {

Review comment:
   Note that this can still be masked:
   ```scala
   class DummyFileTable(
       sparkSession: SparkSession,
       options: CaseInsensitiveStringMap, // could be unintentional
       paths: Seq[String],
       expectedDataSchema: StructType,
       userSpecifiedSchema: Option[StructType])
     extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
   ```








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-17 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471830558



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
##
@@ -120,7 +120,7 @@ object TextInputJsonDataSource extends JsonDataSource {
 sparkSession,
 paths = inputPaths.map(_.getPath.toString),
 className = classOf[TextFileFormat].getName,
-options = parsedOptions.parameters
+options = parsedOptions.parameters.originalMap

Review comment:
   ditto

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -155,7 +155,7 @@ object TextInputCSVDataSource extends CSVDataSource {
 sparkSession,
 paths = paths,
 className = classOf[TextFileFormat].getName,
-options = options.parameters
+options = options.parameters.originalMap

Review comment:
   This is not related to this PR, but I think this should be case-sensitive for Hadoop configs, so I am fixing it here. (A small illustration of the difference is below.)
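
A small, self-contained illustration of why `.originalMap` matters here. The Hadoop-style key below is made up for the example; the point is only that `CaseInsensitiveMap` exposes lower-cased keys while `originalMap` preserves the caller's casing:

```scala
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

object CaseSensitivityDemo extends App {
  // Hypothetical option that must reach a Hadoop Configuration with its casing intact.
  val parameters = CaseInsensitiveMap(Map("fs.azureDfs.Account.Key" -> "secret", "sep" -> "|"))

  // Iterating the CaseInsensitiveMap yields lower-cased keys...
  println(parameters.keys.mkString(", "))              // fs.azuredfs.account.key, sep

  // ...while originalMap keeps the keys exactly as the user supplied them.
  println(parameters.originalMap.keys.mkString(", "))  // fs.azureDfs.Account.Key, sep
}
```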

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,
 paths: Seq[String],
 userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {
 
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {
+    val caseInsensitiveMap = CaseInsensitiveMap(originalOptions.asCaseSensitiveMap.asScala.toMap)
+    val caseInsensitiveMapWithoutPaths = caseInsensitiveMap - "paths" - "path"
+    new CaseInsensitiveStringMap(
+      caseInsensitiveMapWithoutPaths.asInstanceOf[CaseInsensitiveMap[String]].originalMap.asJava)
+  }

Review comment:
   @cloud-fan Should I move `FileDataSourceV2.getPaths` to this class so that all the operations on path-related options live in one place? (I wanted to check before making a bigger change.)

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
##
@@ -34,13 +35,21 @@ import org.apache.spark.sql.util.SchemaUtils
 
 abstract class FileTable(
 sparkSession: SparkSession,
-options: CaseInsensitiveStringMap,
+originalOptions: CaseInsensitiveStringMap,
 paths: Seq[String],
 userSpecifiedSchema: Option[StructType])
   extends Table with SupportsRead with SupportsWrite {
 
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
 
+  // Options without path-related options from `originalOptions`.
+  protected final lazy val options: CaseInsensitiveStringMap = {

Review comment:
   Note that this can still be masked:
   ```scala
   class DummyFileTable(
       sparkSession: SparkSession,
       options: CaseInsensitiveStringMap, // could be unintentional
       paths: Seq[String],
       expectedDataSchema: StructType,
       userSpecifiedSchema: Option[StructType])
     extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
   ```








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-17 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471253976



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 if (inputPaths.nonEmpty) {
-  Some(infer(sparkSession, inputPaths, parsedOptions))
+  // Remove "path" option so that it is not added to the given 
`inputPaths`.

Review comment:
   For v1, I can fix it on the caller side of `FileFormat.inferSchema` to cover all datasources. For v2, I can fix it in `CSVTable.inferSchema` and `JSONTable.inferSchema`; the other built-in datasources don't seem to be affected. WDYT? (A rough sketch of the v1 idea is below.)
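
A minimal sketch of the v1 idea, under the assumption that the stripping happens where `DataSource` invokes `FileFormat.inferSchema`; the surrounding value names (`caseInsensitiveOptions`, `format`, `fileIndex`) are assumptions for illustration, not the PR's exact code:

```scala
// Hypothetical caller-side fix (v1): drop the path-related keys before handing
// the options to inferSchema, so they cannot be resolved into input paths again.
val optionsWithoutPath = caseInsensitiveOptions - "path" - "paths"
val inferredSchema = format.inferSchema(sparkSession, optionsWithoutPath, fileIndex.allFiles())
```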








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471246355



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 if (inputPaths.nonEmpty) {
-  Some(infer(sparkSession, inputPaths, parsedOptions))
+  // Remove "path" option so that it is not added to the given 
`inputPaths`.

Review comment:
   And `FileTable.inferSchema` doesn't take in options.
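
For context, the v2 inference hook only receives the file listing, which is why the option stripping cannot happen inside it; roughly the following shape (quoted from memory of the `FileTable` API of that era, so treat the exact signature as an assumption):

```scala
// Implemented by CSVTable, JsonTable, etc.: schema inference is driven purely
// by the already-resolved FileStatus list, with no access to the read options.
def inferSchema(files: Seq[FileStatus]): Option[StructType]
```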








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471243651



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
 if (inputPaths.nonEmpty) {
-  Some(infer(sparkSession, inputPaths, parsedOptions))
+  // Remove "path" option so that it is not added to the given 
`inputPaths`.

Review comment:
   But it's not going to fix v2 datasources since `inferSchema` is called 
from `FileTable`.








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-15 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471026242



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##
@@ -1450,10 +1450,20 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
   val ds = sampledTestData.coalesce(1)
   ds.write.text(path.getAbsolutePath)
 
-  val readback = spark.read
+  val readback1 = spark.read
 .option("inferSchema", true).option("samplingRatio", 0.1)
 .csv(path.getCanonicalPath)
-  assert(readback.schema == new StructType().add("_c0", IntegerType))
+  assert(readback1.schema == new StructType().add("_c0", IntegerType))
+
+  // SPARK-32621: During infer, "path" option gets added again to the paths that have already

Review comment:
   Thanks!
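
The hunk above is cut off after the first read-back. A sketch of what the added SPARK-32621 check presumably looks like, with the directory supplied through the `path` option instead of `load(path)` (an assumption based on the truncated diff, not a verbatim copy):

```scala
// Hypothetical continuation of the test: reading via the "path" option must
// infer the same schema, i.e. the path is not fed into sampling twice.
val readback2 = spark.read
  .option("inferSchema", true).option("samplingRatio", 0.1)
  .option("path", path.getCanonicalPath)
  .format("csv")
  .load()
assert(readback2.schema == new StructType().add("_c0", IntegerType))
```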




