[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

2017-06-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16995


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

2017-02-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16995#discussion_r102133102
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -374,34 +374,42 @@ case class DataSource(
          globPath
        }.toArray

-       val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format)
-
-       val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
-           catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
-         val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
-         new CatalogFileIndex(
-           sparkSession,
-           catalogTable.get,
-           catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
-       } else {
-         new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
-       }
-
-       HadoopFsRelation(
-         fileCatalog,
-         partitionSchema = partitionSchema,
-         dataSchema = dataSchema.asNullable,
-         bucketSpec = bucketSpec,
-         format,
-         caseInsensitiveOptions)(sparkSession)
-
+       createHadoopRelation(format, globbedPaths)
      case _ =>
        throw new AnalysisException(
          s"$className is not a valid Spark SQL Data Source.")
    }

    relation
  }
+  /**
+   * Creates Hadoop relation based on format and globbed file paths
+   * @param format format of the data source file
+   * @param globPaths Path to the file resolved by Hadoop library
+   * @return Hadoop relation object
+   */
+  def createHadoopRelation(format: FileFormat,
+                           globPaths: Array[Path]): BaseRelation = {
--- End diff --

Let's make this inlined.





[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

2017-02-20 Thread lxsmnv
Github user lxsmnv commented on a diff in the pull request:

https://github.com/apache/spark/pull/16995#discussion_r102012068
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -404,6 +386,35 @@ case class DataSource(
   }

+  /**
+   * Creates Hadoop relation based on format and globbed file paths
+   * @param format format of the data source file
+   * @param globPaths Path to the file resolved by Hadoop library
+   * @return Hadoop relation object
+   */
+  def createHadoopRelation(format: FileFormat,
+                           globPaths: Array[Path]): BaseRelation = {
+    val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format)
--- End diff --

@viirya I will fix this. Looks like a merge issue.
@maropu I will add tests.





[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

2017-02-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16995#discussion_r101961405
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -404,6 +386,35 @@ case class DataSource(
   }

+  /**
+   * Creates Hadoop relation based on format and globbed file paths
+   * @param format format of the data source file
+   * @param globPaths Path to the file resolved by Hadoop library
+   * @return Hadoop relation object
+   */
+  def createHadoopRelation(format: FileFormat,
+                           globPaths: Array[Path]): BaseRelation = {
+    val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format)
--- End diff --

You call `getOrInferFileFormatSchema` twice. The first call is before calling 
`createHadoopRelation`.





[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...

2017-02-19 Thread lxsmnv
GitHub user lxsmnv opened a pull request:

https://github.com/apache/spark/pull/16995

[SPARK-19340][SQL] CSV file will result in an exception if the filename 
contains special characters

## What changes were proposed in this pull request?
The root cause of the problem is that when Spark infers a schema from a CSV file, 
it tries to resolve the file path pattern more than once, calling 
DataSource.resolveRelation each time.

So, if we have a file path like:
<...>/test*
and the actual file is named test{00-1}.txt, then the initial call to 
DataSource.resolveRelation resolves the pattern to /<...>/test{00-1}.txt. When 
Spark then infers the schema for the CSV file, it calls DataSource.resolveRelation 
a second time. This second attempt fails because the resolved name 
/<...>/test{00-1}.txt is treated as a glob pattern rather than as an actual file, 
and since no file matches that pattern, the whole DataSource.resolveRelation call fails.
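The double-resolution failure can be sketched without Spark at all. The following hypothetical Scala snippet (the object name is made up, and `java.nio` glob matching is used as a stand-in; Hadoop's globbing treats braces analogously) shows why a resolved file name containing `{...}` no longer matches itself once it is re-interpreted as a pattern:

```scala
import java.nio.file.{FileSystems, Paths}

// Hypothetical, Spark-free sketch of the SPARK-19340 failure mode.
object GlobResolvedTwice {
  def main(args: Array[String]): Unit = {
    val fs = FileSystems.getDefault

    // First resolution: the user-supplied pattern "test*" matches the file.
    val userPattern = fs.getPathMatcher("glob:test*")
    println(userPattern.matches(Paths.get("test{00-1}.txt")))       // true

    // Second resolution: the resolved file name is fed back in as if it
    // were a pattern. "{00-1}" is now brace-group syntax (a single
    // alternative matching the literal "00-1"), so the pattern matches
    // "test00-1.txt" but no longer matches the actual file.
    val resolvedAsPattern = fs.getPathMatcher("glob:test{00-1}.txt")
    println(resolvedAsPattern.matches(Paths.get("test{00-1}.txt"))) // false
    println(resolvedAsPattern.matches(Paths.get("test00-1.txt")))   // true
  }
}
```

In Spark's case the second resolution happens inside DataSource.resolveRelation during CSV schema inference, which is why the read fails with an exception even though the file exists.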

The idea behind the fix is straightforward: the part of DataSource.resolveRelation 
that creates the Hadoop relation from the resolved (actual) file names is moved 
into a separate function, createHadoopRelation. CSVFileFormat.createBaseDataset 
now calls this new function instead of DataSource.resolveRelation, which avoids 
the unnecessary file path resolution.

## How was this patch tested?
Manual tests.

This contribution is my original work and I license the work to the project 
under the project’s open source license.
Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lxsmnv/spark SPARK-19340

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16995


commit 507a929694653d49d1eb42398131743e0d004f65
Author: lxsmnv 
Date:   2017-02-20T01:52:40Z

SPARK-19340 file path resolution for csv files fixed



