[GitHub] spark pull request #16995: [SPARK-19340][SQL] CSV file will result in an exc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16995 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16995#discussion_r102133102 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -374,34 +374,42 @@ case class DataSource( globPath }.toArray -val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format) - -val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions && -catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) { - val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes - new CatalogFileIndex( -sparkSession, -catalogTable.get, - catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize)) -} else { - new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema)) -} - -HadoopFsRelation( - fileCatalog, - partitionSchema = partitionSchema, - dataSchema = dataSchema.asNullable, - bucketSpec = bucketSpec, - format, - caseInsensitiveOptions)(sparkSession) - +createHadoopRelation(format, globbedPaths) case _ => throw new AnalysisException( s"$className is not a valid Spark SQL Data Source.") } relation } + /** + * Creates Hadoop relation based on format and globbed file paths + * @param format format of the data source file + * @param globPaths Path to the file resolved by Hadoop library + * @return Hadoop relation object + */ + def createHadoopRelation(format: FileFormat, + globPaths: Array[Path]): BaseRelation = { --- End diff -- Let's make this inlined. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user lxsmnv commented on a diff in the pull request: https://github.com/apache/spark/pull/16995#discussion_r102012068

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -404,6 +386,35 @@ case class DataSource(
  }

  /**
+   * Creates Hadoop relation based on format and globbed file paths
+   * @param format format of the data source file
+   * @param globPaths Path to the file resolved by Hadoop library
+   * @return Hadoop relation object
+   */
+  def createHadoopRelation(format: FileFormat,
+      globPaths: Array[Path]): BaseRelation = {
+    val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format)
--- End diff --

@viirya I will fix this. Looks like a merge issue. @maropu I will add tests.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16995#discussion_r101961405

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -404,6 +386,35 @@ case class DataSource(
  }

  /**
+   * Creates Hadoop relation based on format and globbed file paths
+   * @param format format of the data source file
+   * @param globPaths Path to the file resolved by Hadoop library
+   * @return Hadoop relation object
+   */
+  def createHadoopRelation(format: FileFormat,
+      globPaths: Array[Path]): BaseRelation = {
+    val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format)
--- End diff --

You call `getOrInferFileFormatSchema` twice. One call is before calling `createHadoopRelation`.
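The review concern above is that the refactoring makes schema inference (an expensive pass over the input files) run twice per load: once before `createHadoopRelation` and once inside it. A minimal Python sketch (not Spark code; the `DataSource` class and its members here are hypothetical analogues) of the usual remedy, computing the result at most once per instance:

```python
import functools

# Hypothetical analogue of DataSource.getOrInferFileFormatSchema: the
# inference itself is expensive, so it is cached on the instance and
# every subsequent access reuses the first result.
class DataSource:
    def __init__(self, paths):
        self.paths = paths
        self.inference_count = 0  # counts how often inference actually runs

    @functools.cached_property
    def schema(self):
        self.inference_count += 1  # pretend this reads and parses files
        return ("dataSchema", "partitionSchema")  # placeholder result

ds = DataSource(["/tmp/test*"])
a = ds.schema
b = ds.schema  # served from the cache, no second inference pass
assert ds.inference_count == 1
```

The same effect can be had in Scala with a `lazy val`, which is how such once-per-instance computations are typically expressed in the Spark codebase.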
GitHub user lxsmnv opened a pull request: https://github.com/apache/spark/pull/16995 [SPARK-19340][SQL] CSV file will result in an exception if the filename contains special characters

## What changes were proposed in this pull request?

The root cause of the problem is that when Spark infers the schema of a CSV file, it resolves the file path pattern more than once, calling `DataSource.resolveRelation` each time. So, if we have a file path like `<...>/test*` and an actual file named `test{00-1}.txt`, then the initial call to `DataSource.resolveRelation` resolves the pattern to `/<...>/test{00-1}.txt`. When Spark then tries to infer the schema of the CSV file, it calls `DataSource.resolveRelation` a second time. This second attempt to resolve the path fails because the file name `/<...>/test{00-1}.txt` is treated as a glob pattern rather than an actual file, and since no file matches that pattern, the whole `DataSource.resolveRelation` call fails.

The idea behind the fix is straightforward: the part of `DataSource.resolveRelation` that creates a Hadoop relation from the resolved (actual) file names is moved to a separate function, `createHadoopRelation`. `CSVFileFormat.createBaseDataset` calls this new function instead of `DataSource.resolveRelation`, which caused the unnecessary file path resolution.

## How was this patch tested?

Manual tests.

This contribution is my original work and I license the work to the project under the project's open source license. Please review http://spark.apache.org/contributing.html before opening a pull request.
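The failure mode described above can be reproduced outside Spark. Hadoop's glob syntax treats `{00-1}` as alternation, while Python's `glob` treats `[1]` as a character class, so the following Python analogue (not Spark code) shows the same bug shape: a filename that matched a pattern on the first resolution stops matching when the resolved name is fed back in as a pattern.

```python
import glob
import os
import tempfile

tmp = tempfile.mkdtemp()

# A literal filename containing a glob metacharacter ("[1]" here plays the
# role of the "{00-1}" braces that Hadoop's globStatus interprets).
literal = os.path.join(tmp, "test[1].txt")
open(literal, "w").close()

# First resolution: the user's pattern "test*" matches the literal file.
first = glob.glob(os.path.join(tmp, "test*"))
assert first == [literal]

# Second resolution: resolving the already-resolved name again as a pattern
# fails, because "[1]" is now read as a character class matching "test1.txt",
# a file that does not exist.
second = glob.glob(first[0])
assert second == []
```

This is why the fix routes `CSVFileFormat.createBaseDataset` to a path that accepts already-resolved file names instead of re-running pattern resolution.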
You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lxsmnv/spark SPARK-19340

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16995.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16995

commit 507a929694653d49d1eb42398131743e0d004f65
Author: lxsmnv
Date: 2017-02-20T01:52:40Z

SPARK-19340 file path resolution for csv files fixed