[ https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-26339:
---------------------------------
    Fix Version/s:     (was: 3.0.0)

> Behavior of reading files that start with underscore is confusing
> -----------------------------------------------------------------
>
>                 Key: SPARK-26339
>                 URL: https://issues.apache.org/jira/browse/SPARK-26339
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Keiichi Hirobe
>            Assignee: Keiichi Hirobe
>            Priority: Minor
>
> Reading files whose names start with an underscore behaves as follows.
> # spark.read (no schema) throws an exception whose message is confusing.
> # spark.read (with a user-specified schema) succeeds, but the resulting DataFrame is empty.
>
> Example files are as follows.
> The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-----+---+
> |  _c0|_c1|
> +-----+---+
> |test1| 10|
> |test2| 20|
> +-----+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-----+------+
> | test|number|
> +-----+------+
> |test1|    10|
> |test2|    20|
> +-----+------+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
>   at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
>   ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +----+------+
> |test|number|
> +----+------+
> +----+------+
> {code}
> After reading some of the code, I noticed that Spark cannot read files whose names start with an underscore. (I could not find any documentation of this file name limitation.)
> I think the above behavior is undesirable, especially in the user-specified schema case, where the read silently returns no rows.
> I suggest throwing an exception with the message "Path does not exist" in both cases.
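
Until the behavior changes, a caller-side guard can make the failure explicit. The following is only a minimal sketch under stated assumptions, not Spark's own filtering logic: `HiddenPathCheck`, `looksHidden`, and `requireVisible` are hypothetical names introduced for illustration, and treating a leading `.` as hidden is an assumption borrowed from the usual Hadoop convention rather than something reported in this issue.

```scala
// Hypothetical helper, not part of Spark.
object HiddenPathCheck {

  // True when the last path component starts with '_' (the case reported
  // in this issue) or '.' (an assumption, see the note above).
  def looksHidden(path: String): Boolean = {
    val name = path.split('/').filter(_.nonEmpty).lastOption.getOrElse("")
    name.startsWith("_") || name.startsWith(".")
  }

  // Returns the path unchanged, or throws IllegalArgumentException so the
  // caller fails fast instead of receiving an empty DataFrame.
  def requireVisible(path: String): String = {
    require(!looksHidden(path),
      s"'$path' starts with '_' or '.' and may be skipped when read")
    path
  }
}
```

In the spark-shell session above, `spark.read.schema("test STRING, number INT").csv(HiddenPathCheck.requireVisible("_test.csv"))` would then fail with an `IllegalArgumentException` before Spark is asked to read anything, while `requireVisible("test.csv")` simply passes the path through.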