[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470361#comment-16470361 ]

Apache Spark commented on SPARK-24068:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21292

> CSV schema inferring doesn't work for compressed files
> ------------------------------------------------------
>
>                 Key: SPARK-24068
>                 URL: https://issues.apache.org/jira/browse/SPARK-24068
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-+
> |�LZO|
> +-+
> |a|
> +-+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +----+----+
> |col1|col2|
> +----+----+
> |   a|   1|
> +----+----+
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
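
The inferred column name `�LZO` in the report looks like the start of the lzop file header rather than CSV text. One reading of that symptom (an assumption here, the ticket itself does not state the cause) is that the inference step saw the file's raw bytes without the codec being applied. The sketch below reproduces that effect with the JDK only; gzip is swapped in for LZO because the JDK ships no LZO codec, and the object and method names (`CompressedHeaderDemo`, `headerFromRawBytes`, `headerAfterDecompression`) are hypothetical, not part of Spark.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object CompressedHeaderDemo {
  // Compress text with gzip (a stand-in for lzop, which the JDK does not provide).
  def gzip(text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    out.write(text.getBytes(StandardCharsets.UTF_8))
    out.close()
    buffer.toByteArray
  }

  // Read the first line of a stream as UTF-8 text and split it on commas,
  // the way a naive CSV header reader would.
  private def headerOf(in: InputStream): Array[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    try Option(reader.readLine()).getOrElse("").split(",") finally reader.close()
  }

  // Header "inferred" from the compressed bytes as they are on disk.
  def headerFromRawBytes(bytes: Array[Byte]): Array[String] =
    headerOf(new ByteArrayInputStream(bytes))

  // Header read after the bytes have gone through the decompressor.
  def headerAfterDecompression(bytes: Array[Byte]): Array[String] =
    headerOf(new GZIPInputStream(new ByteArrayInputStream(bytes)))

  def main(args: Array[String]): Unit = {
    val compressed = gzip("col1,col2\na,1\n")
    println(headerFromRawBytes(compressed).mkString("[", ", ", "]"))
    println(headerAfterDecompression(compressed).mkString("[", ", ", "]"))
  }
}
```

Only the decompressed path recovers `col1` and `col2`; the raw path yields whatever the compressed header bytes happen to decode to, which matches the shape of the symptom in the report.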
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456654#comment-16456654 ]

Apache Spark commented on SPARK-24068:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21182
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455611#comment-16455611 ]

Hyukjin Kwon commented on SPARK-24068:
--------------------------------------

I roughly assume the fix will be small, similar or the same? I think it's fine to describe both issues here.
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454761#comment-16454761 ]

Maxim Gekk commented on SPARK-24068:
------------------------------------

The same issue exists in JSON datasource. [~hyukjin.kwon] Do we need a separate ticket for that?
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451483#comment-16451483 ]

Hyukjin Kwon commented on SPARK-24068:
--------------------------------------

Hm, [~maxgekk], btw is this specific to CSV (not, for example JSON)?