[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470361#comment-16470361
 ] 

Apache Spark commented on SPARK-24068:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21292

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Here is a simple CSV file compressed with lzop:
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with the LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-+
> |�LZO|
> +-+
> |a|
> +-+
> {code}
> The file can be read correctly, however, if the schema is specified explicitly:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +++
> |col1|col2|
> +++
> |   a|   1|
> +++
> {code}
> For reference, schema inference works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}
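The behavior above can be reproduced outside Spark. What follows is not Spark's actual inference code; it is a minimal standalone JVM sketch (using the JDK's gzip in place of LZO, to avoid a Hadoop dependency) of why sampling the raw bytes of a compressed file for schema inference yields a binary codec header instead of the CSV header line:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

object InferenceSketch {
  val csv = "col1,col2\na,1\n"

  // Compress the CSV in memory, as a codec would on disk.
  val compressed: Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz  = new GZIPOutputStream(bos)
    gz.write(csv.getBytes("UTF-8"))
    gz.close()
    bos.toByteArray
  }

  // Codec-unaware "inference": treat the raw bytes as text. The first two
  // bytes are the gzip magic number (0x1f 0x8b), not "co" from "col1" --
  // the analogue of the garbled LZO column name shown in the issue.
  val rawMagic: Seq[Int] = compressed.take(2).map(_ & 0xff).toSeq

  // Codec-aware "inference": decompress first, then read the header line.
  val decodedHeader: String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    Source.fromInputStream(in, "UTF-8").getLines().next()
  }
}

println(InferenceSketch.rawMagic)
println(InferenceSketch.decodedHeader) // col1,col2
```

The fix, correspondingly, is to route the sampled input through the configured compression codec before inference, as the main read path already does.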



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456654#comment-16456654
 ] 

Apache Spark commented on SPARK-24068:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21182




[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455611#comment-16455611
 ] 

Hyukjin Kwon commented on SPARK-24068:
--

I'd roughly assume the fix will be small, similar, or even the same? I think it's 
fine to describe both issues here.




[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-26 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454761#comment-16454761
 ] 

Maxim Gekk commented on SPARK-24068:


The same issue exists in the JSON datasource. [~hyukjin.kwon] Do we need a separate 
ticket for that?




[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451483#comment-16451483
 ] 

Hyukjin Kwon commented on SPARK-24068:
--

Hm, [~maxgekk], btw, is this specific to CSV (and not, for example, JSON)?
