[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option

2020-03-15 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-29280:
-
Affects Version/s: 3.1.0  (was: 3.0.0)

> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> friction in the following cases:
>  # You want to read some data compressed with a codec that Spark does not 
> [load by 
> default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
>  # You want to read some data with a codec that overrides one of the built-in 
> codecs that Spark supports.
> # You want to explicitly instruct Spark on what codec to use on read when it 
> will not be able to correctly auto-detect it (e.g. because the file extension 
> is [missing|https://stackoverflow.com/q/52011697/877069], 
> [non-standard|https://stackoverflow.com/q/44372995/877069], or 
> [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called 
> [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
> load a single gzipped file using multiple concurrent tasks. (You can see the 
> details of how it works and why it's useful in the project README and in 
> SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}. This is a Hadoop 
> configuration property rather than a Spark one, which is why it [isn't 
> documented by Spark|http://spark.apache.org/docs/latest/configuration.html]. 
> Confusingly, there is also a setting called {{spark.io.compression.codec}}, 
> which serves a different purpose: it selects the codec for Spark's internal 
> data (e.g. shuffle output), not for input files.
> It would be much clearer for the user and more consistent with the writer 
> interface if the reader let you directly specify the codec.
> For example, I think all of the following should be possible:
> {code:python}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
> spark.read.json(..., compression='none')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option

2019-09-27 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-29280:
-
Description: 
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing|https://stackoverflow.com/q/52011697/877069], 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]). A sketch of this 
failure mode follows the list.
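
As a rough sketch of case #3 (the file path is made up): Spark infers the 
decompression codec from the file extension, so a gzipped file without a 
{{.gz}} extension comes back as raw, undecoded bytes:
{code:python}
# Hypothetical: events.log is gzip-compressed but lacks the .gz extension.
# Spark infers the codec from the extension, so no decompression happens
# and the rows come back as unreadable binary.
df = spark.read.csv('/data/events.log')

# The usual workaround is renaming the file so auto-detection kicks in:
df = spark.read.csv('/data/events.log.gz')
{code}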

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. This is a Hadoop 
configuration property rather than a Spark one, which is why it [isn't 
documented by Spark|http://spark.apache.org/docs/latest/configuration.html]. 
Confusingly, there is also a setting called {{spark.io.compression.codec}}, 
which serves a different purpose: it selects the codec for Spark's internal 
data (e.g. shuffle output), not for input files.
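
For reference, a minimal sketch of what that workaround looks like today, 
assuming Spark's {{spark.hadoop.*}} passthrough for Hadoop settings (the 
input path is made up):
{code:python}
from pyspark.sql import SparkSession

# Current workaround: register the custom codec through the Hadoop
# configuration via Spark's spark.hadoop.* passthrough. The codec jar
# must also be on the driver and executor classpaths.
spark = (
    SparkSession.builder
    .config(
        'spark.hadoop.io.compression.codecs',
        'nl.basjes.hadoop.io.compress.SplittableGzipCodec',
    )
    .getOrCreate()
)

# The codec now applies to every read in this session, not just one call.
df = spark.read.csv('/data/big-file.csv.gz')
{code}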

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.
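
For comparison, this is what the writer side already allows, given some 
DataFrame {{df}} (the output path is made up):
{code:python}
# Existing DataFrameWriter interface: the codec is specified directly.
df.write.option('compression', 'gzip').csv('/data/out')

# Equivalent keyword-argument form:
df.write.csv('/data/out', compression='gzip')
{code}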

For example, I think all of the following should be possible:
{code:python}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
spark.read.json(..., compression='none')
{code}

  was:
[DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
 supports a {{compression}} option, but 
[DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
 doesn't. The lack of a {{compression}} option in the reader causes some 
friction in the following cases:
 # You want to read some data compressed with a codec that Spark does not [load 
by 
default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
 # You want to read some data with a codec that overrides one of the built-in 
codecs that Spark supports.
 # You want to explicitly instruct Spark on what codec to use on read when it 
will not be able to correctly auto-detect it (e.g. because the file extension 
is [missing|https://stackoverflow.com/q/52011697/877069], 
[non-standard|https://stackoverflow.com/q/44372995/877069], or 
[incorrect|https://stackoverflow.com/q/49110384/877069]).

Case #2 came up in SPARK-29102. There is a very handy library called 
[SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
load a single gzipped file using multiple concurrent tasks. (You can see the 
details of how it works and why it's useful in the project README and in 
SPARK-29102.)

To use this codec, I had to set {{io.compression.codecs}}. This is a Hadoop 
configuration property rather than a Spark one, which is why it [isn't 
documented by Spark|http://spark.apache.org/docs/latest/configuration.html]. 
Confusingly, there is also a setting called {{spark.io.compression.codec}}, 
which serves a different purpose: it selects the codec for Spark's internal 
data (e.g. shuffle output), not for input files.

It would be much clearer for the user and more consistent with the writer 
interface if the reader let you directly specify the codec.

For example:
{code:java}
spark.read.option('compression', 'lz4').csv(...)
spark.read.csv(..., compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
{code}


> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor