[jira] [Commented] (SPARK-31177) DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has non-".gz" extension

2023-09-24 Thread Mark Waddle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768464#comment-17768464
 ] 

Mark Waddle commented on SPARK-31177:
-

[~Minskya] the resolution is "incomplete", so I don't think it's fixed. I worked 
around it by renaming the files to end in a ".gz" extension.
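For anyone else hitting this, a minimal sketch of that workaround using boto3 (bucket and key names are hypothetical; S3 has no true rename, so it is a copy followed by a delete):

```python
def gz_key(key):
    """Return the key with a '.gz' suffix appended, if not already present."""
    return key if key.endswith(".gz") else key + ".gz"

def rename_to_gz(s3_client, bucket, key):
    """Copy s3://bucket/key to s3://bucket/key.gz, then remove the original."""
    new_key = gz_key(key)
    if new_key != key:
        s3_client.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=new_key,
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
    return new_key

# Example (requires boto3 and AWS credentials):
# import boto3
# rename_to_gz(boto3.client("s3"), "bucket", "large.csv")
# df = spark.read.csv("s3://bucket/large.csv.gz", header=True)
```

After the copy, Spark's suffix-based codec detection decompresses the file as expected.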

> DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has 
> non-".gz" extension
> --
>
> Key: SPARK-31177
> URL: https://issues.apache.org/jira/browse/SPARK-31177
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Mark Waddle
>Priority: Major
>  Labels: bulk-closed
>
> I have large CSV files that are gzipped and uploaded to S3 with 
> Content-Encoding=gzip. The files have the file extension ".csv", since most web 
> clients will automatically decompress the file based on the Content-Encoding 
> header. Using pyspark to read these CSV files does not mimic this behavior.
> Works as expected:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
> {code}
> Does not decompress, and tries to load the entire contents of the file as the 
> first row:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv', header=True)
> {code}
> It looks like Spark is relying on the file extension to determine whether the 
> file is gzip compressed. It would be great if S3 resources, and any other 
> HTTP-based resources, could consult the Content-Encoding response header as well.
> I tried to find the code that determines this, but I'm not familiar with the 
> code base. Any pointers would be helpful, and I can look into fixing it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31177) DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has non-".gz" extension

2020-03-17 Thread Mark Waddle (Jira)
Mark Waddle created SPARK-31177:
---

 Summary: DataFrameReader.csv incorrectly reads gzip encoded CSV 
from S3 when it has non-".gz" extension
 Key: SPARK-31177
 URL: https://issues.apache.org/jira/browse/SPARK-31177
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Mark Waddle


I have large CSV files that are gzipped and uploaded to S3 with 
Content-Encoding=gzip. The files have the file extension ".csv", since most web 
clients will automatically decompress the file based on the Content-Encoding 
header. Using pyspark to read these CSV files does not mimic this behavior.

Works as expected:
{code:java}
df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
{code}
Does not decompress, and tries to load the entire contents of the file as the first row:
{code:java}
df = spark.read.csv('s3://bucket/large.csv', header=True)
{code}

It looks like Spark is relying on the file extension to determine whether the 
file is gzip compressed. It would be great if S3 resources, and any other 
HTTP-based resources, could consult the Content-Encoding response header as well.

I tried to find the code that determines this, but I'm not familiar with the 
code base. Any pointers would be helpful, and I can look into fixing it.
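For context, Spark's text-based readers delegate codec selection to Hadoop's CompressionCodecFactory, which picks a codec purely from the filename suffix and never consults object metadata such as Content-Encoding. A rough Python sketch of that lookup (codec names are illustrative, not the actual Hadoop registry) shows why "large.csv" is read as raw bytes:

```python
# Suffix-to-codec table, mimicking CompressionCodecFactory's behavior:
# the longest matching registered suffix wins; no suffix means no codec.
CODEC_SUFFIXES = {
    ".gz": "GzipCodec",
    ".bz2": "BZip2Codec",
    ".snappy": "SnappyCodec",
    ".deflate": "DefaultCodec",
}

def codec_for(path):
    """Return the codec name implied by the path's suffix, or None."""
    # Check longer suffixes first, like CompressionCodecFactory.getCodec().
    for suffix in sorted(CODEC_SUFFIXES, key=len, reverse=True):
        if path.endswith(suffix):
            return CODEC_SUFFIXES[suffix]
    return None  # treated as plain, uncompressed text

print(codec_for("s3://bucket/large.csv.gz"))  # GzipCodec
print(codec_for("s3://bucket/large.csv"))     # None -> gzip bytes read as one "row"
```

Fixing this as requested would mean threading the S3 object's Content-Encoding metadata into (or around) this suffix lookup.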


