[
https://issues.apache.org/jira/browse/SPARK-31177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768464#comment-17768464
]
Mark Waddle commented on SPARK-31177:
-------------------------------------
[~Minskya] the resolution is "Incomplete", so I don't think it's fixed. I worked
around it by renaming the files to end in a .gz extension.
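For reference, a minimal sketch of why the rename works (hypothetical helper names, local files standing in for S3 objects): a gzip stream always begins with the magic bytes 0x1f 0x8b, so the content is identifiable regardless of the key name, but since the codec lookup appears to go by extension, appending ".gz" to the key is enough to make the reader decompress it.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def is_gzipped(path):
    """Detect gzip by content rather than by file extension."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

def gz_name(path):
    """Target name for the rename workaround: append .gz only if missing."""
    return path if path.endswith(".gz") else path + ".gz"

# Demo: a gzip-compressed CSV saved without the .gz extension,
# mimicking the S3 objects described in this issue.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "large.csv")
with gzip.open(path, "wt") as f:
    f.write("id,name\n1,a\n")

print(is_gzipped(path))  # True: the content is gzip even though the name says .csv
print(gz_name(path))     # ends in .csv.gz, which the extension-based lookup recognizes
```

On S3 the rename itself would be a copy to the new key plus a delete of the old one, since S3 objects cannot be renamed in place.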
> DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has
> non-".gz" extension
> --
>
> Key: SPARK-31177
> URL: https://issues.apache.org/jira/browse/SPARK-31177
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.4.4
> Reporter: Mark Waddle
> Priority: Major
> Labels: bulk-closed
>
> I have large CSV files that are gzipped and uploaded to S3 with
> Content-Encoding=gzip. The files have the file extension ".csv", as most web
> clients will automatically decompress the file based on the Content-Encoding
> header. Using pyspark to read these CSV files does not mimic this behavior.
> works as expected:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
> {code}
> does not decompress, and tries to load the entire contents of the file as the
> first row:
> {code:java}
> df = spark.read.csv('s3://bucket/large.csv', header=True)
> {code}
> It looks like it's relying on the file extension to determine whether the file
> is gzip compressed. It would be great if S3 resources, and any other HTTP-based
> resources, could consult the Content-Encoding response header as well.
> I tried to find the code that determines this, but I'm not familiar with the
> code base. Any pointers would be helpful, and I can look into fixing it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org