Hi all,
My team has been hitting a recurring, unpredictable bug where only a partial
write is performed for one partition of our Dataset when writing CSV to S3.
For example, in a Dataset of 10 partitions written as CSV to S3, we might see
9 of the partition files at 2.8 GB each, but one at only 1.6 GB. Despite this,
the job does not exit with an error code.

This becomes problematic in the following ways:
1. When we copy the data to Redshift, we get a bad decrypt error on the
partial file, suggesting that the write stopped at an arbitrary byte offset
within the file.
2. We lose data - sometimes as much as 10%.

We don't see this problem with Parquet output, which we also use, but moving
all of our data to Parquet is not currently feasible. We're using the Java
API with Spark 2.2 on Amazon EMR 5.8, and the code is as simple as this:
df.write().csv("s3://some-bucket/some_location"). We hit the issue 1-3 times
per week on a daily job and have been unable to reliably reproduce it.
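
For context, here is a minimal, self-contained sketch of the job. The
SparkSession setup and the source table are placeholders, not our actual
pipeline; the write call at the end is the real one:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DailyCsvExport {
        public static void main(String[] args) {
            // Placeholder setup; the real job is configured by EMR.
            SparkSession spark = SparkSession.builder()
                    .appName("daily-csv-export")   // hypothetical app name
                    .getOrCreate();

            // Hypothetical source standing in for our real Dataset.
            Dataset<Row> df = spark.table("some_database.some_table");

            // The write itself: default options, one CSV file per partition,
            // written directly to S3.
            df.write().csv("s3://some-bucket/some_location");

            spark.stop();
        }
    }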

Any thoughts on why we might be seeing this and how to resolve?
Thanks in advance.


