Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option."
Are you talking about the hadoop-aws lib or Hadoop itself? I see that Spark
is currently only pre-built against Hadoop 2.7. Most of our failures are on
write; the other fix I've seen advertised has been
"fileoutputcommitter.algorithm.version=2". Still doing some reading and will
start testing in the next day or so. Thanks!

Gary

On 17 May 2017 at 03:19, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote:
>
> Steve, thanks for the reply. Digging through all the documentation now.
> Much appreciated!
>
>
> FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is
> way better on high-performance reads, especially if you are working with
> column data and can set the fs.s3a.experimental.fadvise=random option.
>
> That's in Apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest
> versions of CDH, even if their docs don't mention it.
>
> https://hortonworks.github.io/hdp-aws/s3-performance/
> https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
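For anyone wanting to wire the two options discussed above into a PySpark
job, here is a minimal sketch, not a drop-in fix. The app name and bucket
path are hypothetical; the spark.hadoop.* prefix is how Spark forwards
properties into the Hadoop configuration, and fadvise=random only takes
effect with a Hadoop 2.8+ S3A client:

    # Minimal sketch: where the fadvise and committer-version settings go.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-tuning-sketch")  # hypothetical app name
        # Random-access reads for seek-heavy formats such as Parquet
        # (Hadoop 2.8+ S3A only; older clients ignore this property).
        .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
        # The v2 output committer skips the slow rename-based commit, but
        # per the committer discussion below it is still not a substitute
        # for an S3-aware committer.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://some-bucket/some/path/")  # hypothetical path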
On 16 May 2017 at 10:10, Steve Loughran <ste...@hortonworks.com> wrote:

>>
>> On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote:
>>
>> Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
>> intermediate steps and final output of parquet files.
>>
>>
>> Please don't, not without a committer specially written to work against
>> S3 in the presence of failures. You are at risk of things going wrong
>> without even noticing.
>>
>> The only one that I trust to do this right now is:
>> https://github.com/rdblue/s3committer
>>
>> See also:
>> https://github.com/apache/spark/blob/master/docs/cloud-integration.md
>>
>>
>> We're running into the following issues on a semi-regular basis:
>> * These are intermittent errors, i.e. we have about 300 jobs that run
>> nightly, and a fairly random but small-ish percentage of them fail with
>> the following classes of errors.
>>
>> *S3 write errors*
>>
>>> "ERROR Utils: Aborting task
>>> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>>> AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
>>> Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>>
>>> "Py4JJavaError: An error occurred while calling o43.parquet.
>>> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>>> Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>>> Error Message: One or more objects could not be deleted, S3 Extended
>>> Request ID: null"
>>
>> *S3 Read Errors:*
>>
>>> [Stage 1:=================================================> (27 + 4) / 31]
>>> 17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
>>> java.net.SocketException: Connection reset
>>> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>>> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>> at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>>> at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>>> at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>>> at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>>> at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
>>> at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>>> at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>>> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>>
>> We have literally tons of logs we can add, but they would make the email
>> unwieldy. If it would be helpful I'll drop them in a pastebin or
>> something.
>>
>> Our config is along the lines of:
>>
>> - spark-2.1.0-bin-hadoop2.7
>> - '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
>>
>>
>> You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't
>> ready to play with. In particular, in a close() call it reads to the end
>> of the stream, which is a performance killer on large files. That stack
>> trace you see is from that same phase of operation, so it should go away
>> too.
>>
>> Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one
>> will probably cause link errors.
>> http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3
>>
>> Also: make sure Joda Time >= 2.8.1 for Java 8.
>>
>> If you go up to 2.8.0 and you still see the errors, file something
>> against HADOOP in JIRA.
>>
>>
>> Given the Stack Overflow / googling I've been doing, I know we're not
>> the only org with these issues, but I haven't found a good set of
>> solutions in those spaces yet.
>>
>> Thanks!
>>
>> Gary Lucas
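Following Steve's version-alignment advice, a hedged sketch of a matching
package set for PySpark. hadoop-aws 2.7.3 is built against aws-java-sdk
1.7.4, per the thread; the specific Joda Time release below is an
assumption (anything >= 2.8.1 should satisfy the Java 8 requirement):

    # Sketch only: pin hadoop-aws and the AWS SDK to versions that were
    # built and tested together, instead of mixing 2.6.0 with a newer SDK.
    import os

    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages "
        "com.amazonaws:aws-java-sdk:1.7.4,"
        "org.apache.hadoop:hadoop-aws:2.7.3,"
        "joda-time:joda-time:2.9.4 "  # assumption: any release >= 2.8.1 works
        "pyspark-shell"
    )
    # Must run before the first SparkContext / SparkSession is created,
    # or the packages won't be on the classpath.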
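And a quick way to confirm which Hadoop version actually ended up on the
driver's classpath, since mismatched hadoop-aws / hadoop-common JARs are a
common source of the link errors described above. This goes through
Spark's internal _jvm py4j gateway, so treat it as a debugging aid rather
than a stable API, and it assumes a running SparkSession named spark:

    # Debugging aid: report the Hadoop version the driver JVM loaded.
    hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
    print("Hadoop on classpath: " + hadoop_version)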