Re: Spark will process _temporary folder on S3 is very slow and always cause failure

Aaron Davidson Tue, 17 Mar 2015 12:10:22 -0700

Actually, this is the more relevant JIRA (which is resolved):
https://issues.apache.org/jira/browse/SPARK-3595


SPARK-6352 is about saveAsParquetFile, which is not in use here.

Here is a DirectOutputCommitter implementation:
https://gist.github.com/aarondav/c513916e72101bbe14ec

and it can be configured in Spark with:
sparkConf.set("spark.hadoop.mapred.output.committer.class",
classOf[DirectOutputCommitter].getName)
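
For readers who cannot open the gist, the idea can be sketched roughly as below. This is a minimal sketch and not the code from the gist. It assumes the old org.apache.hadoop.mapred API (the one saveAsTextFile goes through) and Hadoop 2.x plus Spark on the classpath. The class body, the object name DirectCommitterExample, the app name and the use of args(0) as the output path are all hypothetical. As far as I understand it, when the configured committer is not a FileOutputCommitter, tasks write straight to the final output path, so there is nothing left to rename at commit time.

```scala
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch (not the gist's code): every commit step is a no-op,
// on the assumption that tasks already wrote to the final output location.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  // Nothing sits in a _temporary folder, so there is nothing to move.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}

object DirectCommitterExample {
  def main(args: Array[String]): Unit = {
    // The property must be set before the SparkContext is created.
    // The master URL is expected to come from spark-submit.
    val sparkConf = new SparkConf()
      .setAppName("direct-committer-example") // hypothetical name
      .set("spark.hadoop.mapred.output.committer.class",
        classOf[DirectOutputCommitter].getName)
    val sc = new SparkContext(sparkConf)
    try {
      // args(0) is a placeholder for the output path, e.g. an s3n:// URI.
      sc.parallelize(Seq("a", "b", "c")).saveAsTextFile(args(0))
    } finally {
      sc.stop()
    }
  }
}
```

The trade-off is that a failed or speculative task can leave partial files in the output directory, since nothing is staged and promoted anymore, so this is probably best kept for jobs where that is acceptable.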

On Tue, Mar 17, 2015 at 8:05 AM, Imran Rashid <iras...@cloudera.com> wrote:

> I'm not super familiar with S3, but I think the issue is that you want to
> use a different output committer with "object" stores, which don't have a
> simple move operation.  There have been a few other threads on S3 and
> output committers.  I think the most relevant one for you is probably this
> open JIRA:
>
> https://issues.apache.org/jira/browse/SPARK-6352
>
> On Fri, Mar 13, 2015 at 5:51 PM, Shuai Zheng <szheng.c...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>
>>
>> I am trying to run a sort on an r3.2xlarge instance on AWS, as a
>> single-node cluster for testing. The data to sort is around 4 GB and
>> sits on S3; the output will also go to S3.
>>
>>
>>
>> I just connect spark-shell to the local cluster and run the code in the
>> script (because I just want a benchmark now).
>>
>>
>>
>> My job is as simple as:
>>
>> val parquetFile =
>> sqlContext.parquetFile("s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,s3n://...,")
>>
>> parquetFile.registerTempTable("Test")
>>
>> val sortedResult = sqlContext.sql("SELECT * FROM Test order by time").map { row =>
>>   row.mkString("\t")
>> }
>>
>> sortedResult.saveAsTextFile("s3n://myplace,");
>>
>>
>>
>> The sort takes around 6 minutes to finish while I monitor the process.
>> After that, I notice the process stops at:
>>
>>
>>
>> 15/03/13 22:38:27 INFO DAGScheduler: Job 2 finished: saveAsTextFile at
>> <console>:31, took 581.304992 s
>>
>>
>>
>> At that point, Spark has actually just written all the data to the
>> _temporary folder. After all sub-tasks have finished, it tries to move
>> all the finished results from the _temporary folder to the final
>> location. This step might be quick locally (because it is just a
>> cut/paste), but it looks very slow on S3: it takes a few seconds to move
>> one file (usually there are 200 partitions). It then raises exceptions
>> after moving maybe 40-50 files.
>>
>>
>>
>> org.apache.http.NoHttpResponseException: The target server failed to respond
>>         at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101)
>>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
>>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:281)
>>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
>>         at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:219)
>>
>>
>>
>>
>>
>> I have tried several times, but the full job never finishes. I am not
>> sure whether anything is wrong here: I am using something very basic, and
>> I can see that the job has finished and all results are on S3 under the
>> _temporary folder, but then it raises the exception and fails.
>>
>>
>>
>> Is there any special setting I should use when dealing with S3?
>>
>>
>>
>> I don’t know what the issue is here. I have never seen MapReduce have a
>> similar issue, so it should not be S3’s problem.
>>
>>
>>
>> Regards,
>>
>>
>>
>> Shuai
>>
>
>
