Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Alexander Pivovarov
Hi Neil

Yes, it helps! I no longer see _temporary in the console output, and
saveAsTextFile is fast now.

2015-09-02 23:07:00,022 INFO  [task-result-getter-0]
scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 18.0
in stage 0.0 (TID 18) in 4398 ms on ip-10-0-24-103.ec2.internal (1/24)

2015-09-02 23:07:01,887 INFO  [task-result-getter-2]
scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 5.0 in
stage 0.0 (TID 5) in 6282 ms on ip-10-0-26-14.ec2.internal (24/24)

2015-09-02 23:07:01,888 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 0
(saveAsTextFile at <console>:22) finished in 6.319 s

2015-09-02 23:07:02,123 INFO  [main] s3n.Jets3tNativeFileSystemStore
(Jets3tNativeFileSystemStore.java:storeFile(141)) - s3.putObject foo-bar
tmp/test40_141_24_406/_SUCCESS 0


Thank you!

On Wed, Sep 2, 2015 at 12:54 AM, Neil Jonkers wrote:

> Hi,
>
> Can you set the following parameters in your mapred-site.xml file please:
>
>
> <property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
>
> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
>
> You can also configure this at cluster launch time with the following
> classification via the EMR console:
>
>
> classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]
>
>
> Thank you

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Neil Jonkers
Hi,

Can you set the following parameters in your mapred-site.xml file please:

<property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
<property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>

You can also configure this at cluster launch time with the following
classification via the EMR console:

classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]
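
For a quick experiment without editing mapred-site.xml, the same two properties could be applied to the Hadoop configuration of a running session. This is only a sketch: the object name DirectOutput and its enable helper are hypothetical, and whether EMR honors these properties when they are set per session rather than in mapred-site.xml is an assumption that this thread does not confirm.

```scala
object DirectOutput {
  // Property names quoted from the message above.
  val keys: Seq[String] = Seq(
    "mapred.output.direct.EmrFileSystem",
    "mapred.output.direct.NativeS3FileSystem")

  /** Sets every key to "true" through the supplied setter function. */
  def enable(set: (String, String) => Unit): Unit =
    keys.foreach(k => set(k, "true"))
}

// In spark-shell, where `sc` is the predefined SparkContext (assumption:
// the properties are read from the job's Hadoop configuration at save time):
//   DirectOutput.enable((k, v) => sc.hadoopConfiguration.set(k, v))
```

Passing the setter as a function keeps the helper independent of Spark, so it can also be pointed at a SparkConf using keys prefixed with spark.hadoop., which Spark copies into the Hadoop configuration.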


Thank you

On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov 
wrote:

> I checked the previous EMR config (emr-3.8).
> Its mapred-site.xml has the following setting:
> <property>
> <name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
> </property>

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I checked the previous EMR config (emr-3.8).
Its mapred-site.xml has the following setting:

<property>
<name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>
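
To compare clusters without opening mapred-site.xml, the configured committer can also be read from the session's Hadoop configuration. A minimal sketch, assuming spark-shell with its predefined `sc`; the CommitterCheck object and its describe helper are hypothetical names introduced only for illustration.

```scala
object CommitterCheck {
  // Property name taken from the emr-3.8 setting quoted above.
  val Key = "mapred.output.committer.class"

  /** Looks up the committer class through `get`, which may return null
    * when the property is unset (as Hadoop's Configuration.get does). */
  def describe(get: String => String): String =
    Option(get(Key)) match {
      case Some(cls) => s"$Key = $cls"
      case None      => s"$Key is not set"
    }
}

// In spark-shell:
//   println(CommitterCheck.describe(k => sc.hadoopConfiguration.get(k)))
```

Running this on both an emr-3.8 and an emr-4.0.0 cluster would show whether the committer setting differs between them.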



On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov 
wrote:

> Should I use DirectOutputCommitter?
> spark.hadoop.mapred.output.committer.class
>  com.appsflyer.spark.DirectOutputCommitter

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
Should I use DirectOutputCommitter?
spark.hadoop.mapred.output.committer.class
 com.appsflyer.spark.DirectOutputCommitter



On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov 
wrote:

> I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.
>
> For some reason Spark's saveAsTextFile is very slow on EMR 4.0.0
> compared to EMR 3.8 (it took 5 sec before, now 95 sec).
>
> Actually, saveAsTextFile reports that it finished in 4.356 sec, but after
> that I see lots of INFO messages with 404 errors from the
> com.amazonaws.latency logger for the next 90 sec.

spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.

For some reason Spark's saveAsTextFile is very slow on EMR 4.0.0
compared to EMR 3.8 (it took 5 sec before, now 95 sec).

Actually, saveAsTextFile reports that it finished in 4.356 sec, but after
that I see lots of INFO messages with 404 errors from the
com.amazonaws.latency logger for the next 90 sec.
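
The stage and job timings in the scheduler log cover only the Spark job itself, so the extra time spent afterwards in the driver does not show up there. A wall-clock timer around the whole call is one way to measure both together. This is a minimal sketch; the Timing object and its timed helper are hypothetical names, not part of Spark.

```scala
object Timing {
  /** Runs `body`, prints the elapsed wall-clock time, and returns
    * the result together with the elapsed seconds. */
  def timed[T](label: String)(body: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label took $seconds%.3f s")
    (result, seconds)
  }
}

// In spark-shell (assumes the predefined `sc`; the path is the one
// used in this message):
//   Timing.timed("saveAsTextFile including commit") {
//     sc.parallelize(List.range(0, 160), 160)
//       .map(x => x + "\t" + "A" * 100)
//       .saveAsTextFile("s3n://foo-bar/tmp/test40_20")
//   }
```

Comparing this figure with the "Job ... finished" time in the log gives the time spent after the job completes.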

spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
"A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
(saveAsTextFile at <console>:22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all
completed, from pool
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
<console>:22, took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 3B2F06FD11682D22), S3 Extended Request ID:
C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
HttpClientSendRequestTime=[0.089],
2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 62C6B413965447FD), S3 Extended Request ID:
4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
HttpClientSendRequestTime=[0.068],
2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 4846575A1C373BB9), S3 Extended Request ID:
aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531],
HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434],
RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13],
2015-09-01 21:16:17,786 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_00
with recursive false
2015-09-01 21:16:17,798 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Cod