Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
Hi Neil,

Yes, it helps! I do not see _temporary in the console output anymore, and saveAsTextFile is fast now.

2015-09-02 23:07:00,022 INFO [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 18.0 in stage 0.0 (TID 18) in 4398 ms on ip-10-0-24-103.ec2.internal (1/24)
2015-09-02 23:07:01,887 INFO [task-result-getter-2] scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 5.0 in stage 0.0 (TID 5) in 6282 ms on ip-10-0-26-14.ec2.internal (24/24)
2015-09-02 23:07:01,888 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 0 (saveAsTextFile at :22) finished in 6.319 s
2015-09-02 23:07:02,123 INFO [main] s3n.Jets3tNativeFileSystemStore (Jets3tNativeFileSystemStore.java:storeFile(141)) - s3.putObject foo-bar tmp/test40_141_24_406/_SUCCESS 0

Thank you!

On Wed, Sep 2, 2015 at 12:54 AM, Neil Jonkers wrote:
> Hi,
>
> Can you set the following parameters in your mapred-site.xml file please:
>
> <property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
>
> You can also config this at cluster launch time with the following
> Classification via EMR console:
>
> classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]
>
> Thank you
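The improvement can be read straight off the log timestamps: with direct output enabled, the _SUCCESS marker is written almost immediately after the stage finishes. A minimal Python sketch of that comparison follows, using only the two timestamps from the log in this reply; the helper name `gap_seconds` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used by the log lines in this thread (milliseconds after the comma).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# "ResultStage 0 ... finished" and "s3.putObject ... _SUCCESS" from the log above.
stage_finished = "2015-09-02 23:07:01,888"
success_written = "2015-09-02 23:07:02,123"

print(f"gap after stage finished: {gap_seconds(stage_finished, success_written):.3f} s")
```

Before the change, the same gap was reported in this thread as roughly 90 seconds.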
Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
Hi,

Can you set the following parameters in your mapred-site.xml file, please:

    <property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
    <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>

You can also configure this at cluster launch time with the following classification via the EMR console:

    classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]

Thank you

On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov wrote:
> I checked previous emr config (emr-3.8)
> mapred-site.xml has the following setting
>
> <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>
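For anyone launching clusters from a script rather than the console, the same classification can be expressed as a JSON configuration document. The sketch below is a minimal Python illustration, assuming the EMR 4.x configuration format of a list of objects with "Classification" and "Properties" keys; the function name `mapred_site_classification` is hypothetical, and the two property names are the ones given in this message.

```python
import json


def mapred_site_classification(properties: dict) -> list:
    """Wrap Hadoop properties in a single mapred-site classification entry.

    Assumes the EMR 4.x configuration layout: a list of objects, each with
    a "Classification" name and a "Properties" map of string values.
    """
    return [{"Classification": "mapred-site", "Properties": dict(properties)}]


# The two settings suggested in this thread.
direct_output = {
    "mapred.output.direct.EmrFileSystem": "true",
    "mapred.output.direct.NativeS3FileSystem": "true",
}

print(json.dumps(mapred_site_classification(direct_output), indent=2))
```

The printed JSON could then be saved to a file and supplied at cluster creation; check the EMR documentation for your release before relying on it.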
Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
I checked the previous EMR config (emr-3.8). Its mapred-site.xml has the following setting:

    <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>

On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov wrote:
> Should I use DirectOutputCommitter?
> spark.hadoop.mapred.output.committer.class
> com.appsflyer.spark.DirectOutputCommitter
Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
Should I use DirectOutputCommitter?

    spark.hadoop.mapred.output.committer.class com.appsflyer.spark.DirectOutputCommitter

On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov wrote:
> I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.
>
> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
> comparison to emr 3.8 (was 5 sec, now 95 sec)
>
> Actually saveAsTextFile says that it's done in 4.356 sec but after that I
> see lots of INFO messages with 404 error from com.amazonaws.latency logger
> for next 90 sec
>
> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
spark 1.4.1 saveAsTextFile is slow on emr-4.0.0
I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.

For some reason Spark saveAsTextFile is very slow on emr 4.0.0 in comparison to emr 3.8 (was 5 sec, now 95 sec).

Actually, saveAsTextFile says that it's done in 4.356 sec, but after that I see lots of INFO messages with 404 errors from the com.amazonaws.latency logger for the next 90 sec.

spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" + "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5 (saveAsTextFile at :22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO [task-result-getter-2] cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all completed, from pool
2015-09-01 21:16:17,637 INFO [main] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at :22, took 4.547829 s
2015-09-01 21:16:17,638 INFO [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 3B2F06FD11682D22), S3 Extended Request ID: C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923], HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544], RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
2015-09-01 21:16:17,723 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927], HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81], RequestSigningTime=[0.209], ResponseProcessingTime=[17.97], HttpClientSendRequestTime=[0.089],
2015-09-01 21:16:17,756 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 62C6B413965447FD), S3 Extended Request ID: 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044], HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743], RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
2015-09-01 21:16:17,774 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724], HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728], RequestSigningTime=[0.148], ResponseProcessingTime=[0.155], HttpClientSendRequestTime=[0.068],
2015-09-01 21:16:17,786 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 4846575A1C373BB9), S3 Extended Request ID: aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531], HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434], RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13],
2015-09-01 21:16:17,786 INFO [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_00 with recursive false
2015-09-01 21:16:17,798 INFO [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Cod
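To see how much of the post-job time goes into these S3 round trips, the amazonaws.latency lines can be tallied by status code. The following is a rough Python sketch: the function name `summarize` is made up for illustration, the sample lines are shortened copies of three entries from the log above, and ClientExecuteTime is assumed to be reported in milliseconds.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"StatusCode=\[(\d+)\]")
CLIENT_TIME_RE = re.compile(r"ClientExecuteTime=\[([\d.]+)\]")


def summarize(lines):
    """Count S3 responses by status code and sum the client execute time.

    Only lines from the amazonaws.latency logger are considered.
    Returns (Counter of status codes, total ClientExecuteTime).
    """
    counts = Counter()
    total_time = 0.0
    for line in lines:
        if "amazonaws.latency" not in line:
            continue
        status = STATUS_RE.search(line)
        if status:
            counts[status.group(1)] += 1
        elapsed = CLIENT_TIME_RE.search(line)
        if elapsed:
            total_time += float(elapsed.group(1))
    return counts, total_time


# Shortened copies of three log entries from this message.
sample = [
    "2015-09-01 21:16:17,651 INFO [main] amazonaws.latency - StatusCode=[404], ClientExecuteTime=[11.923],",
    "2015-09-01 21:16:17,723 INFO [main] amazonaws.latency - StatusCode=[200], ClientExecuteTime=[71.927],",
    "2015-09-01 21:16:17,756 INFO [main] amazonaws.latency - StatusCode=[404], ClientExecuteTime=[11.044],",
]

counts, total_time = summarize(sample)
print(dict(counts), f"total ClientExecuteTime: {total_time:.3f}")
```

Run against the full driver log, a large number of 404s on paths under _temporary would point to the commit phase listing and renaming task output on S3.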