Re: Spark interrupts S3 request backoff
+1 on the previous guess. Additionally, I suggest reproducing it with vanilla Spark. Amazon's Spark contains modifications that are not available in vanilla Spark, which makes problem hunting hard or impossible. In such a case, Amazon can help...

On Tue, Apr 14, 2020 at 11:20 AM ZHANG Wei wrote:
> I will make a guess: it's not interrupted, it's killed by the driver or
> the resource manager because the executor has been sleeping for a long time.
>
> You may have to find the root cause in the driver and failed executor log
> contexts.
>
> --
> Cheers,
> -z
Re: Spark interrupts S3 request backoff
I will make a guess: it's not interrupted, it's killed by the driver or the resource manager because the executor has been sleeping for a long time.

You may have to find the root cause in the driver and failed executor log contexts.

--
Cheers,
-z

From: Lian Jiang
Sent: Monday, April 13, 2020 10:43
To: user
Subject: Spark interrupts S3 request backoff
Spark interrupts S3 request backoff
Hi,

My Spark job failed when reading Parquet files from S3 due to 503 Slow Down errors. According to https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html, I can use backoff to mitigate this issue. However, Spark seems to interrupt the backoff sleep (see "sleep interrupted"). Is there a way (e.g. some settings) to make Spark not interrupt the backoff? I would appreciate any hints.

20/04/12 20:15:37 WARN TaskSetManager: Lost task 3347.0 in stage 155.0 (TID 128138, ip-100-101-44-35.us-west-2.compute.internal, executor 34): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://mybucket/myprefix/part-00178-d0a0d51f-f98e-4b9d-8d00-bb3b9acd9a47-c000.snappy.parquet, range: 0-19231, partition values: [empty row], isDataPresent: false
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:248)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:172)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: CECE220993AE7F89; S3 Extended Request ID: UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=), S3 Extended Request ID: UlQe4dEuBR1YWJUthSlrbV9phyqxUNHQEw7tsJ5zu+oNIH+nGlGHfAv7EKkQRUVP8tw8x918A4Y=
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4926)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4872)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3
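
For readers unfamiliar with the backoff mentioned in the question, here is a minimal sketch of retrying with exponential backoff and jitter. It is a generic illustration in Python, not the retry logic used by EMRFS or the AWS SDK seen in the trace. `SlowDownError`, `call_with_backoff`, and all parameter names and defaults are hypothetical and chosen only for this example.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 "503 Slow Down" response."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call request(), retrying on SlowDownError with capped exponential backoff.

    sleep and rng are injectable so the behaviour can be checked without
    real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same moment.
            sleep(rng() * ceiling)
```

A caller would wrap the throttled operation, for example `call_with_backoff(lambda: read_object(key))`, where `read_object` is whatever function issues the request (also hypothetical here). Note that a sketch like this only spaces out retries. It does nothing to stop the surrounding task from being killed while it waits, which is the behaviour the replies in this thread suspect.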