[ https://issues.apache.org/jira/browse/SPARK-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514303#comment-15514303 ]
Josh Rosen commented on SPARK-16830:
------------------------------------

Do you have stacktraces from the failed block fetches? I'd like to see whether this may be fixed by a recent patch of mine, which helps to avoid failures when all locations of non-shuffle blocks are lost or unavailable.

> Executors Keep Trying to Fetch Blocks from a Bad Host
> -----------------------------------------------------
>
>                 Key: SPARK-16830
>                 URL: https://issues.apache.org/jira/browse/SPARK-16830
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Streaming
>    Affects Versions: 1.6.2
>        Environment: EMR 4.7.2
>           Reporter: Renxia Wang
>
> When a host becomes unreachable, the driver removes the executors and block managers on that host because it stops receiving their heartbeats. However, executors on other hosts still keep trying to fetch blocks from the bad host.
> I am running a Spark Streaming job that consumes data from Kinesis. As a result of this block-fetch retrying and failing, I started seeing ProvisionedThroughputExceededException on shards, AmazonHttpClient (to Kinesis) SocketException, Kinesis ExpiredIteratorException, etc.
> This issue also exposes a potential memory leak. From the time the bad host became unreachable, the physical memory usage of the executors that kept trying to fetch blocks from it grew steadily until it hit the physical memory limit and the executors were killed by YARN.
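
[Editor's sketch, not part of the original thread: the retry behavior described above is bounded by a few standard Spark network settings. The property names below are real Spark configuration keys that existed in the 1.6 line; the values shown are the documented defaults and are illustrative only, not a suggested fix.]

    import org.apache.spark.SparkConf

    // Minimal sketch of the knobs that govern how long an executor keeps
    // re-attempting block fetches from an unreachable peer. Tightening these
    // shortens the retry storm but does not address the root cause (stale
    // block locations for a removed host).
    val conf = new SparkConf()
      .set("spark.shuffle.io.maxRetries", "3")  // fetch retries per connection (default 3)
      .set("spark.shuffle.io.retryWait", "5s")  // wait between retries (default 5s)
      .set("spark.network.timeout", "120s")     // overall network timeout (default 120s)

[With the defaults above, a single failed fetch can block for on the order of maxRetries * retryWait before giving up, which is consistent with the sustained retry traffic the reporter observed.]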