[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics
[ https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421327#comment-15421327 ]

Nezih Yigitbasi commented on SPARK-16158:
-----------------------------------------

[~andrewor14] [~rxin] how do you guys feel about this?

> Support pluggable dynamic allocation heuristics
> -----------------------------------------------
>
>                 Key: SPARK-16158
>                 URL: https://issues.apache.org/jira/browse/SPARK-16158
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supported plugging in custom dynamic allocation
> heuristics. This would be useful for experimenting with new heuristics and
> for plugging in different heuristics per job, etc.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics
[ https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346744#comment-15346744 ]

Nezih Yigitbasi commented on SPARK-16158:
-----------------------------------------

Thanks [~sowen] for your input, I understand your concern. Our end goal is to experiment with different heuristics and provide some of them out of the box, so that users can pick one depending on their workload, what they want to optimize, etc. This is really the first step of that investigation.

The main reason we want to explore new heuristics is a couple of shortcomings of the default heuristic that we noticed with some of our jobs. First, if a job has short tasks (say a few hundred ms), the exponential ramp-up logic leaves a large number of executors idle: by the time the containers are allocated, the tasks are already done. Second, at stage boundaries, if there is a straggler (not uncommon for the complex production jobs we run), the default heuristic kills all the executors because most of them are idle, and when the stage finishes it takes some time to ramp back up to a decent capacity. So the ramp-down/decay process should be more "gentle".

I wonder what other users/committers think, too. It would be super helpful for this discussion if other users could share their production experience with the default dynamic allocation heuristic.
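The short-task problem described in the comment above can be seen in a toy model. The sketch below is a simplified, illustrative simulation, not Spark's actual ExecutorAllocationManager logic: the executor target doubles each scheduler interval while tasks are backlogged, but a requested container only becomes usable after an allocation delay, so for very short tasks most executors arrive after the work is gone. All names and numbers here are made up for illustration.

```python
def simulate(num_tasks, alloc_delay, horizon=100):
    """One task per executor per interval. Returns (finish_interval,
    executors_alive_at_finish, tasks_run_in_the_final_interval)."""
    target, executors, outstanding = 1, 0, 0
    arrivals = {}                      # interval -> executors delivered then
    remaining = num_tasks
    for t in range(horizon):
        got = arrivals.pop(t, 0)       # containers requested alloc_delay ago
        executors += got
        outstanding -= got
        done = min(remaining, executors)
        remaining -= done
        if remaining == 0:
            return t, executors, done
        target = min(remaining, target * 2)    # exponential ramp-up
        need = target - executors - outstanding
        if need > 0:                   # request more containers
            arrivals[t + alloc_delay] = arrivals.get(t + alloc_delay, 0) + need
            outstanding += need
    raise RuntimeError("did not finish within horizon")

# With a 3-interval allocation delay, 14 executors are alive by the time
# only 2 tasks remain -- most of them arrived too late to do any work.
print(simulate(16, 3))   # (6, 14, 2)
```

A pluggable heuristic could, for example, cap the target by the expected remaining work rather than doubling unconditionally; the value of the JIRA's proposal is that such policies could be swapped per job.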
[jira] [Created] (SPARK-16158) Support pluggable dynamic allocation heuristics
Nezih Yigitbasi created SPARK-16158:
------------------------------------

             Summary: Support pluggable dynamic allocation heuristics
                 Key: SPARK-16158
                 URL: https://issues.apache.org/jira/browse/SPARK-16158
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Nezih Yigitbasi

It would be nice if Spark supported plugging in custom dynamic allocation heuristics. This would be useful for experimenting with new heuristics and for plugging in different heuristics per job, etc.
[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334290#comment-15334290 ]

Nezih Yigitbasi commented on SPARK-15782:
-----------------------------------------

Just opened a new one. I was busy with testing the PR on both 2.10 and 2.11.

> --packages doesn't work with the spark-shell
> --------------------------------------------
>
>                 Key: SPARK-15782
>                 URL: https://issues.apache.org/jira/browse/SPARK-15782
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.0
>            Reporter: Nezih Yigitbasi
>            Assignee: Nezih Yigitbasi
>            Priority: Blocker
>             Fix For: 2.0.0
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those
> packages cannot be found, which I think is due to some of the changes in
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the
> {{spark.jars}} system property in client mode, which is used by the repl main
> class to set the classpath.
[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332797#comment-15332797 ]

Nezih Yigitbasi commented on SPARK-15782:
-----------------------------------------

Reopened; will submit a PR including Marcelo's fix on top of mine.
[jira] [Reopened] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nezih Yigitbasi reopened SPARK-15782:
-------------------------------------
[jira] [Updated] (SPARK-15782) --packages doesn't work with the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nezih Yigitbasi updated SPARK-15782:
------------------------------------
    Summary: --packages doesn't work with the spark-shell  (was: --packages doesn't work the spark-shell)
[jira] [Created] (SPARK-15782) --packages doesn't work the spark-shell
Nezih Yigitbasi created SPARK-15782:
------------------------------------

             Summary: --packages doesn't work the spark-shell
                 Key: SPARK-15782
                 URL: https://issues.apache.org/jira/browse/SPARK-15782
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Nezih Yigitbasi

When {{--packages}} is specified with {{spark-shell}} the classes from those packages cannot be found, which I think is due to some of the changes in {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the {{spark.jars}} system property in client mode, which is used by the repl main class to set the classpath.
[jira] [Created] (SPARK-14042) Add support for custom coalescers
Nezih Yigitbasi created SPARK-14042:
------------------------------------

             Summary: Add support for custom coalescers
                 Key: SPARK-14042
                 URL: https://issues.apache.org/jira/browse/SPARK-14042
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
            Reporter: Nezih Yigitbasi

Per our discussion on the mailing list (please see [here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E]), it would be nice to be able to specify a custom coalescing policy, as the current {{coalesce()}} method only lets the user specify the number of partitions, so we cannot really control much. The need for this feature popped up when I wanted to merge small files by coalescing them by size.
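The "coalesce by size" policy mentioned above can be sketched outside Spark. This is a hypothetical illustration of the grouping decision such a custom coalescer would make (a greedy first-fit packing of partitions into groups under a target byte size); the function name and the strategy are assumptions for the sketch, not the API proposed in the JIRA.

```python
def coalesce_by_size(partition_sizes, target_bytes):
    """Greedy first-fit: group partition indices so that each merged
    partition's total size stays at or under target_bytes."""
    groups = []                              # list of [total_size, [indices]]
    for idx, size in enumerate(partition_sizes):
        for group in groups:
            if group[0] + size <= target_bytes:
                group[0] += size             # partition fits in this group
                group[1].append(idx)
                break
        else:
            groups.append([size, [idx]])     # start a new merged partition
    return [indices for _, indices in groups]

# Five small-file partitions packed into two ~100-byte merged partitions.
print(coalesce_by_size([10, 20, 90, 30, 40], 100))   # [[0, 1, 3, 4], [2]]
```

In a real implementation the grouping would also have to weigh locality, which is part of why a pluggable policy (rather than a single built-in one) is attractive.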
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200543#comment-15200543 ]

Nezih Yigitbasi commented on SPARK-11293:
-----------------------------------------

[~joshrosen] any plans to fix this? I believe we are hitting this issue with {{1.6.0}}

{code}
16/03/17 21:38:38 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.shuffle.sort.ShuffleExternalSorter@6ecba8f1: 32.0 KB
16/03/17 21:38:38 INFO memory.TaskMemoryManager: 1528015093 bytes of memory were used by task 103134 but are not associated with specific consumers
16/03/17 21:38:38 INFO memory.TaskMemoryManager: 1528047861 bytes of memory are used for execution and 80608434 bytes of memory are used for storage
16/03/17 21:38:38 ERROR executor.Executor: Managed memory leak detected; size = 1528015093 bytes, TID = 103134
16/03/17 21:38:38 ERROR executor.Executor: Exception in task 448.0 in stage 273.0 (TID 103134)
java.lang.OutOfMemoryError: Unable to acquire 128 bytes of memory, got 0
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/03/17 21:38:38 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Unable to acquire 128 bytes of memory, got 0
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/03/17 21:38:38 INFO storage.DiskBlockManager: Shutdown hook called
16/03/17 21:38:38 INFO util.ShutdownHookManager: Shutdown hook called
{code}

> Spillable collections leak shuffle memory
> -----------------------------------------
>
>                 Key: SPARK-11293
>                 URL: https://issues.apache.org/jira/browse/SPARK-11293
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory
> manager consolidation patch, which added the ability to do strict memory leak
> detection for the bookkeeping that used to be performed by the
> ShuffleMemoryManager. This uncovered a handful of places where tasks can
> acquire execution/shuffle memory but never release it, starving themselves of
> memory.
>
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing
> its resources.
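The {{CompletionIterator}} pattern named in the issue's fix list can be sketched generically: wrap an iterator so that a cleanup callback (e.g. releasing a sorter's memory) runs exactly once when the wrapped iterator is exhausted. This is a Python stand-in for Spark's Scala utility of the same name; the class and callback names here are illustrative, not Spark's code.

```python
class CompletionIterator:
    """Iterator wrapper that invokes on_complete exactly once at exhaustion."""

    def __init__(self, it, on_complete):
        self._it = iter(it)
        self._on_complete = on_complete
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            if not self._done:          # fire the cleanup only once
                self._done = True
                self._on_complete()     # e.g. release shuffle memory
            raise

# Usage: the callback runs when the consumer drains the iterator, so the
# resource is released even though the consumer never calls stop() itself.
released = []
records = CompletionIterator([1, 2, 3], lambda: released.append(True))
print(list(records), released)   # [1, 2, 3] [True]
```

The key property, and why the issue recommends it for BlockStoreShuffleReader, is that cleanup is tied to consumption of the data rather than to a call site that might be skipped.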
[jira] [Commented] (SPARK-12221) Add CPU time metric to TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176697#comment-15176697 ]

Nezih Yigitbasi commented on SPARK-12221:
-----------------------------------------

Any plans to get this in?

> Add CPU time metric to TaskMetrics
> ----------------------------------
>
>                 Key: SPARK-12221
>                 URL: https://issues.apache.org/jira/browse/SPARK-12221
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 1.5.2
>            Reporter: Jisoo Kim
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to have one
> so I can retrieve the metric from History Server.
[jira] [Updated] (SPARK-13328) Possible Poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nezih Yigitbasi updated SPARK-13328:
------------------------------------
    Summary: Possible Poor read performance for broadcast variables with dynamic resource allocation  (was: Poor read performance for broadcast variables with dynamic resource allocation)

> Possible Poor read performance for broadcast variables with dynamic resource allocation
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-13328
>                 URL: https://issues.apache.org/jira/browse/SPARK-13328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from
> removed executors was causing job failures; SPARK-9591 fixed this by trying
> all locations of a block before giving up. However, the locations of a block
> are retrieved only once from the driver in this process, and entries in this
> list can be stale due to dynamic resource allocation. The situation gets worse
> on a large cluster, where the location list can contain several hundred
> entries, tens of which may be stale. What we have observed is that with the
> default settings of 3 max retries and 5s between retries (that's 15s per
> location), the time it takes to read a broadcast variable can be as high as
> ~17m (the log below shows the failed 70th block fetch attempt, where each
> attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms
> {code}
[jira] [Updated] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nezih Yigitbasi updated SPARK-13328:
------------------------------------
    Summary: Possible poor read performance for broadcast variables with dynamic resource allocation  (was: Possible Poor read performance for broadcast variables with dynamic resource allocation)
[jira] [Comment Edited] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896 ]

Nezih Yigitbasi edited comment on SPARK-13328 at 2/15/16 11:30 PM:
-------------------------------------------------------------------

Although this long time can be reduced by decreasing the values of the {{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, it may not be desirable to reduce the number of retries globally, and reducing the retry wait may increase the load on the serving block manager. I already have a fix that adds a new config parameter, {{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh the list of block locations from the driver while going through all these locations. In my fix this parameter is honored only when dynamic allocation is enabled, and its default value is Int.MaxValue so that behavior doesn't change even when dynamic allocation is enabled (as refreshing the locations may not be necessary in small clusters). If you think such a fix is valuable I will be happy to create a PR.

was (Author: nezihyigitbasi):
Although this long time can be reduced by decreasing the values of the {{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, it may not be desirable to reduce the number of retries globally, and reducing the retry wait may increase the load on the serving block manager. I already have a fix that adds a new config parameter, {{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh the list of block locations from the driver while going through all these locations. In my fix this parameter is honored only when dynamic allocation is enabled, and its default value is Int.MaxValue so that behavior doesn't change even when dynamic allocation is enabled (as refreshing the locations may not be necessary in small clusters).
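The proposed fix can be sketched as a fetch loop: walk the (possibly stale) location list, but after every N consecutive failures refresh the list from the driver instead of grinding through every stale entry. This is an illustrative model of the idea behind the proposed {{spark.block.failures.beforeLocationRefresh}} parameter, not Spark's BlockManager code; `fetch` and `get_locations` are hypothetical stand-ins for the block-manager calls.

```python
def fetch_with_refresh(block_id, get_locations, fetch, failures_before_refresh):
    """Try locations in order; after `failures_before_refresh` consecutive
    failures, re-fetch the location list (dropping stale entries)."""
    locations = list(get_locations(block_id))
    failures = 0
    while locations:
        loc = locations.pop(0)
        result = fetch(block_id, loc)
        if result is not None:
            return result
        failures += 1
        if failures >= failures_before_refresh:
            locations = list(get_locations(block_id))  # refresh from driver
            failures = 0
    raise IOError("could not fetch block %s" % block_id)

# Usage sketch: the first location list has three stale entries; a refresh
# after two failures jumps straight to the live replica instead of timing
# out on every stale one.
calls = {"n": 0}

def get_locations(block_id):
    calls["n"] += 1
    return ["stale1", "stale2", "stale3"] if calls["n"] == 1 else ["live"]

def fetch(block_id, loc):
    return b"data" if loc == "live" else None

print(fetch_with_refresh("broadcast_18_piece0", get_locations, fetch, 2))
```

With the default of Int.MaxValue the refresh never triggers, which matches the comment's goal of leaving existing behavior untouched unless a user opts in.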
[jira] [Commented] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896 ]

Nezih Yigitbasi commented on SPARK-13328:
-----------------------------------------

Although this long time can be reduced by decreasing the values of the {{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, it may not be desirable to reduce the number of retries globally, and reducing the retry wait may increase the load on the serving block manager. I already have a fix that adds a new config parameter, {{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh the list of block locations from the driver while going through all these locations. In my fix this parameter is honored only when dynamic allocation is enabled, and its default value is Int.MaxValue so that behavior doesn't change even when dynamic allocation is enabled (as refreshing the locations may not be necessary in small clusters).
[jira] [Created] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation
Nezih Yigitbasi created SPARK-13328:
------------------------------------

             Summary: Poor read performance for broadcast variables with dynamic resource allocation
                 Key: SPARK-13328
                 URL: https://issues.apache.org/jira/browse/SPARK-13328
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.2
            Reporter: Nezih Yigitbasi

When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures; SPARK-9591 fixed this by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and entries in this list can be stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (the log below shows the failed 70th block fetch attempt, where each attempt takes 15s):

{code}
...
16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) (failed attempt 70)
...
16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 18 took 1051049 ms
{code}
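The numbers in the report above check out with a quick back-of-the-envelope calculation: at the default {{spark.shuffle.io.maxRetries}}=3 and {{spark.shuffle.io.retryWait}}=5s, each location costs 3 × 5s = 15s, so 70 failed location attempts imply about 70 × 15s of waiting, consistent with the logged 1051049 ms (~17.5 minutes) broadcast read time.

```python
# Worst-case wait implied by the default retry settings in the report.
max_retries = 3            # spark.shuffle.io.maxRetries default
retry_wait_s = 5           # spark.shuffle.io.retryWait default (seconds)

per_location_s = max_retries * retry_wait_s     # 15 s burned per stale location
failed_attempts = 70                            # from the logged attempt counter
total_s = failed_attempts * per_location_s      # 1050 s

print(per_location_s, total_s, total_s / 60)    # 15 1050 17.5
```

The computed 1050 s is within a second of the logged 1051049 ms, so essentially the entire read time was retry waits on stale locations.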
[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143328#comment-15143328 ]

Nezih Yigitbasi commented on SPARK-8592:
----------------------------------------

We still see this problem with 1.5.2

{code}
16/02/11 07:55:29 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.148.8.235:55443/user/CoarseGrainedScheduler
java.lang.NullPointerException
	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:281)
	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:281)
	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:322)
	at java.lang.String.valueOf(String.java:2982)
	at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:138)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:138)
	at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:76)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:138)
	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

> CoarseGrainedExecutorBackend: Cannot register with driver => NPE
> ----------------------------------------------------------------
>
>                 Key: SPARK-8592
>                 URL: https://issues.apache.org/jira/browse/SPARK-8592
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04, Scala 2.11, Java 8,
>            Reporter: Sjoerd Mulder
>            Assignee: Xu Chen
>            Priority: Minor
>             Fix For: 1.5.0
>
> I cannot reproduce this consistently but when submitting jobs just after
> another finished it will not come up:
> {code}
> 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
> java.lang.NullPointerException
> 	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute
[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE

[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143330#comment-15143330 ]

Nezih Yigitbasi commented on SPARK-8592:
----------------------------------------

Any ideas [~joshrosen]?

> CoarseGrainedExecutorBackend: Cannot register with driver => NPE
>
>              Key: SPARK-8592
>              URL: https://issues.apache.org/jira/browse/SPARK-8592
>          Project: Spark
>       Issue Type: Bug
>       Components: Scheduler, Spark Core
> Affects Versions: 1.4.0
>      Environment: Ubuntu 14.04, Scala 2.11, Java 8
>         Reporter: Sjoerd Mulder
>         Assignee: Xu Chen
>         Priority: Minor
>          Fix For: 1.5.0
>
> I cannot reproduce this consistently, but when submitting jobs just after another has finished it will not come up:
> {code}
> 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
> java.lang.NullPointerException
> 	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
> 	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
> 	at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
> 	at java.lang.String.valueOf(String.java:2982)
> 	at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
> 	at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
> 	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
> 	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
> 	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
> 	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
> 	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> 	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
> 	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 	at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
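The quoted trace pins the NPE inside {{AkkaRpcEndpointRef.toString}}, reached through {{String.valueOf}} while the driver endpoint builds an INFO log message. A minimal Java sketch of that failure mode and a null-safe variant (class and method names here are hypothetical, not Spark's actual code):

```java
// Hypothetical illustration of the bug pattern: a toString() that dereferences
// a field which may still be null throws while a log message is being built.
class EndpointRef {
    private final Object actorRef; // may be null if the remote ref never resolved

    EndpointRef(Object actorRef) {
        this.actorRef = actorRef;
    }

    // Unsafe: explicitly dereferencing the nullable field throws
    // NullPointerException when actorRef is null.
    String unsafeDescribe() {
        return "EndpointRef(" + actorRef.toString() + ")";
    }

    // Defensive variant: guard the nullable field before formatting.
    @Override
    public String toString() {
        return "EndpointRef(" + (actorRef == null ? "<unresolved>" : actorRef) + ")";
    }
}

class NpeInLoggingDemo {
    public static void main(String[] args) {
        EndpointRef ref = new EndpointRef(null);
        try {
            ref.unsafeDescribe();
        } catch (NullPointerException e) {
            // Mirrors the stack trace above: the NPE surfaces during logging.
            System.out.println("NPE while formatting log message");
        }
        System.out.println(ref); // prints EndpointRef(<unresolved>)
    }
}
```

Per the issue's Fix For field, the fix shipped in 1.5.0; the general defensive point the sketch illustrates is that {{toString()}} invoked implicitly during logging must tolerate partially initialized state.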