[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-08-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421327#comment-15421327
 ] 

Nezih Yigitbasi commented on SPARK-16158:
-

[~andrewor14] [~rxin] how do you guys feel about this?

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supported plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics, and also for plugging in different heuristics per job, etc.






[jira] [Commented] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-06-23 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346744#comment-15346744
 ] 

Nezih Yigitbasi commented on SPARK-16158:
-

Thanks [~sowen] for your input, I understand your concern. Our end goal is to 
experiment with different heuristics and provide some of them out of the box, so 
that users can pick one depending on their workload, what they want to optimize, 
etc. This issue is really the first step of that investigation. The main reason 
we want to explore new heuristics is a set of shortcomings of the default 
heuristic that we noticed with some of our jobs. For example, if a job has short 
tasks (say a few hundred ms), the exponential ramp-up logic leaves a large 
number of executors idle: by the time the containers are allocated, the tasks 
are already done. Another shortcoming shows up at stage boundaries when there is 
a straggler, which is not uncommon for the complex production jobs we run: the 
default heuristic kills all the executors, as most of them are idle, and once 
the stage finishes it takes a while to ramp back up to a decent capacity, so the 
ramp-down/decay process should be more "gentle". I wonder what other 
users/committers think too. If other users can share their production experience 
with the default dynamic allocation heuristic, that would be super helpful for 
this discussion.
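
To make the idea concrete, here is a minimal sketch of what a pluggable heuristic 
hook could look like. The trait name, method signature, and the "gentle decay" 
policy below are hypothetical illustrations only, not an existing Spark API:

{code}
// Hypothetical plug-in point (illustrative only): given the current scheduler state,
// a heuristic decides how many executors the application should target.
trait ExecutorAllocationHeuristic {
  def targetExecutors(
      pendingTasks: Int,        // tasks waiting to be scheduled
      runningTasks: Int,        // tasks currently executing
      currentExecutors: Int,    // executors currently allocated
      idleExecutors: Int): Int  // executors with no running tasks
}

// A "gentle" ramp-down policy of the kind described above: scale to the task demand,
// but never drop below half of the current capacity in a single step, so a straggler
// at a stage boundary does not trigger a full scale-down followed by a slow ramp-up.
class GentleDecayHeuristic(tasksPerExecutor: Int) extends ExecutorAllocationHeuristic {
  override def targetExecutors(
      pendingTasks: Int,
      runningTasks: Int,
      currentExecutors: Int,
      idleExecutors: Int): Int = {
    val demand = math.ceil((pendingTasks + runningTasks).toDouble / tasksPerExecutor).toInt
    math.max(demand, currentExecutors / 2)
  }
}
{code}

A per-job choice could then be made through a configuration property naming the 
heuristic class, similar in spirit to other pluggable components.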

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>
> It would be nice if Spark supported plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics, and also for plugging in different heuristics per job, etc.






[jira] [Created] (SPARK-16158) Support pluggable dynamic allocation heuristics

2016-06-22 Thread Nezih Yigitbasi (JIRA)
Nezih Yigitbasi created SPARK-16158:
---

 Summary: Support pluggable dynamic allocation heuristics
 Key: SPARK-16158
 URL: https://issues.apache.org/jira/browse/SPARK-16158
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Nezih Yigitbasi


It would be nice if Spark supported plugging in custom dynamic allocation 
heuristics. This feature would be useful for experimenting with new heuristics, 
and also for plugging in different heuristics per job, etc.






[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-16 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334290#comment-15334290
 ] 

Nezih Yigitbasi commented on SPARK-15782:
-

Just opened a new one. I was busy testing the PR on both Scala 2.10 and 2.11.

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.
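
A simplified illustration of the mechanism described above (this is not the actual 
Spark source; the object and method names are made up): the repl builds its extra 
classpath from the {{spark.jars}} property, so if spark-submit no longer sets that 
property in client mode, jars resolved via {{--packages}} never reach the 
interpreter's classpath.

{code}
// Illustrative sketch only, not Spark's repl code.
object ReplClasspathSketch {
  // The repl-side view: read the comma-separated jar list from the spark.jars
  // system property and use it as the interpreter's extra classpath.
  def interpreterClasspath(): Seq[String] =
    sys.props.get("spark.jars")
      .map(_.split(",").filter(_.nonEmpty).toSeq)
      .getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    // If the submit-side code that used to set spark.jars (removed in SPARK-12343)
    // is gone, this list is empty and classes from --packages cannot be resolved.
    println(interpreterClasspath().mkString(java.io.File.pathSeparator))
  }
}
{code}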






[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332797#comment-15332797
 ] 

Nezih Yigitbasi commented on SPARK-15782:
-

Reopened; I will submit a PR that includes Marcelo's fix on top of mine.

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.






[jira] [Reopened] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi reopened SPARK-15782:
-

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.






[jira] [Updated] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-06 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi updated SPARK-15782:

Summary: --packages doesn't work with the spark-shell  (was: --packages 
doesn't work the spark-shell)

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.






[jira] [Created] (SPARK-15782) --packages doesn't work the spark-shell

2016-06-06 Thread Nezih Yigitbasi (JIRA)
Nezih Yigitbasi created SPARK-15782:
---

 Summary: --packages doesn't work the spark-shell
 Key: SPARK-15782
 URL: https://issues.apache.org/jira/browse/SPARK-15782
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Nezih Yigitbasi


When {{--packages}} is specified with {{spark-shell}} the classes from those 
packages cannot be found, which I think is due to some of the changes in 
{{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
{{spark.jars}} system property in client mode, which is used by the repl main 
class to set the classpath.






[jira] [Created] (SPARK-14042) Add support for custom coalescers

2016-03-21 Thread Nezih Yigitbasi (JIRA)
Nezih Yigitbasi created SPARK-14042:
---

 Summary: Add support for custom coalescers
 Key: SPARK-14042
 URL: https://issues.apache.org/jira/browse/SPARK-14042
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Nezih Yigitbasi


Per our discussion on the mailing list (please see 
[here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E]), 
it would be nice to be able to specify a custom coalescing policy, as the 
current {{coalesce()}} method only lets the user specify the target number of 
partitions and offers little further control. The need for this feature popped 
up when I wanted to merge small files by coalescing partitions by size.
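
As a sketch of the kind of policy this would enable, here is a hypothetical 
size-based grouping strategy. The trait and signatures below are illustrative 
only, not an actual Spark API: partitions are packed into groups until each group 
reaches a target byte size, which is roughly what merging small files by size 
requires.

{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical coalescing policy interface (illustrative only).
trait CoalescingPolicy {
  /** Groups partition indices, given the estimated size in bytes of each partition. */
  def group(partitionSizes: Array[Long]): Seq[Seq[Int]]
}

// Pack consecutive partitions until the running total exceeds targetBytes,
// so many small partitions are merged into fewer, reasonably sized ones.
class SizeBasedCoalescer(targetBytes: Long) extends CoalescingPolicy {
  override def group(partitionSizes: Array[Long]): Seq[Seq[Int]] = {
    val groups = ArrayBuffer.empty[Seq[Int]]
    var current = ArrayBuffer.empty[Int]
    var currentBytes = 0L
    partitionSizes.zipWithIndex.foreach { case (size, idx) =>
      if (current.nonEmpty && currentBytes + size > targetBytes) {
        groups += current.toSeq
        current = ArrayBuffer.empty[Int]
        currentBytes = 0L
      }
      current += idx
      currentBytes += size
    }
    if (current.nonEmpty) groups += current.toSeq
    groups.toSeq
  }
}
{code}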






[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-03-19 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200543#comment-15200543
 ] 

Nezih Yigitbasi commented on SPARK-11293:
-

[~joshrosen] any plans to fix this? I believe we are hitting this issue with 
{{1.6.0}}:

{code}
16/03/17 21:38:38 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@6ecba8f1: 32.0 KB
16/03/17 21:38:38 INFO memory.TaskMemoryManager: 1528015093 bytes of memory 
were used by task 103134 but are not associated with specific consumers
16/03/17 21:38:38 INFO memory.TaskMemoryManager: 1528047861 bytes of memory are 
used for execution and 80608434 bytes of memory are used for storage
16/03/17 21:38:38 ERROR executor.Executor: Managed memory leak detected; size = 
1528015093 bytes, TID = 103134
16/03/17 21:38:38 ERROR executor.Executor: Exception in task 448.0 in stage 
273.0 (TID 103134)
java.lang.OutOfMemoryError: Unable to acquire 128 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/03/17 21:38:38 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception 
in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Unable to acquire 128 bytes of memory, got 0
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/03/17 21:38:38 INFO storage.DiskBlockManager: Shutdown hook called
16/03/17 21:38:38 INFO util.ShutdownHookManager: Shutdown hook called
{code}

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
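
The second item above refers to the completion-iterator pattern; a minimal, 
self-contained sketch of that pattern is below (Spark has its own internal 
{{CompletionIterator}}; this is an illustration, not that class). The wrapper runs 
a cleanup callback exactly once, when the underlying iterator is exhausted, which 
is how a sorter's execution memory can be released as soon as its output has been 
fully consumed.

{code}
// Illustrative sketch of the completion-iterator pattern (not Spark's internal class).
class CleanupIterator[A](underlying: Iterator[A])(cleanup: () => Unit) extends Iterator[A] {
  private var cleaned = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) {
      cleaned = true
      cleanup() // e.g. release the sorter's spill files and execution memory
    }
    more
  }

  override def next(): A = underlying.next()
}

// Hypothetical usage: wrap the records produced by a sorter so its resources are
// freed when the consumer finishes iterating.
// val records = new CleanupIterator(sorter.iterator)(() => sorter.stop())
{code}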






[jira] [Commented] (SPARK-12221) Add CPU time metric to TaskMetrics

2016-03-02 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176697#comment-15176697
 ] 

Nezih Yigitbasi commented on SPARK-12221:
-

any plans to get this in?

> Add CPU time metric to TaskMetrics
> --
>
> Key: SPARK-12221
> URL: https://issues.apache.org/jira/browse/SPARK-12221
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.2
>Reporter: Jisoo Kim
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to have such 
> a metric so that I can retrieve it from the History Server.
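
For illustration, a sketch of one way such a metric could be captured on the JVM 
using the standard {{ThreadMXBean}} API (this is not the TaskMetrics 
implementation, just the underlying mechanism a per-task CPU time metric could be 
built on):

{code}
import java.lang.management.ManagementFactory

object TaskCpuTimeSketch {
  private val threadMXBean = ManagementFactory.getThreadMXBean

  /** Runs `body` and returns (result, cpuTimeNanos) measured for the current thread. */
  def measureCpu[T](body: => T): (T, Long) = {
    val supported = threadMXBean.isCurrentThreadCpuTimeSupported
    val start = if (supported) threadMXBean.getCurrentThreadCpuTime else 0L
    val result = body
    val end = if (supported) threadMXBean.getCurrentThreadCpuTime else 0L
    (result, end - start)
  }

  def main(args: Array[String]): Unit = {
    // Example: CPU time spent summing a range, as opposed to wall-clock time.
    val (sum, cpuNanos) = measureCpu((1L to 10000000L).sum)
    println(s"sum=$sum cpuTimeMs=${cpuNanos / 1000000}")
  }
}
{code}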






[jira] [Updated] (SPARK-13328) Possible Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi updated SPARK-13328:

Summary: Possible Poor read performance for broadcast variables with 
dynamic resource allocation  (was: Poor read performance for broadcast 
variables with dynamic resource allocation)

> Possible Poor read performance for broadcast variables with dynamic resource 
> allocation
> ---
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, out of which 
> there may be tens of stale ones. What we have observed is that, with the 
> default settings of 3 max retries and 5s between retries (that's 15s per 
> location), the time it takes to read a broadcast variable can be as high as 
> ~17m (the log below shows the failed 70th block fetch attempt, where each 
> attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}






[jira] [Updated] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi updated SPARK-13328:

Summary: Possible poor read performance for broadcast variables with 
dynamic resource allocation  (was: Possible Poor read performance for broadcast 
variables with dynamic resource allocation)

> Possible poor read performance for broadcast variables with dynamic resource 
> allocation
> ---
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, out of which 
> there may be tens of stale ones. What we have observed is that, with the 
> default settings of 3 max retries and 5s between retries (that's 15s per 
> location), the time it takes to read a broadcast variable can be as high as 
> ~17m (the log below shows the failed 70th block fetch attempt, where each 
> attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}






[jira] [Comment Edited] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896
 ] 

Nezih Yigitbasi edited comment on SPARK-13328 at 2/15/16 11:30 PM:
---

Although this long time can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
it may not be desirable to reduce the number of retries globally, and reducing 
the retry wait may increase the load on the serving block manager.

I already have a fix that adds a new config parameter, 
{{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh 
the list of block locations from the driver while going through these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled, and its default value is Int.MaxValue, so the behavior doesn't change 
even when dynamic allocation is enabled (refreshing the locations may not be 
necessary in small clusters).

If you think such a fix is valuable, I will be happy to create a PR.


was (Author: nezihyigitbasi):
Although this long time can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
it may not be desirable to reduce the number of retries globally, and reducing 
the retry wait may increase the load on the serving block manager.

I already have a fix that adds a new config parameter, 
{{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh 
the list of block locations from the driver while going through these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled, and its default value is Int.MaxValue, so the behavior doesn't change 
even when dynamic allocation is enabled (refreshing the locations may not be 
necessary in small clusters).

> Poor read performance for broadcast variables with dynamic resource allocation
> --
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, out of which 
> there may be tens of stale ones. What we have observed is that, with the 
> default settings of 3 max retries and 5s between retries (that's 15s per 
> location), the time it takes to read a broadcast variable can be as high as 
> ~17m (the log below shows the failed 70th block fetch attempt, where each 
> attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}






[jira] [Commented] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147896#comment-15147896
 ] 

Nezih Yigitbasi commented on SPARK-13328:
-

Although this long time can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
it may not be desirable to reduce the number of retries globally, and reducing 
the retry wait may increase the load on the serving block manager.

I already have a fix that adds a new config parameter, 
{{spark.block.failures.beforeLocationRefresh}}, which determines when to refresh 
the list of block locations from the driver while going through these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled, and its default value is Int.MaxValue, so the behavior doesn't change 
even when dynamic allocation is enabled (refreshing the locations may not be 
necessary in small clusters).
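
A simplified sketch of the retry loop this fix would change (illustrative only, 
not Spark's BlockManager code; {{fetchLocations}} and {{fetchBlock}} stand in for 
the driver and BlockManager calls, and the refresh threshold corresponds to the 
proposed {{spark.block.failures.beforeLocationRefresh}}):

{code}
object BlockFetchSketch {
  // Illustrative sketch, not Spark's BlockManager code.
  def fetchWithLocationRefresh[T](
      fetchLocations: () => Seq[String],   // ask the driver for the current block locations
      fetchBlock: String => Option[T],     // try to fetch the block from one location
      failuresBeforeRefresh: Int,          // the proposed config value
      maxTotalAttempts: Int): Option[T] = {
    var locations = fetchLocations()
    var failuresSinceRefresh = 0
    var attempts = 0
    var result: Option[T] = None
    while (result.isEmpty && locations.nonEmpty && attempts < maxTotalAttempts) {
      result = fetchBlock(locations.head)
      if (result.isEmpty) {
        attempts += 1
        failuresSinceRefresh += 1
        locations = locations.tail
        // After enough consecutive failures, re-fetch the location list so that stale
        // entries left behind by executors removed under dynamic allocation are dropped
        // instead of each being retried for the full maxRetries * retryWait.
        if (failuresSinceRefresh >= failuresBeforeRefresh) {
          locations = fetchLocations()
          failuresSinceRefresh = 0
        }
      }
    }
    result
  }
}
{code}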

> Poor read performance for broadcast variables with dynamic resource allocation
> --
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, out of which 
> there may be tens of stale ones. What we have observed is that, with the 
> default settings of 3 max retries and 5s between retries (that's 15s per 
> location), the time it takes to read a broadcast variable can be as high as 
> ~17m (the log below shows the failed 70th block fetch attempt, where each 
> attempt takes 15s):
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}






[jira] [Created] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)
Nezih Yigitbasi created SPARK-13328:
---

 Summary: Poor read performance for broadcast variables with 
dynamic resource allocation
 Key: SPARK-13328
 URL: https://issues.apache.org/jira/browse/SPARK-13328
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Nezih Yigitbasi


When dynamic resource allocation is enabled, fetching broadcast variables from 
removed executors was causing job failures, and SPARK-9591 fixed this problem 
by trying all locations of a block before giving up. However, the locations of 
a block are retrieved only once from the driver in this process, and the 
locations in this list can be stale due to dynamic resource allocation. This 
situation gets worse when running on a large cluster, as the size of this 
location list can be on the order of several hundred entries, out of which 
there may be tens of stale ones. What we have observed is that, with the 
default settings of 3 max retries and 5s between retries (that's 15s per 
location), the time it takes to read a broadcast variable can be as high as 
~17m (the log below shows the failed 70th block fetch attempt, where each 
attempt takes 15s):

{code}
...
16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) 
(failed attempt 70)
...
16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
18 took 1051049 ms
{code}
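
As a quick sanity check (not part of the original report), the arithmetic behind 
the ~17m figure under the default settings quoted above:

{code}
// 3 max retries with a 5s wait between retries comes to roughly 15s spent per location.
val secondsPerLocation = 3 * 5
// The log shows the 70th failed attempt, i.e. about 70 stale locations were tried.
val failedAttempts = 70
val totalSeconds = secondsPerLocation * failedAttempts  // 1050 s
println(f"${totalSeconds / 60.0}%.1f minutes")          // ~17.5 min, consistent with the
                                                        // "took 1051049 ms" log line above
{code}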






[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE

2016-02-11 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143328#comment-15143328
 ] 

Nezih Yigitbasi commented on SPARK-8592:


We still see this problem with 1.5.2

{code}
16/02/11 07:55:29 ERROR executor.CoarseGrainedExecutorBackend: Cannot register 
with driver: 
akka.tcp://sparkDriver@10.148.8.235:55443/user/CoarseGrainedScheduler
java.lang.NullPointerException
at 
org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:281)
at 
org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:281)
at 
org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:322)
at java.lang.String.valueOf(String.java:2982)
at 
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:138)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:138)
at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:76)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:138)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

> CoarseGrainedExecutorBackend: Cannot register with driver => NPE
> 
>
> Key: SPARK-8592
> URL: https://issues.apache.org/jira/browse/SPARK-8592
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04, Scala 2.11, Java 8, 
>Reporter: Sjoerd Mulder
>Assignee: Xu Chen
>Priority: Minor
> Fix For: 1.5.0
>
>
> I cannot reproduce this consistently, but when a job is submitted just after 
> another one has finished, it will not come up:
> {code}
> 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker 
> akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to 
> akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with 
> driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
> java.lang.NullPointerException
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute

[jira] [Commented] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver => NPE

2016-02-11 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143330#comment-15143330
 ] 

Nezih Yigitbasi commented on SPARK-8592:


Any ideas [~joshrosen]?

> CoarseGrainedExecutorBackend: Cannot register with driver => NPE
> 
>
> Key: SPARK-8592
> URL: https://issues.apache.org/jira/browse/SPARK-8592
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.4.0
> Environment: Ubuntu 14.04, Scala 2.11, Java 8, 
>Reporter: Sjoerd Mulder
>Assignee: Xu Chen
>Priority: Minor
> Fix For: 1.5.0
>
>
> I cannot reproduce this consistently, but when a job is submitted just after 
> another one has finished, it will not come up:
> {code}
> 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker 
> akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to 
> akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
> 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with 
> driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
> java.lang.NullPointerException
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
>   at java.lang.String.valueOf(String.java:2982)
>   at 
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at 
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
>   at 
> org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>   at 
> org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}


