[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117241#comment-16117241
 ] 

Thomas Graves commented on SPARK-21656:
---

As I said above, it DOES help the application to keep them alive. The scheduler 
logic will fall back to them at some point, when it goes to rack/any locality or 
when it finishes the tasks that are getting locality on those few nodes. That is 
why I'm saying it's a conflict within Spark.

Spark should be resilient to any weird things happening. In the cases I have 
described we could actually release all of our executors and never ask for more 
within a stage; that is a BUG. We can change the configs so that this doesn't 
normally happen, but a user could change them back, and when they do it 
shouldn't result in a deadlock.
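
For reference, the rack/any fallback described above is driven by Spark's 
delay-scheduling waits. A minimal spark-defaults.conf sketch of those knobs 
follows; the property names are real Spark settings, but the values are only 
illustrative, and the level-specific entries simply default to 
spark.locality.wait when unset:

    # Delay scheduling: how long the scheduler waits at each locality level
    # before falling back to the next one (process -> node -> rack -> any).
    spark.locality.wait           3s
    spark.locality.wait.process   3s
    spark.locality.wait.node      3s
    spark.locality.wait.rack      3s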



> spark dynamic allocation should not idle timeout executors when tasks still 
> to run
> --
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
> Fix For: 2.1.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now Spark lets go of executors when they are idle for 60s (or a
> configurable time). I have seen Spark let them go when they were idle but
> still really needed. I have seen this issue when the scheduler was waiting to
> get node locality, but that takes longer than the default idle timeout. In
> these jobs the number of executors drops very low (less than 10) while there
> are still around 80,000 tasks to run.
> We should consider not allowing executors to idle timeout if they are still
> needed according to the number of tasks to be run.






[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117217#comment-16117217
 ] 

Sean Owen commented on SPARK-21656:
---

I do not understand what the bug is. Configuration says an executor should go 
away if idle for X seconds. Configuration leads tasks to schedule on other 
executors for X seconds. It is correct that it is removed. You are claiming 
that it would help the application, but the application is not scheduling 
anything on the executor. It does not help the app to keep it alive. Right? 
This seems obvious, so we must be talking about something different. You're 
talking about a bunch of other logic, but what would it be based on? All of the 
data it has says the executor will be unused, indefinitely.




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117204#comment-16117204
 ] 

Thomas Graves commented on SPARK-21656:
---

Another option would be just to add logic for Spark to check at some point 
whether it should try reacquiring some. All of that, though, seems like more 
logic than just not letting them go. To me, Spark needs to be more resilient 
about this and should handle various possible conditions. Users shouldn't have 
to tune every single job to account for weird things happening. Note that if 
dynamic allocation is off this doesn't happen, so why should the user get a 
worse experience in this case?
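
Purely as an illustration of that idea, here is a rough Scala sketch of a 
"don't let an executor go while it is still needed" check. Every name in it 
(mayRemoveIdleExecutor, pendingTaskCount, tasksPerExecutor, and so on) is 
invented for the sketch and is not Spark's actual ExecutorAllocationManager API:

    // Hypothetical guard on the idle-timeout path of a dynamic-allocation manager.
    // Only release an idle executor if the executors that would remain can still
    // hold all outstanding tasks at once; otherwise keep it alive.
    def mayRemoveIdleExecutor(pendingTaskCount: Int,
                              runningExecutorCount: Int,
                              tasksPerExecutor: Int): Boolean = {
      val remainingExecutors = math.max(runningExecutorCount - 1, 0)
      // If the remaining executors could not cover the pending tasks, this
      // executor is still "needed" even though the scheduler has been skipping
      // it (for example while waiting for node locality), so do not remove it.
      pendingTaskCount <= remainingExecutors * tasksPerExecutor
    }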




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117200#comment-16117200
 ] 

Thomas Graves commented on SPARK-21656:
---

Why not fix the bug in dynamic allocation? Changing configs is a workaround, 
and like everything else, what are the best configs for everyone's job?

Dynamic allocation is supposed to get you enough executors to run all your 
tasks in parallel (up to your configured limits). This is not allowing that, 
and it's code within Spark that is doing it, not user code. Thus it's a bug in 
my opinion.

The documentation even hints at it; the problem is we just didn't catch this 
issue in the initial code.

From:
http://spark.apache.org/docs/2.2.0/job-scheduling.html#remove-policy

"in that an executor should not be idle if there are still pending tasks to be 
scheduled"

One other option here would be to actually let them go and get new ones. This 
may or may not help, depending on whether it can get ones with better locality. 
It might also just waste time releasing and reacquiring.

I personally would also be OK with changing the locality wait for node to 0, 
which generally works around the problem, but I think this could happen in 
other cases and we should fix this bug too. For instance, say your driver does 
a full GC and can't schedule anything within 60 seconds: you lose those 
executors and never get them back. Or what if you have temporary network 
congestion that your network timeout is big enough to tolerate; you could still 
hit the idle timeout. Yes, we could increase the idle timeout, but in the 
normal working case the idle timeout is meant for when you don't have any tasks 
to run on an executor because your stage has completed enough that you can 
release some. This is not that case.
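
For anyone who just needs the workaround mentioned above, it can be applied per 
job. A hedged example: spark.locality.wait.node is a real Spark setting, while 
the class name and jar are placeholders.

    # Workaround: don't wait for node locality, so executors that would
    # otherwise sit idle (and get reclaimed) are handed tasks right away.
    spark-submit \
      --conf spark.locality.wait.node=0 \
      --class com.example.MyJob my-job.jar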




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117167#comment-16117167
 ] 

Sean Owen commented on SPARK-21656:
---

If the issue is "given more time", then why not increase the idle timeout, or 
indeed the locality settings? Why does this need another configuration? It 
sounds like it's at best a change to defaults, but how about starting by having 
the app care less about locality? It doesn't make sense to say that executors 
that are, by definition, not needed according to a user's config should not be 
reclaimed because the config is wrong.
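
Concretely, the tuning suggested here maps onto two existing settings. An 
illustrative spark-defaults.conf snippet; the defaults noted in the comments 
match the Spark 2.x documentation, and the values shown are examples rather 
than recommendations:

    # Let idle executors live longer before dynamic allocation reclaims them
    # (the default is 60s).
    spark.dynamicAllocation.executorIdleTimeout   600s
    # Make the app care less about locality by falling back sooner
    # (the default is 3s).
    spark.locality.wait                           1s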




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117159#comment-16117159
 ] 

Thomas Graves commented on SPARK-21656:
---

If given more time, the scheduler would have fallen back to use those executors 
for rack-local or any locality. Yes, you can get around this by changing the 
locality settings (which is what the workaround is), but I don't think that is 
what should happen. It's two features whose timeouts conflict, and it is the 
defaults we ship with that cause bad things to happen. I do think we should 
look at the locality logic in the scheduler more to see if there is anything to 
improve there, but I haven't had time to do that.

The thing is that dynamic allocation never gets more executors for the same 
stage once it has acquired them and let them idle timeout. So if you hit some 
weird situation, you end up with very few executors to run thousands of tasks. 
In my opinion it's better to hold those executors and let the normal scheduler 
logic work.

We can add a config flag for this if needed, if people would like this 
behavior, but I think that conflicts with the scheduler logic.




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117122#comment-16117122
 ] 

Sean Owen commented on SPARK-21656:
---

Good point. In that case, what's wrong with killing the executor? If the 
scheduler is consistently preferring locality enough to let those executors go 
idle, then either those settings are wrong or those executors aren't needed. 
What's the argument that the app needs them if no tasks are scheduling?




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117086#comment-16117086
 ] 

Thomas Graves commented on SPARK-21656:
---

The executor can be idle if the scheduler doesn't put any tasks on it. The 
scheduler can skip executors due to the locality settings 
(spark.locality.wait.node). We have seen this many times now, where it gets 
into a pattern in which some executors get node locality and others don't. The 
scheduler skips many of the executors that don't get locality, and eventually 
they idle timeout while there are still tens of thousands of tasks left.
We generally see this with very large jobs that have like 1000 executors, 
15 map tasks.

We shouldn't allow them to idle timeout if we still need them.




[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116975#comment-16116975
 ] 

Sean Owen commented on SPARK-21656:
---

I don't see how an executor would be idle if there is a task to run, unless of 
course you changed the locality settings a lot. There's no real detail here 
that would establish a problem in Spark. 
