[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-20 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370114#comment-16370114
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] I've investigated failures a bit more so found 2 reasons in our case: 
OOM errors that happen sporadically(not sure why exactly),  jdwp port 
collision(probably mesos tried to start 2nd executor on same mesos slave for 
some application)

We statically configure for every application jdwp port so that we can debug if 
needed executor's codeprobably this is main reason in our case, maybe I 
need to disable permitting to start >1 executors on every slave for some 
application

 

I'll close this Jira as duplicate in favor to 19755. I've also prepared PR to 
fix 19755 see https://github.com/apache/spark/pull/20640

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370074#comment-16370074
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

[~igor.berman] Thnx for supplying the info. We are aware of SPARK-19755 
actually we want to fix this for quite some time. The hardcoded value is not 
the way to go for sure. One note here though.. try to avoid also the failure of 
the executors in the first place, even if you tolerate more failures they 
shouldnt fail. You can allocate different port ranges AFIK for your tasks or by 
default ports are random from what I recall. I am still curious how they 
fail... maybe there is an issue there as well... 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] so I've run application today with relevant logs at debug level. 
Seems like with dynamic allocation on with executors starting and shutting down 
the chances that every slave will get 2 failures staring some executor are much 
higher that in regular case(without dynamic allocation)  seems like SPARK-19755 
is the core issue here - after half day of long running driver in client mode 
almost 1/3 of slaves out of all mesos slaves could be marked as blacklisted.

the reasons for executor failures might be different and transient(e.g. port 
collision)

I think I'll close this Jira as duplicate for SPARK-19755, WDYT?

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
> Attachments: no-dyn-allocation-failed-no-statusUpdate.txt
>
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-18 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368594#comment-16368594
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] today I haven't managed to run with dynamic allocation on, however 
attached details for following situation:

one of executors failed while running without dyn.allocation. All parties seem 
like getting all the updates: slave agent, master, and even framework, but not  
MesosCoarseGrainedSchedulerBackend even though TaskSchedulerImpl did get "Lost 
executor 15"

see [^no-dyn-allocation-failed-no-statusUpdate.txt]

 

I'll enable dyn.allocation tomorrow. Meanwhile, do you think the attached 
behavior might be problematic?

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
> Attachments: no-dyn-allocation-failed-no-statusUpdate.txt
>
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368331#comment-16368331
 ] 

Susan X. Huynh commented on SPARK-23423:


[~skonto] [~igor.berman] I got some info from a coworker:
{noformat}
The agent will generate a terminal update for each task still in a non-terminal 
state when the executor terminates. These are forward through the master (as 
are all agent generated messages for schedulers) and will be delivered 
"reliability" with an acknowledgement needed from the scheduler.
{noformat}
So, to investigate the missing status updates, I would first look in the agent 
logs around the time the executor was killed, and then check if the master 
received the update.

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368261#comment-16368261
 ] 

Susan X. Huynh commented on SPARK-23423:


I'll check if the task updates might be dropped under heavy load. 
[~igor.berman] Normally, you should see the TASK_KILLED updates in the logs, 
something like:

 
{noformat}
15:38:47 INFO TaskSchedulerImpl: Executor 2 on 10.0.1.201 killed by driver.
15:38:47 INFO DAGScheduler: Executor lost: 2 (epoch 0)
15:38:47 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from 
BlockManagerMaster.
15:38:47 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 10.0.1.201, 42805, None)
15:38:47 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
15:38:47 INFO ExecutorAllocationManager: Existing executor 2 has been removed 
(new total is 1)
15:38:48 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now 
TASK_KILLED
15:38:48 INFO BlockManagerMaster: Removal of executor 2 requested
15:38:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove 
non-existent executor 2
{noformat}
 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366162#comment-16366162
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] I'll at Sunday probably

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365468#comment-16365468
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

The task updates delivery is a responsibility of mesos master AFIK. Btw you 
should see the status of tasks on the agent log (assuming you are not loosing 
agents right?) where the mesos executor was running on. [~susanxhuynh] is it 
possible that under heavy load the task updates are not delivered?

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365438#comment-16365438
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] do you think it's could be connected to 
https://issues.apache.org/jira/browse/SPARK-21460 ? The Jira is connected to 
Yarn, but seems like it might be same cause

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365436#comment-16365436
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto], yes this is correct. Besides TASK_RUNNING reports(many) there are 
only 2 TASK_FAILURE reports present(there are no FINISHED, LOST tasks)
We have cluster with approx 100 machines and for this specific framework many 
executors were running (40 or so), many of them were killed during scale 
down(>> 2). I can see in mesos ui that there are completed tasks with KILLED 
state for this application.

the question is where all other reports? Can you think of any reason they won't 
be reported or driver won't register it?

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365295#comment-16365295
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

[~igor.berman] From the log I see that the scheduler received two task updates 
with state TASK_FAILURE, is that correct?

The executor removal is triggered if the status of the task received belongs to 
any of the following states: 

FINISHED, FAILED, KILLED, LOST.

[https://github.com/apache/spark/blob/ed8647609883fcef16be5d24c2cb4ebda25bd6f0/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L640]

[https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/TaskState.scala#L22]

 

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365194#comment-16365194
 ] 

Igor Berman commented on SPARK-23423:
-

Hi [~skonto], thanks for your response!

Yes you are right, the slaves are not removed, but taskIds are. However in my 
case(maybe connected to mesos version) I see that the only updates I'm getting 
for task statuses are fro TASK_RUNNING but not for TASK_KILLED primary

I'm referencing this line:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L609]

here is grep:
{code:java}
[root@node mycomp]# zgrep "is now" /var/log/mycomp/my-app.*.log.gz | grep -v 
TASK_RUNNING
/var/log/mycomp/my-app.12.log.gz:2018-02-12 15:01:31,329 INFO [Thread-56] 
MesosCoarseGrainedSchedulerBackend [] - Mesos task 17 is now TASK_FAILED
/var/log/mycomp/my-app.8.log.gz:2018-02-05 23:52:16,534 INFO [Thread-62] 
MesosCoarseGrainedSchedulerBackend [] - Mesos task 32 is now TASK_FAILED{code}
 

so my thought is that those updates that cause taskIds to be removed(when 
executor is killed due to scaling down) are somehow lost, unless I'm missing 
something...

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364861#comment-16364861
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

Hi [~igor.berman]. Looking at the code again I think when there is a status 
update tasksIds of dead tasks are removed:

[https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L732]

Slaves are not removed but task Ids are, maybe something else is not working. 
Do you have a log at the time of the issue to attach?

The test you have is ok but does not trigger deletion for the tasks in the case 
of a failure.

Btw the behavior for checking the upper limit of the num of the executors you 
are referring to was added here: spark-16944.

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org