[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-06 Thread Susan X. Huynh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570550#comment-16570550
 ] 

Susan X. Huynh commented on SPARK-25024:


Yes, I agree, the "Running on Mesos" page could probably use a section devoted 
to running a secure cluster, and the main "Spark Security" page might also need 
a few updates if something is not supported in Mesos. I could take a stab at 
it, although I don't have as much experience running Spark on vanilla Mesos 
(vs. Spark on DC/OS).
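A section like that would presumably walk through the standard Spark security properties and note which ones apply on Mesos. A hedged sketch of the kind of settings such a section would cover (illustrative only; which of these Mesos actually supports is exactly what the doc update needs to spell out):

{code:java}
# RPC authentication via a shared secret. On Mesos there is no automatic
# secret distribution, so the operator must supply the same secret to the
# driver and all executors.
spark.authenticate true
spark.authenticate.secret <shared-secret>

# Encrypt RPC traffic (requires spark.authenticate)
spark.network.crypto.enabled true

# TLS for the web UI and file server
spark.ssl.enabled true
{code}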

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our Mesos deployment docs and security docs, and it's not 
> clear at all what types of security are supported or how to set them up for 
> Mesos. I think we should clarify this and state exactly what is supported and 
> what is not.






[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration

2018-04-08 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429950#comment-16429950
 ] 

Susan X. Huynh commented on SPARK-22342:


Good news: I found the root cause of the multiple registration bug, and it is 
not a Spark bug. It is caused by a bug in libmesos: "using a failoverTimeout of 
0 with Mesos native scheduler client can result in infinite subscribe loop", 
https://issues.apache.org/jira/browse/MESOS-8171 . This bug leads to the 
multiple SUBSCRIBE calls seen in the driver logs. Upgrading the libmesos bundle 
in my Docker image to a version with this patch fixed the issue. cc [~skonto]
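For context, the failoverTimeout in question is the one Spark sets on the framework's FrameworkInfo when registering with the master. A minimal sketch of the relevant field, using the Mesos protobuf API (the value shown is illustrative):

{code:java}
import org.apache.mesos.Protos.FrameworkInfo

// With unpatched libmesos, a failover timeout of exactly 0 triggers the
// infinite SUBSCRIBE loop described in MESOS-8171; a strictly positive
// value avoids it.
val frameworkInfo = FrameworkInfo.newBuilder()
  .setUser("")               // empty string lets Mesos fill in the current user
  .setName("Spark Pi")
  .setFailoverTimeout(60.0)  // seconds; illustrative value
  .build()
{code}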

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates offers.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar";
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764b

[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2018-04-02 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423382#comment-16423382
 ] 

Susan X. Huynh commented on SPARK-19320:


Oh, looks like it does retain the old behavior in that case.

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't guarantee 
> exactly how many.
> We should have a configuration that guarantees a specific amount.






[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2018-04-02 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423002#comment-16423002
 ] 

Susan X. Huynh commented on SPARK-19320:


[~yanji84] What happens if spark.mesos.executor.gpus is not set, but 
spark.mesos.gpus.max is set? Should we retain the old behavior? (Similar to the 
behavior for CPUs – see "executorCoresOption" in the code.)
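A hedged sketch of the fallback being asked about, modeled on executorCoresOption (the variable names and values below are placeholders, not the actual patch):

{code:java}
// Assumed inputs, stand-ins for state held by the scheduler backend:
val gpusMax = 4              // spark.mesos.gpus.max
val offeredGpus = 2          // GPUs in the current offer
val totalGpusAcquired = 1    // GPUs already acquired by this framework
val executorGpusOption: Option[Int] = None  // spark.mesos.executor.gpus, unset here

// If spark.mesos.executor.gpus is set, guarantee exactly that many per
// executor; otherwise keep the old behavior of greedily taking up to the max.
val gpusToLaunch = executorGpusOption.getOrElse(
  math.min(offeredGpus, gpusMax - totalGpusAcquired))
{code}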

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't guarantee 
> exactly how many.
> We should have a configuration that guarantees a specific amount.






[jira] [Commented] (SPARK-22342) refactor schedulerDriver registration

2018-03-23 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411780#comment-16411780
 ] 

Susan X. Huynh commented on SPARK-22342:


The multiple re-registration issue can lead to blacklisting and starvation when 
there are multiple executors per host. For example, suppose I have a host with 
8 CPUs, and I specify spark.executor.cores=4. Then two executors could potentially 
get allocated on that host. If they both receive a TASK_LOST, that host will 
get blacklisted (since MAX_SLAVE_FAILURES=2). If this happens on every host, 
the app will be starved. I have hit this bug a lot when running on large 
machines (16-64 CPUs) and specifying a small executor size, 
spark.executor.cores=4.
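A minimal sketch of the failure accounting described above (simplified; names modeled on MesosCoarseGrainedSchedulerBackend, not the actual code):

{code:java}
import scala.collection.mutable

val MAX_SLAVE_FAILURES = 2
val slaveFailures = mutable.Map[String, Int]().withDefaultValue(0)
val blacklistedHosts = mutable.Set[String]()

// Called once per TASK_LOST. With spark.executor.cores=4 on an 8-CPU host,
// two executors land on the same host, so two spurious losses are enough
// to hit MAX_SLAVE_FAILURES and blacklist the entire host.
def onTaskLost(host: String): Unit = {
  slaveFailures(host) += 1
  if (slaveFailures(host) >= MAX_SLAVE_FAILURES) {
    blacklistedHosts += host  // no further offers from this host are accepted
  }
}
{code}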

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and handling the multiple re-registration issue, which invalidates offers.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar";
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c

[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-03-19 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405586#comment-16405586
 ] 

Susan X. Huynh commented on SPARK-23499:


[~pgillet] Here's a different issue: in the proposed design, it is possible for 
some queues to be starved. For example, if the URGENT queue keeps adding 
drivers, the ROUTINE drivers will never get to run, regardless of what the role 
weights are. In YARN, this is addressed by allowing preemption of jobs (as 
described here: 
http://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/).
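To make the concern concrete, here is a hedged sketch of the strict priority ordering in the proposal (illustrative, not the actual dispatcher code): as long as URGENT drivers keep arriving, a ROUTINE driver is never dequeued.

{code:java}
import scala.collection.mutable

case class QueuedDriver(queuePriority: Float, submitDate: Long)

// Highest priority first; ties broken by earliest submit date.
val byPriority = Ordering.by[QueuedDriver, (Float, Long)](d => (d.queuePriority, -d.submitDate))
val queue = mutable.PriorityQueue.empty[QueuedDriver](byPriority)

queue.enqueue(QueuedDriver(0.0f, submitDate = 1))  // ROUTINE, submitted first
queue.enqueue(QueuedDriver(1.0f, submitDate = 2))  // URGENT, submitted later
queue.dequeue()  // always returns the URGENT driver while any is queued
{code}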

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Pascal GILLET
>Priority: Major
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As with YARN, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent, overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
>  






[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers

2018-03-14 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398880#comment-16398880
 ] 

Susan X. Huynh commented on SPARK-23499:


To ensure a consistent policy, maybe the queue should always be based on the 
role? Instead of making it the user's responsibility to create the mapping: 
when submitting an application, the user only specifies the role, not the 
queue. Spark would infer the queue. Have you considered designing it this way? 
[~pgillet]

cc [~skonto]

> Mesos Cluster Dispatcher should support priority queues to submit drivers
> -
>
> Key: SPARK-23499
> URL: https://issues.apache.org/jira/browse/SPARK-23499
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Pascal GILLET
>Priority: Major
> Attachments: Screenshot from 2018-02-28 17-22-47.png
>
>
> As with YARN, Mesos users should be able to specify priority queues to define 
> a workload management policy for queued drivers in the Mesos Cluster 
> Dispatcher.
> Submitted drivers are *currently* kept in order of their submission: the 
> first driver added to the queue will be the first one to be executed (FIFO).
> Each driver could have a "priority" associated with it. A driver with high 
> priority is served (Mesos resources) before a driver with low priority. If 
> two drivers have the same priority, they are served according to their submit 
> date in the queue.
> To set up such priority queues, the following changes are proposed:
>  * The Mesos Cluster Dispatcher can optionally be configured with the 
> _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a 
> float as value. This adds a new queue named _QueueName_ for submitted drivers 
> with the specified priority.
>  Higher numbers indicate higher priority.
>  The user can then specify multiple queues.
>  * A driver can be submitted to a specific queue with 
> _spark.mesos.dispatcher.queue_. This property takes the name of a queue 
> previously declared in the dispatcher as value.
> By default, the dispatcher has a single "default" queue with 0.0 priority 
> (cannot be overridden). If none of the properties above are specified, the 
> behavior is the same as the current one (i.e. simple FIFO).
> Additionally, it is possible to implement a consistent, overall workload 
> management policy throughout the lifecycle of drivers by mapping these 
> priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in 
> the dispatcher to the final states in the Mesos cluster), and by specifying a 
> _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when 
> submitting an application.
> For example, with the URGENT Mesos role:
> {code:java}
> # Conf on the dispatcher side
> spark.mesos.dispatcher.queue.URGENT=1.0
> # Conf on the driver side
> spark.mesos.dispatcher.queue=URGENT
> spark.mesos.role=URGENT
> {code}
>  






[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368331#comment-16368331
 ] 

Susan X. Huynh commented on SPARK-23423:


[~skonto] [~igor.berman] I got some info from a coworker:
{noformat}
The agent will generate a terminal update for each task still in a non-terminal 
state when the executor terminates. These are forwarded through the master (as 
are all agent-generated messages for schedulers) and will be delivered 
"reliably", with an acknowledgement needed from the scheduler.
{noformat}
So, to investigate the missing status updates, I would first look in the agent 
logs around the time the executor was killed, and then check if the master 
received the update.
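A hedged sketch of that investigation (log locations and exact message text vary by Mesos installation; paths below are placeholders):

{noformat}
# On the agent that ran the executor, look for the terminal updates it
# generated when the executor died:
grep -i "TASK_KILLED\|TASK_LOST" /var/log/mesos/mesos-agent.INFO

# On the master, confirm the updates were forwarded to the scheduler:
grep "status update" /var/log/mesos/mesos-master.INFO
{noformat}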

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos version: 1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number 
> of executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> At 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, where 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (in fact, I don't see any 'is now 
> TASK_KILLED' log lines), so this count of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors when dynamic allocation is on.
>  
> Please advise.
> Igor
> Tagging you folks since you were the last to touch this piece of code and 
> can probably advise ([~vanzin], [~skonto], [~susanxhuynh])






[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368261#comment-16368261
 ] 

Susan X. Huynh commented on SPARK-23423:


I'll check if the task updates might be dropped under heavy load. 
[~igor.berman] Normally, you should see the TASK_KILLED updates in the logs, 
something like:

 
{noformat}
15:38:47 INFO TaskSchedulerImpl: Executor 2 on 10.0.1.201 killed by driver.
15:38:47 INFO DAGScheduler: Executor lost: 2 (epoch 0)
15:38:47 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from 
BlockManagerMaster.
15:38:47 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 10.0.1.201, 42805, None)
15:38:47 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
15:38:47 INFO ExecutorAllocationManager: Existing executor 2 has been removed 
(new total is 1)
15:38:48 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now 
TASK_KILLED
15:38:48 INFO BlockManagerMaster: Removal of executor 2 requested
15:38:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove 
non-existent executor 2
{noformat}
 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos version: 1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number 
> of executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> At 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, where 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (in fact, I don't see any 'is now 
> TASK_KILLED' log lines), so this count of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors when dynamic allocation is on.
>  
> Please advise.
> Igor
> Tagging you folks since you were the last to touch this piece of code and 
> can probably advise ([~vanzin], [~skonto], [~susanxhuynh])






[jira] [Updated] (SPARK-22336) When spark-submit cluster mode is run from a Mesos task, the job fails

2017-10-23 Thread Susan X. Huynh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan X. Huynh updated SPARK-22336:
---
Description: 
In Mesos cluster mode, spark-submit passes all local MESOS_xxx env vars to the 
Dispatcher (via RestSubmissionClient). When these env vars are later set in the 
Driver task, they overwrite existing MESOS_xxx env vars, such as 
MESOS_FRAMEWORK_ID and MESOS_EXECUTOR_ID, that are needed to run correctly on 
MESOS. This makes it impossible to run spark-submit from Chronos, or any Mesos 
task, because the resulting driver will inherit the wrong MESOS_xxx and fail 
right away.
A Mesos issue reporting the same behavior: 
https://github.com/mesos/chronos/issues/707

  was:
In Mesos cluster mode, spark-submit passes all local MESOS_xxx env vars to the 
Dispatcher (via RestSubmissionClient). When these env vars are later set in the 
Driver task, they overwrite existing MESOS_xxx env vars, such as 
MESOS_FRAMEWORK_ID and MESOS_EXECUTOR_ID, that are needed to run correctly on 
MESOS. This makes it impossible to run spark-submit from Chronos, or any Mesos 
task, because the resulting driver will inherit the wrong MESOS_xxx fail right 
away.
A Mesos issue reporting the same behavior: 
https://github.com/mesos/chronos/issues/707


> When spark-submit cluster mode is run from a Mesos task, the job fails
> --
>
> Key: SPARK-22336
> URL: https://issues.apache.org/jira/browse/SPARK-22336
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.2
> Environment: Mesos
>Reporter: Susan X. Huynh
>
> In Mesos cluster mode, spark-submit passes all local MESOS_xxx env vars to 
> the Dispatcher (via RestSubmissionClient). When these env vars are later set 
> in the Driver task, they overwrite existing MESOS_xxx env vars, such as 
> MESOS_FRAMEWORK_ID and MESOS_EXECUTOR_ID, that are needed to run correctly on 
> MESOS. This makes it impossible to run spark-submit from Chronos, or any 
> Mesos task, because the resulting driver will inherit the wrong MESOS_xxx and 
> fail right away.
> A Mesos issue reporting the same behavior: 
> https://github.com/mesos/chronos/issues/707






[jira] [Created] (SPARK-22336) When spark-submit cluster mode is run from a Mesos task, the job fails

2017-10-23 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-22336:
--

 Summary: When spark-submit cluster mode is run from a Mesos task, 
the job fails
 Key: SPARK-22336
 URL: https://issues.apache.org/jira/browse/SPARK-22336
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.0.2
 Environment: Mesos
Reporter: Susan X. Huynh


In Mesos cluster mode, spark-submit passes all local MESOS_xxx env vars to the 
Dispatcher (via RestSubmissionClient). When these env vars are later set in the 
Driver task, they overwrite existing MESOS_xxx env vars, such as 
MESOS_FRAMEWORK_ID and MESOS_EXECUTOR_ID, that are needed to run correctly on 
MESOS. This makes it impossible to run spark-submit from Chronos, or any Mesos 
task, because the resulting driver will inherit the wrong MESOS_xxx values and fail right 
away.
A Mesos issue reporting the same behavior: 
https://github.com/mesos/chronos/issues/707
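A hedged sketch of the kind of fix this implies (not the actual patch): filter inherited MESOS_* variables out of the environment before forwarding it to the Dispatcher, so the driver task keeps the values the Mesos agent sets for it.

{code:java}
// Variables like MESOS_FRAMEWORK_ID and MESOS_EXECUTOR_ID must come from
// the agent that launches the driver task, not from the submitter's shell.
val forwardedEnv: Map[String, String] =
  sys.env.filter { case (key, _) => !key.startsWith("MESOS_") }
{code}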






[jira] [Commented] (SPARK-17419) Mesos virtual network support

2017-08-10 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122019#comment-16122019
 ] 

Susan X. Huynh commented on SPARK-17419:


SPARK-21694 allows the user to pass network labels to CNI plugins.

> Mesos virtual network support
> -
>
> Key: SPARK-17419
> URL: https://issues.apache.org/jira/browse/SPARK-17419
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Michael Gummelt
>
> http://mesos.apache.org/documentation/latest/cni/
> This will enable launching executors into virtual networks for isolation and 
> security. It will also enable container per IP.






[jira] [Commented] (SPARK-17419) Mesos virtual network support

2017-08-10 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122018#comment-16122018
 ] 

Susan X. Huynh commented on SPARK-17419:


SPARK-18232 adds the ability to launch containers attached to a CNI network, by 
specifying `--conf spark.mesos.network.name`.
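Example usage (the master URL and network name are placeholders):

{noformat}
spark-submit --master mesos://master:5050 \
  --conf spark.mesos.network.name=my-cni-network \
  ...
{noformat}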

> Mesos virtual network support
> -
>
> Key: SPARK-17419
> URL: https://issues.apache.org/jira/browse/SPARK-17419
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Michael Gummelt
>
> http://mesos.apache.org/documentation/latest/cni/
> This will enable launching executors into virtual networks for isolation and 
> security. It will also enable container per IP.






[jira] [Created] (SPARK-21694) Support Mesos CNI network labels

2017-08-10 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-21694:
--

 Summary: Support Mesos CNI network labels
 Key: SPARK-21694
 URL: https://issues.apache.org/jira/browse/SPARK-21694
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Susan X. Huynh
 Fix For: 2.3.0


Background: SPARK-18232 added the ability to launch containers attached to a 
CNI network by specifying the network name via `spark.mesos.network.name`.

This ticket is to allow the user to pass network labels to CNI plugins. More 
details in the related Mesos documentation: 
http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
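A hedged sketch of the intended usage, following the key:value label format CNI expects (the label property name here is an assumption pending the final patch; keys and values are placeholders):

{code:java}
spark.mesos.network.name=my-cni-network
spark.mesos.network.labels=key1:val1,key2:val2
{code}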






[jira] [Updated] (SPARK-21458) Tear down the framework when failover_timeout > 0 (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan X. Huynh updated SPARK-21458:
---
Description: 
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails

Note: the driver and executors do stop running. The only issue is that the 
framework shows up as "Inactive" rather than "Completed" without the teardown, 
for a period of failover_timeout seconds.

  was:
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails

Note: the driver and executors do stop running. The only issue is that the 
framework shows up as "Inactive" rather than "Completed" without the teardown.
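For reference, a framework left "Inactive" this way can also be torn down by hand through the Mesos master's teardown endpoint; a hedged sketch (host and framework ID are placeholders):

{noformat}
curl -X POST http://master:5050/master/teardown \
  -d 'frameworkId=9764beab-c90a-4b4f-b0ff-44c187851b34-0004'
{noformat}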


> Tear down the framework when failover_timeout > 0 (Mesos cluster mode)
> --
>
> Key: SPARK-21458
> URL: https://issues.apache.org/jira/browse/SPARK-21458
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Susan X. Huynh
>
> When the driver failover_timeout was always set to zero, we relied on the 
> Mesos master to detect the disconnected driver and tear down the framework. 
> When failover_timeout is nonzero, we have to make sure that the driver 
> framework is torn down in all cases. Some cases that require an explicit teardown 
> are:
> # When a driver job submission is killed by the user
> # In --supervise mode, when a driver fails
> Note: the driver and executors do stop running. The only issue is that the 
> framework shows up as "Inactive" rather than "Completed" without the 
> teardown, for a period of failover_timeout seconds.






[jira] [Updated] (SPARK-21458) Tear down the framework when failover_timeout > 0 (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan X. Huynh updated SPARK-21458:
---
Description: 
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails

Note: the driver and executors do stop running. The only issue is that the 
framework shows up as "Inactive" rather than "Completed" without the teardown.

  was:
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails
Note: the driver and executors do stop running. The only issue is that the 
framework shows up as "Inactive" rather than "Completed" without the teardown.


> Tear down the framework when failover_timeout > 0 (Mesos cluster mode)
> --
>
> Key: SPARK-21458
> URL: https://issues.apache.org/jira/browse/SPARK-21458
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Susan X. Huynh
>
> When the driver failover_timeout was always set to zero, we relied on the 
> Mesos master to detect the disconnected driver and tear down the framework. 
> When failover_timeout is nonzero, we have to make sure that the driver 
> framework is torn down in all cases. Some cases that require an explicit teardown 
> are:
> # When a driver job submission is killed by the user
> # In --supervise mode, when a driver fails
> Note: the driver and executors do stop running. The only issue is that the 
> framework shows up as "Inactive" rather than "Completed" without the teardown.






[jira] [Updated] (SPARK-21458) Tear down the framework when failover_timeout > 0 (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan X. Huynh updated SPARK-21458:
---
Description: 
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails
Note: the driver and executors do stop running. The only issue is that the 
framework shows up as "Inactive" rather than "Completed" without the teardown.

  was:
When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails


> Tear down the framework when failover_timeout > 0 (Mesos cluster mode)
> --
>
> Key: SPARK-21458
> URL: https://issues.apache.org/jira/browse/SPARK-21458
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Susan X. Huynh
>
> When the driver failover_timeout was always set to zero, we relied on the 
> Mesos master to detect the disconnected driver and tear down the framework. 
> When failover_timeout is nonzero, we have to make sure that the driver 
> framework is torn down in all cases. Some cases that require an explicit teardown 
> are:
> # When a driver job submission is killed by the user
> # In --supervise mode, when a driver fails
> Note: the driver and executors do stop running. The only issue is that the 
> framework shows up as "Inactive" rather than "Completed" without the teardown.






[jira] [Commented] (SPARK-21419) Support Mesos failover_timeout in driver (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091657#comment-16091657
 ] 

Susan X. Huynh commented on SPARK-21419:


I split this into two sub-tasks: (1) making the failover_timeout configurable 
and (2) adding an explicit teardown to cases where we currently rely on the 
master to time out immediately and do the teardown.

> Support Mesos failover_timeout in driver (Mesos cluster mode)
> -
>
> Key: SPARK-21419
> URL: https://issues.apache.org/jira/browse/SPARK-21419
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Susan X. Huynh
>
> In Mesos cluster mode, the driver framework's failover_timeout is currently 
> set to zero. This means that if the driver temporarily loses connectivity 
> with the master, the driver is considered disconnected, and the master will 
> immediately kill all tasks and executors associated with the framework.
> To avoid this behavior, I would like to make this failover_timeout 
> configurable. A user could then set it to a non-zero value, so that during a 
> temporary disconnection the master would wait before tearing down the 
> framework.






[jira] [Created] (SPARK-21458) Tear down the framework when failover_timeout > 0 (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-21458:
--

 Summary: Tear down the framework when failover_timeout > 0 (Mesos 
cluster mode)
 Key: SPARK-21458
 URL: https://issues.apache.org/jira/browse/SPARK-21458
 Project: Spark
  Issue Type: Sub-task
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Susan X. Huynh


When the driver failover_timeout was always set to zero, we relied on the Mesos 
master to detect the disconnected driver and tear down the framework. When 
failover_timeout is nonzero, we have to make sure that the driver framework is 
torn down in all cases. Some cases that require an explicit teardown are:
# When a driver job submission is killed by the user
# In --supervise mode, when a driver fails






[jira] [Created] (SPARK-21456) Make the driver failover_timeout configurable (Mesos cluster mode)

2017-07-18 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-21456:
--

 Summary: Make the driver failover_timeout configurable (Mesos 
cluster mode)
 Key: SPARK-21456
 URL: https://issues.apache.org/jira/browse/SPARK-21456
 Project: Spark
  Issue Type: Sub-task
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Susan X. Huynh


Instead of setting the driver framework failover_timeout to zero, the 
failover_timeout will be configurable via a spark-submit config option. The 
default value will still be zero if the configuration is not set.
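A hedged sketch of what that would look like at submit time (the property name is an assumption pending the final patch; the URL and values are placeholders):

{noformat}
spark-submit --master mesos://dispatcher:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.driver.failoverTimeout=60 \
  ...
{noformat}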






[jira] [Created] (SPARK-21419) Support Mesos failover_timeout in driver (Mesos cluster mode)

2017-07-14 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-21419:
--

 Summary: Support Mesos failover_timeout in driver (Mesos cluster 
mode)
 Key: SPARK-21419
 URL: https://issues.apache.org/jira/browse/SPARK-21419
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Susan X. Huynh


In Mesos cluster mode, the driver framework's failover_timeout is currently set 
to zero. This means that if the driver temporarily loses connectivity with the 
master, the driver is considered disconnected, and the master will immediately 
kill all tasks and executors associated with the framework.

To avoid this behavior, I would like to make this failover_timeout 
configurable. A user could then set it to a non-zero value, so that during a 
temporary disconnection the master would wait before tearing down the framework.






[jira] [Commented] (SPARK-17966) Support Spark packages with R code on Mesos

2016-11-29 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705635#comment-15705635
 ] 

Susan X. Huynh commented on SPARK-17966:


(Same question here) I'm not sure how to test this issue. Could you provide 
some details about testing this fix, or point me to documentation about it?
[~sunrui] [~felixcheung]

> Support Spark packages with R code on Mesos
> ---
>
> Key: SPARK-17966
> URL: https://issues.apache.org/jira/browse/SPARK-17966
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-17968) Support using 3rd-party R packages on Mesos

2016-11-29 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705629#comment-15705629
 ] 

Susan X. Huynh commented on SPARK-17968:


I'm not sure how to test this issue. Could you provide some details about 
testing this fix, or point me to documentation about it?
[~sunrui] [~felixcheung]

> Support using 3rd-party R packages on Mesos
> ---
>
> Key: SPARK-17968
> URL: https://issues.apache.org/jira/browse/SPARK-17968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-17965) Enable SparkR with Mesos cluster mode

2016-11-29 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15705619#comment-15705619
 ] 

Susan X. Huynh commented on SPARK-17965:


This issue was fixed at the same time as SPARK-17964:
https://github.com/apache/spark/pull/15700

> Enable SparkR with Mesos cluster mode
> -
>
> Key: SPARK-17965
> URL: https://issues.apache.org/jira/browse/SPARK-17965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-16 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581023#comment-15581023
 ] 

Susan X. Huynh commented on SPARK-11524:


Thanks for the advice and for breaking down the sub-issues. For Mesos _cluster_ 
mode, is there additional work? Or is it just the same problem of locating 
sparkr.zip on the slave nodes?

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-15 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578137#comment-15578137
 ] 

Susan X. Huynh commented on SPARK-11524:


I would like to work on this. What are the missing pieces? What configurations 
need to be tested? [~sunrui][~felixcheung]

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>



