[jira] [Commented] (MESOS-7271) JNI SIGSEGV failed when connecting spark to mesos master
[ https://issues.apache.org/jira/browse/MESOS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016089#comment-16016089 ]

Michael Gummelt commented on MESOS-7271:
----------------------------------------

No, Oracle 1.8.0_112

> JNI SIGSEGV failed when connecting spark to mesos master
> --------------------------------------------------------
>
>                 Key: MESOS-7271
>                 URL: https://issues.apache.org/jira/browse/MESOS-7271
>             Project: Mesos
>          Issue Type: Bug
>          Components: java api
>    Affects Versions: 1.1.0, 1.2.0
>         Environment: Ubuntu 16.04, OpenJDK 8, Spark 2.1.1
>            Reporter: Qi Cui
>
> Run starting. Expected test count is: 1
> SampleDataFrameTest:
> 17/03/20 11:53:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0320 11:53:19.775842  4679 process.cpp:1071] libprocess is initialized on 192.168.0.99:38293 with 8 worker threads
> I0320 11:53:19.775975  4679 logging.cpp:199] Logging to STDERR
> I0320 11:53:19.789871  4725 sched.cpp:226] Version: 1.1.0
> I0320 11:53:19.832826  4717 sched.cpp:330] New master detected at master@192.168.0.50:5050
> I0320 11:53:19.838253  4717 sched.cpp:341] No credentials provided. Attempting to register without authentication
> I0320 11:53:19.838337  4717 sched.cpp:820] Sending SUBSCRIBE call to master@192.168.0.50:5050
> I0320 11:53:19.840265  4717 sched.cpp:853] Will retry registration in 32.354951ms if necessary
> I0320 11:53:19.844734  4717 sched.cpp:743] Framework registered with 6e147824-5d88-411b-9c09-a7137565c309-0001
> I0320 11:53:19.864850  4717 sched.cpp:757] Scheduler::registered took 20.022604ms
> ERROR: exception pending on entry to FindMesosClass()
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x7ffa06fea4a6, pid=4677, tid=0x7ff9a1a46700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_121-b13) (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x6744a6]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /media/sf_G_DRIVE/src/spark-testing-base/hs_err_pid4677.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> #

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (MESOS-7271) JNI SIGSEGV failed when connecting spark to mesos master
[ https://issues.apache.org/jira/browse/MESOS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940744#comment-15940744 ]

Michael Gummelt commented on MESOS-7271:
----------------------------------------

I don't know, but I've been running Spark 2.1 against Mesos 1.2 w/o any problems, so I can't repro this.
[jira] [Created] (MESOS-6875) Copy backend fails to copy container
Michael Gummelt created MESOS-6875:
--------------------------------------

             Summary: Copy backend fails to copy container
                 Key: MESOS-6875
                 URL: https://issues.apache.org/jira/browse/MESOS-6875
             Project: Mesos
          Issue Type: Bug
          Components: agent, containerization
    Affects Versions: 1.1.0
            Reporter: Michael Gummelt

cc [~gilbert]

I get the following error when trying to launch a custom executor in mgummelt/couchbase:latest (which is just ubuntu:14.04 with {{erl}} installed).

{code}
E0106 19:43:18.759450  3597 slave.cpp:4562] Container 'c1958040-3ca0-4d46-ab32-0c307919be9b' for executor 'server__5cebe7d5-28c3-465c-a442-0ecd49364e62' of framework dbf21cd6-e559-45cf-a159-704aa10d2482-0002 failed to start: Collect failed: Failed to copy layer:
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Africa/Lusaka': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Africa/Mbabane': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/America/Curacao': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Asia/Katmandu': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Asia/Kuwait': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Asia/Thimphu': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Asia/Urumqi': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Atlantic/St_Helena': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Australia/Lord_Howe': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Australia/North': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Australia/Sydney': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Australia/Tasmania': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Pacific/Easter': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Pacific/Saipan': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/Zulu': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/right/Africa/Lusaka': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/right/Africa/Mbabane': Too many levels of symbolic links
cp: cannot stat '/var/lib/mesos/slave/provisioner/containers/c1958040-3ca0-4d46-ab32-0c307919be9b/backends/copy/rootfses/e838669b-c728-4609-961e-218584210909/usr/share/zoneinfo/right/America/Curacao': Too many levels of symbolic
{code}
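The repeated cp failures above are ELOOP errors: `cp` reports "Too many levels of symbolic links" when `stat(2)` chases a symlink cycle, which suggests the copied rootfs contains links that resolve back into themselves. A minimal, self-contained Java sketch (not Mesos code, and assuming a POSIX filesystem that supports symlinks) reproducing the underlying condition:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration of the ELOOP condition behind "Too many levels of symbolic
// links": a symlink that resolves to itself cannot be stat'ed by following
// it, which is exactly what `cp` tries to do on each source entry.
public class SymlinkLoopDemo {
    public static boolean statFollowFails() {
        try {
            Path dir = Files.createTempDirectory("eloop-demo");
            Path loop = dir.resolve("loop");
            // loop -> loop: a one-link cycle.
            Files.createSymbolicLink(loop, loop.getFileName());
            try {
                Files.size(loop); // follows the link, like cp's stat of the target
                return false;
            } catch (IOException e) {
                return true;      // ELOOP surfaces as an IOException
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // setup failure, not the demo itself
        }
    }

    public static void main(String[] args) {
        System.out.println("stat-by-follow failed: " + statFollowFails());
    }
}
```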
[jira] [Created] (MESOS-6874) Agent silently ignores FS isolation when protobuf is malformed
Michael Gummelt created MESOS-6874:
--------------------------------------

             Summary: Agent silently ignores FS isolation when protobuf is malformed
                 Key: MESOS-6874
                 URL: https://issues.apache.org/jira/browse/MESOS-6874
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.1.0
            Reporter: Michael Gummelt

cc [~vinodkone]

I accidentally set my Mesos ContainerInfo to include a DockerInfo instead of a MesosInfo:

{code}
executorInfoBuilder.setContainer(
    Protos.ContainerInfo.newBuilder()
        .setType(Protos.ContainerInfo.Type.MESOS)
        .setDocker(Protos.ContainerInfo.DockerInfo.newBuilder()
            .setImage(podSpec.getContainer().get().getImageName())));
{code}

I would have expected a validation error before or during containerization, but instead the agent silently ignored filesystem isolation altogether and launched my executor on the host filesystem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
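The missing check is straightforward to model. The sketch below is a hypothetical, self-contained illustration of the validation the agent could perform before containerizing; `ContainerType` and `ContainerConfig` are stand-ins for the real Mesos protobufs, not the actual agent API:

```java
// Hypothetical model: a MESOS-typed ContainerInfo carrying a DockerInfo
// (or vice versa) should be rejected up front, not silently ignored.
// ContainerType and ContainerConfig are stand-ins for the real protobufs.
public class ContainerInfoValidator {
    public enum ContainerType { MESOS, DOCKER }

    public static final class ContainerConfig {
        public final ContainerType type;
        public final boolean hasDockerInfo;
        public final boolean hasMesosInfo;

        public ContainerConfig(ContainerType type, boolean hasDockerInfo, boolean hasMesosInfo) {
            this.type = type;
            this.hasDockerInfo = hasDockerInfo;
            this.hasMesosInfo = hasMesosInfo;
        }
    }

    // Returns an error message for a mismatched config, or null if consistent.
    public static String validate(ContainerConfig c) {
        if (c.type == ContainerType.MESOS && c.hasDockerInfo) {
            return "ContainerInfo.type is MESOS but DockerInfo is set";
        }
        if (c.type == ContainerType.DOCKER && c.hasMesosInfo) {
            return "ContainerInfo.type is DOCKER but MesosInfo is set";
        }
        return null;
    }

    public static void main(String[] args) {
        // The misconfiguration from the report: type MESOS plus a DockerInfo.
        ContainerConfig bad = new ContainerConfig(ContainerType.MESOS, true, false);
        System.out.println(validate(bad));
    }
}
```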
[jira] [Created] (MESOS-6754) Include command in task's state.json entry
Michael Gummelt created MESOS-6754:
--------------------------------------

             Summary: Include command in task's state.json entry
                 Key: MESOS-6754
                 URL: https://issues.apache.org/jira/browse/MESOS-6754
             Project: Mesos
          Issue Type: Improvement
          Components: master
            Reporter: Michael Gummelt

I often want to determine which command a task is running without having to SSH into the box and run {{ps}}. I'm currently doing this for HDFS, for example.
[jira] [Commented] (MESOS-6113) Offer Quota resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459148#comment-15459148 ]

Michael Gummelt commented on MESOS-6113:
----------------------------------------

Maybe. It's not clear to me from the title or the description that MESOS-4392 is proposing to mark quota as revocable.

> Offer Quota resources as revocable
> ----------------------------------
>
>                 Key: MESOS-6113
>                 URL: https://issues.apache.org/jira/browse/MESOS-6113
>             Project: Mesos
>          Issue Type: Task
>          Components: allocation
>    Affects Versions: 1.0.1
>            Reporter: Michael Gummelt
>
> *Goal:*
> I have high-priority Spark jobs, and best-effort jobs. I need my high-priority jobs to preempt my best-effort jobs, so I'd like to launch the best-effort jobs on revocable resources.
>
> *Problem:*
> Revocable resources are currently only created via oversubscription, where resources allocated to but not used by a framework will be offered to other frameworks. This doesn't support the ability for a high-pri framework to start up and preempt a low-pri framework.
>
> *Solution:*
> Let's allow quota (and ideally any reserved resources) to be configurable to be offered as revocable resources to other frameworks that don't register with the role.
[jira] [Commented] (MESOS-6113) Offer reserved resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456102#comment-15456102 ]

Michael Gummelt commented on MESOS-6113:
----------------------------------------

I tend to think of quota as just another interface to marking resources as reserved, but I understand there are some differences, yes.
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456099#comment-15456099 ]

Michael Gummelt commented on MESOS-6112:
----------------------------------------

That's a fine workaround, yeah. I can solve my immediate problem. The purpose of this JIRA is more to make this kind of cooperative scheduling unnecessary.

> Frameworks are starved when > 5 are run concurrently
> ----------------------------------------------------
>
>                 Key: MESOS-6112
>                 URL: https://issues.apache.org/jira/browse/MESOS-6112
>             Project: Mesos
>          Issue Type: Task
>          Components: allocation, master
>    Affects Versions: 1.0.1
>            Reporter: Michael Gummelt
>
> As I understand it, the master will send an offer to a list of frameworks ordered by DRF, until the offer is accepted. There is a 1s wait time between each offering. Once the decline timeout for the first framework has been reached, rather than continuing to submit the offer to the rest of the frameworks in the list, the master starts over at the beginning, starving the rest of the frameworks.
>
> This means that in order for Mesos to support > 5 concurrent frameworks, all frameworks must be good citizens and set their decline timeout to something large or suppress offers. I think this is a fairly undesirable state of things.
>
> I propose that the master instead continue to submit the offer to every registered framework, even if the declineOffer timeout has been reached. The potential increase in task startup latency that could be introduced by this change can be obviated in part if we also make the master smarter about how long to wait between successive offers, rather than a static 1s.
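The offer cycle described in the issue can be sketched as a toy simulation (an illustrative model, not Mesos allocator code): the master makes one offer per 1s tick down the DRF-ordered framework list, restarting from the head whenever the head framework's decline timeout expires. With a 5s timeout, frameworks past the fifth never see an offer:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the starvation described in MESOS-6112. Each tick (~1s) the
// master offers to the next framework in DRF order; when the head framework's
// decline timeout elapses, the master restarts from the top of the list.
public class OfferStarvationModel {
    // Returns the set of framework indices that ever received an offer.
    static Set<Integer> frameworksOffered(int numFrameworks, int declineTimeoutTicks, int totalTicks) {
        Set<Integer> offered = new HashSet<>();
        int next = 0;          // next framework in DRF order to offer to
        int sinceRestart = 0;  // ticks since we restarted at the head
        for (int t = 0; t < totalTicks; t++) {
            if (sinceRestart >= declineTimeoutTicks) { // head's decline expired
                next = 0;
                sinceRestart = 0;
            }
            if (next < numFrameworks) {
                offered.add(next);
                next++;
            }
            sinceRestart++;
        }
        return offered;
    }

    public static void main(String[] args) {
        // 10 frameworks, 5s decline timeout, one offer per second for 60s:
        // only frameworks 0..4 ever see an offer; 5..9 are starved.
        System.out.println(frameworksOffered(10, 5, 60));
    }
}
```

Under this model the proposed fix (keep walking the list past the expired decline timeout) corresponds to simply not resetting `next` to 0, after which every framework is eventually offered.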
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453107#comment-15453107 ]

Michael Gummelt commented on MESOS-6112:
----------------------------------------

It is a duplicate, yes, but the quota workaround doesn't work for all cases (including mine). If I statically partition my starved framework so that it always has enough resources, this prevents other frameworks from using the slack, which is undesirable. I just don't want a framework ranked high on DRF to starve that framework even when it's *not* using those resources.
[jira] [Updated] (MESOS-6113) Offer reserved resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Gummelt updated MESOS-6113:
-----------------------------------
    Description:
*Goal:*
I have high-priority Spark jobs, and best-effort jobs. I need my high-priority jobs to preempt my best-effort jobs, so I'd like to launch the best-effort jobs on revocable resources.

*Problem:*
Revocable resources are currently only created via oversubscription, where resources allocated to but not used by a framework will be offered to other frameworks. This doesn't support the ability for a high-pri framework to start up and preempt a low-pri framework.

*Solution:*
Let's allow quota (and ideally any reserved resources) to be configurable to be offered as revocable resources to other frameworks that don't register with the role.

  was:
*Goal:*
I have high-priority Spark jobs, and best-effort jobs. I need my high-priority jobs to preempt my best-effort jobs, so I'd like to launch the best-effort jobs on revocable resources.

*Problem:*
Revocable resources are currently only created via oversubscription, where resources allocated to but not used by a framework will be offered to other frameworks. This doesn't support the ability for a high-pri framework to start up and preempt a low-pri framework.

*Solution:*
Let's allow quota to be configured to be offered as revocable resources to other frameworks that don't register with the role.
[jira] [Updated] (MESOS-6113) Offer reserved resources as revocable
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Gummelt updated MESOS-6113:
-----------------------------------
    Summary: Offer reserved resources as revocable  (was: Revocable resources for quota)
[jira] [Commented] (MESOS-6113) Revocable resources for quota
[ https://issues.apache.org/jira/browse/MESOS-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453042#comment-15453042 ]

Michael Gummelt commented on MESOS-6113:
----------------------------------------

cc [~clambert]
[jira] [Created] (MESOS-6113) Revocable resources for quota
Michael Gummelt created MESOS-6113:
--------------------------------------

             Summary: Revocable resources for quota
                 Key: MESOS-6113
                 URL: https://issues.apache.org/jira/browse/MESOS-6113
             Project: Mesos
          Issue Type: Task
          Components: allocation
    Affects Versions: 1.0.1
            Reporter: Michael Gummelt

*Goal:*
I have high-priority Spark jobs, and best-effort jobs. I need my high-priority jobs to preempt my best-effort jobs, so I'd like to launch the best-effort jobs on revocable resources.

*Problem:*
Revocable resources are currently only created via oversubscription, where resources allocated to but not used by a framework will be offered to other frameworks. This doesn't support the ability for a high-pri framework to start up and preempt a low-pri framework.

*Solution:*
Let's allow quota to be configured to be offered as revocable resources to other frameworks that don't register with the role.
[jira] [Comment Edited] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452826#comment-15452826 ]

Michael Gummelt edited comment on MESOS-6112 at 8/31/16 5:38 PM:
-----------------------------------------------------------------

{quote}
When a framework declines an offer for 5s it says "I don't need these particular resources for the next 5s".
{quote}

Sort of. My scheduler (e.g. Kafka) is really saying "I don't need these particular resources right now. I don't know when I may need them in the future. Here's a timeout that represents some tradeoff I've determined between latency and good citizenship (fairness)."

{quote}
Or, even better, call suppressOffers()? Is it hard to understand / implement?
{quote}

I can call {{suppressOffers()}}. You're right, it's not that hard. But it only partially solves the problem. There will still exist practically unbounded periods of time when I can't suppress. For example, when one of my data nodes fails, I'll try to wait until its persistent volume is offered back to me.

But the larger issue is that solutions such as this require all frameworks to be good citizens, which is brittle and unscalable.

was (Author: mgummelt):
> When a framework declines an offer for 5s it says "I don't need these particular resources for the next 5s".

Sort of. My scheduler (e.g. Kafka) is really saying "I don't need these particular resources right now. I don't know when I may need them in the future. Here's a timeout that represents some tradeoff I've determined between latency and good citizenship (fairness)."

> Or, even better, call suppressOffers()? Is it hard to understand / implement?

I can call {{suppressOffers()}}. You're right, it's not that hard. But it only partially solves the problem. There will still exist practically unbounded periods of time when I can't suppress. For example, when one of my data nodes fails, I'll try to wait until its persistent volume is offered back to me.

But the larger issue is that solutions such as this require all frameworks to be good citizens, which is brittle and unscalable.
[jira] [Issue Comment Deleted] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Gummelt updated MESOS-6112:
-----------------------------------
    Comment: was deleted

(was:
> When a framework declines an offer for 5s it says "I don't need these particular resources for the next 5s".

Sort of. My scheduler (e.g. Kafka) is really saying "I don't need these particular resources right now. I don't know when I may need them in the future. Here's a timeout that represents some tradeoff I've determined between latency and good citizenship (fairness)."

> Or, even better, call suppressOffers()? Is it hard to understand / implement?

I can call {{suppressOffers()}}. You're right, it's not that hard. But it only partially solves the problem. There will still exist practically unbounded periods of time when I can't suppress. For example, when one of my data nodes fails, I'll try to wait until its persistent volume is offered back to me.

But the larger issue is that solutions such as this require all frameworks to be good citizens, which is brittle and unscalable.
)
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452827#comment-15452827 ] Michael Gummelt commented on MESOS-6112: > When a framework declines an offer for 5s it says "I don't need these > particular resources for the next 5s". Sort of. My scheduler (e.g. Kafka) is really saying "I don't need these particular resources right now. I don't know when I may need them in the future. Here's a timeout that represents some tradeoff I've determined between latency and good citizenship (fairness)." > Or, even better, call suppressOffers()? Is it hard to understand / implement? I can call {{suppressOffers()}}. You're right, it's not that hard. But it only partially solves the problem. There will still exist practically unbounded periods of time when I can't suppress. For example, when one of my data nodes fails, I'll try to wait until its persistent volume is offered back to me. But the larger issue is that solutions such as this require all frameworks to be good citizens, which is brittle and unscalable. > Frameworks are starved when > 5 are run concurrently > > > Key: MESOS-6112 > URL: https://issues.apache.org/jira/browse/MESOS-6112 > Project: Mesos > Issue Type: Task > Components: allocation, master >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > As I understand it, the master will send an offer to a list of frameworks > ordered by DRF, until the offer is accepted. There is a 1s wait time between > each offering. Once the decline timeout for the first framework has been > reached, rather than continuing to submit the offer to the rest of the > frameworks in the list, the master starts over at the beginning, starving the > rest of the frameworks. > This means that in order for Mesos to support > 5 concurrent frameworks, all > frameworks must be good citizens and set their decline timeout to something > large or suppress offers. I think this is a fairly undesirable state of > things. 
> I propose that the master instead continues to submit the offer to every > registered framework, even if the declineOffer timeout has been reached. > The potential increase in task startup latency that could be introduced by > this change can be obviated in part if we also make the master smarter about > how long to wait between successive offers, rather than a static 1s. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
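The starvation mechanism described above (a 1s gap between successive offerings, with the cycle restarting from the top of the DRF order once the 5s default decline timeout expires) can be sketched as a toy simulation. This is plain Python for illustration only, not the actual allocator code; the timings and ordering are taken from the issue description:

```python
# Toy model of the offer cycle described in this issue (NOT the actual
# allocator code). A single offer is presented to frameworks in DRF
# order with a 1s gap between offerings; when the first framework's
# decline timeout (default 5s) expires, the master restarts from the
# top of the list instead of continuing down it.

def frameworks_reached(num_frameworks, decline_timeout_s=5, gap_s=1):
    """Indices of frameworks that ever see the offer.

    Every cycle restarts from the top of the DRF order, so each cycle
    is identical and simulating a single cycle is enough.
    """
    reached = set()
    elapsed = 0
    for i in range(num_frameworks):
        if elapsed >= decline_timeout_s:
            break  # decline timeout expired; the cycle restarts
        reached.add(i)
        elapsed += gap_s  # 1s gap before offering to the next framework
    return reached

print(sorted(frameworks_reached(8)))  # [0, 1, 2, 3, 4] -- frameworks 5-7 are starved
```

With the default timings, only the first five frameworks in DRF order ever receive an offer, which matches the "> 5" threshold in the issue title.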
[jira] [Updated] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-6112: --- Component/s: allocation > Frameworks are starved when > 5 are run concurrently > > > Key: MESOS-6112 > URL: https://issues.apache.org/jira/browse/MESOS-6112 > Project: Mesos > Issue Type: Task > Components: allocation, master >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > As I understand it, the master will send an offer to a list of frameworks > ordered by DRF, until the offer is accepted. There is a 1s wait time between > each offering. Once the decline timeout for the first framework has been > reached, rather than continuing to submit the offer to the rest of the > frameworks in the list, the master starts over at the beginning, starving the > rest of the frameworks. > This means that in order for Mesos to support > 5 concurrent frameworks, all > frameworks must be good citizens and set their decline timeout to something > large or suppress offers. I think this is a fairly undesirable state of > things. > I propose that the master instead continues to submit the offer to every > registered framework, even if the declineOffer timeout has been reached. > The potential increase in task startup latency that could be introduced by > this change can be obviated in part if we also make the master smarter about > how long to wait between successive offers, rather than a static 1s. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
[ https://issues.apache.org/jira/browse/MESOS-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450578#comment-15450578 ] Michael Gummelt commented on MESOS-6112: cc [~gabriel.hartm...@gmail.com] [~clambert] > Frameworks are starved when > 5 are run concurrently > > > Key: MESOS-6112 > URL: https://issues.apache.org/jira/browse/MESOS-6112 > Project: Mesos > Issue Type: Task > Components: master >Affects Versions: 1.0.1 >Reporter: Michael Gummelt > > As I understand it, the master will send an offer to a list of frameworks > ordered by DRF, until the offer is accepted. There is a 1s wait time between > each offering. Once the decline timeout for the first framework has been > reached, rather than continuing to submit the offer to the rest of the > frameworks in the list, the master starts over at the beginning, starving the > rest of the frameworks. > This means that in order for Mesos to support > 5 concurrent frameworks, all > frameworks must be good citizens and set their decline timeout to something > large or suppress offers. I think this is a fairly undesirable state of > things. > I propose that the master instead continues to submit the offer to every > registered framework, even if the declineOffer timeout has been reached. > The potential increase in task startup latency that could be introduced by > this change can be obviated in part if we also make the master smarter about > how long to wait between successive offers, rather than a static 1s. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6112) Frameworks are starved when > 5 are run concurrently
Michael Gummelt created MESOS-6112: -- Summary: Frameworks are starved when > 5 are run concurrently Key: MESOS-6112 URL: https://issues.apache.org/jira/browse/MESOS-6112 Project: Mesos Issue Type: Task Components: master Affects Versions: 1.0.1 Reporter: Michael Gummelt As I understand it, the master will send an offer to a list of frameworks ordered by DRF, until the offer is accepted. There is a 1s wait time between each offering. Once the decline timeout for the first framework has been reached, rather than continuing to submit the offer to the rest of the frameworks in the list, the master starts over at the beginning, starving the rest of the frameworks. This means that in order for Mesos to support > 5 concurrent frameworks, all frameworks must be good citizens and set their decline timeout to something large or suppress offers. I think this is a fairly undesirable state of things. I propose that the master instead continues to submit the offer to every registered framework, even if the declineOffer timeout has been reached. The potential increase in task startup latency that could be introduced by this change can be obviated in part if we also make the master smarter about how long to wait between successive offers, rather than a static 1s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6111) Offer cycle is undocumented
Michael Gummelt created MESOS-6111: -- Summary: Offer cycle is undocumented Key: MESOS-6111 URL: https://issues.apache.org/jira/browse/MESOS-6111 Project: Mesos Issue Type: Task Components: documentation Affects Versions: 1.0.1 Reporter: Michael Gummelt cc [~neilc] AFAICT, the "offer cycle" in Mesos is undocumented. As it has been explained to me, the master will send an offer to successive frameworks ordered by DRF, with a 1s gap between each offer. And when the decline timeout (default 5s) is reached, it will start over at the beginning of the list. This means that, by default, all frameworks other than the first 5 in DRF ordering will be starved. I'm going to submit a separate JIRA with a proposal to fix this, but at the very least, we should document the above behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6030) Offer API
Michael Gummelt created MESOS-6030: -- Summary: Offer API Key: MESOS-6030 URL: https://issues.apache.org/jira/browse/MESOS-6030 Project: Mesos Issue Type: Improvement Components: master Affects Versions: 1.0.0 Reporter: Michael Gummelt It's often difficult to debug a framework without knowing what it's being offered. The scheduler can log the offers, but not all schedulers do so, and it's often behind a verbose logging option that can be difficult to enable in certain environments. It would be much more helpful if Mesos offered an API for clients to view recent offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5998) FINISHED task shown as Active in the UI
[ https://issues.apache.org/jira/browse/MESOS-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410056#comment-15410056 ] Michael Gummelt commented on MESOS-5998: http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png > FINISHED task shown as Active in the UI > --- > > Key: MESOS-5998 > URL: https://issues.apache.org/jira/browse/MESOS-5998 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-5998) FINISHED task shown as Active in the UI
[ https://issues.apache.org/jira/browse/MESOS-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-5998: --- Comment: was deleted (was: http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png) > FINISHED task shown as Active in the UI > --- > > Key: MESOS-5998 > URL: https://issues.apache.org/jira/browse/MESOS-5998 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > > http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5998) FINISHED task shown as Active in the UI
[ https://issues.apache.org/jira/browse/MESOS-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410057#comment-15410057 ] Michael Gummelt commented on MESOS-5998: Can I add attachments to this JIRA? I don't see how. > FINISHED task shown as Active in the UI > --- > > Key: MESOS-5998 > URL: https://issues.apache.org/jira/browse/MESOS-5998 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > > http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5998) FINISHED task shown as Active in the UI
[ https://issues.apache.org/jira/browse/MESOS-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-5998: --- Description: http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png > FINISHED task shown as Active in the UI > --- > > Key: MESOS-5998 > URL: https://issues.apache.org/jira/browse/MESOS-5998 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > > http://mgummelt-mesos.s3.amazonaws.com/ui_screenshot.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5998) FINISHED task shown as Active in the UI
Michael Gummelt created MESOS-5998: -- Summary: FINISHED task shown as Active in the UI Key: MESOS-5998 URL: https://issues.apache.org/jira/browse/MESOS-5998 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 1.0.0 Reporter: Michael Gummelt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5971) Better handling for docker credentials
[ https://issues.apache.org/jira/browse/MESOS-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-5971: --- Component/s: docker containerization > Better handling for docker credentials > -- > > Key: MESOS-5971 > URL: https://issues.apache.org/jira/browse/MESOS-5971 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > > Users often want to run Spark jobs in custom docker images that reside in > private registries. We can adapt the marathon approach of passing docker > configs as fetcher URIs: > https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html > But this is a hack. It would be nice if I could configure a mesos agent with > docker credentials beforehand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5971) Better handling for docker credentials
Michael Gummelt created MESOS-5971: -- Summary: Better handling for docker credentials Key: MESOS-5971 URL: https://issues.apache.org/jira/browse/MESOS-5971 Project: Mesos Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Michael Gummelt Users often want to run Spark jobs in custom docker images that reside in private registries. We can adapt the marathon approach of passing docker configs as fetcher URIs: https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html But this is a hack. It would be nice if I could configure a mesos agent with docker credentials beforehand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5866) MESOS_DIRECTORY set to a host path when using a docker image w/ unified containerizer
Michael Gummelt created MESOS-5866: -- Summary: MESOS_DIRECTORY set to a host path when using a docker image w/ unified containerizer Key: MESOS-5866 URL: https://issues.apache.org/jira/browse/MESOS-5866 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.28.2 Reporter: Michael Gummelt Running Spark with the unified containerizer, it fails with: {code} 16/07/19 21:03:09 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:36) failed in Unknown s due to Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /var/lib/mesos/slave/slaves/003ebcc2-64e2-488f-87b9-f6fa7630c01b-S0/frameworks/003ebcc2-64e2-488f-87b9-f6fa7630c01b-0001/executors/driver-20160719210109-0002/runs/8f21b32e-b929-4369-bce9-9f49a3a8844f/blockmgr-e3a611d4-e0de-48cb-b17a-1e41d97e84c2/11. {code} This is because MESOS_DIRECTORY is set to /var/lib/mesos/, which is a host path. The container can't see the host path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5865) MESOS_DIRECTORY is not set in the docker containerizer
Michael Gummelt created MESOS-5865: -- Summary: MESOS_DIRECTORY is not set in the docker containerizer Key: MESOS-5865 URL: https://issues.apache.org/jira/browse/MESOS-5865 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.28.2 Reporter: Michael Gummelt I'm running Spark with the docker containerizer. It sets MESOS_SANDBOX, but not MESOS_DIRECTORY. The docs indicate that MESOS_DIRECTORY should be set: https://github.com/apache/mesos/blob/2127376b8e092684312ec9843173b532df931d20/docs/executor-http-api.md#executor-environment-variables It would be preferable for there to be just one env var containing the sandbox location, independent of containerizer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5785) Port documentation mistakes - ephemeral ports
Michael Gummelt created MESOS-5785: -- Summary: Port documentation mistakes - ephemeral ports Key: MESOS-5785 URL: https://issues.apache.org/jira/browse/MESOS-5785 Project: Mesos Issue Type: Bug Reporter: Michael Gummelt The docs here: http://mesos.apache.org/documentation/latest/attributes-resources/ should probably recommend that users not configure their agents to offer ports in the ephemeral port range (32768+: https://en.wikipedia.org/wiki/Ephemeral_port). We avoid this in DC/OS, for example. The example includes ports offered in this range, so we should fix that. Further, the docs state that ports have "pre-defined behavior", but they don't state what this is, and I'm not clear myself on what it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
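As a rough illustration of the overlap being warned about, the check below uses the common Linux default ephemeral range (32768-60999, per /proc/sys/net/ipv4/ip_local_port_range on recent kernels). The exact range varies by OS and configuration, and the sample port ranges here are illustrative only, not a claim about what the docs page contains:

```python
# Sketch: flag a Mesos agent "ports" resource range that strays into
# the ephemeral port range. 32768-60999 is the common Linux default
# (/proc/sys/net/ipv4/ip_local_port_range); it varies by OS and config.

EPHEMERAL = (32768, 60999)

def overlaps_ephemeral(lo, hi, ephemeral=EPHEMERAL):
    """True if the offered port range [lo, hi] intersects the ephemeral range."""
    return lo <= ephemeral[1] and hi >= ephemeral[0]

# Illustrative ranges only:
print(overlaps_ephemeral(21000, 24000))  # False -- safely below 32768
print(overlaps_ephemeral(30000, 34000))  # True  -- crosses into 32768+
```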
[jira] [Commented] (MESOS-5754) CommandInfo.user not honored in docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358002#comment-15358002 ] Michael Gummelt commented on MESOS-5754: > The workaround is to specify a CLI parameter: Assuming you're launching through marathon, yes > CommandInfo.user not honored in docker containerizer > > > Key: MESOS-5754 > URL: https://issues.apache.org/jira/browse/MESOS-5754 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Michael Gummelt > > Repro by creating a framework that starts a task with CommandInfo.user set, > and observe that the dockerized executor is still running as the default > (e.g. root). > cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5754) CommandInfo.user not honored in docker containerizer
Michael Gummelt created MESOS-5754: -- Summary: CommandInfo.user not honored in docker containerizer Key: MESOS-5754 URL: https://issues.apache.org/jira/browse/MESOS-5754 Project: Mesos Issue Type: Bug Affects Versions: 1.0.0 Reporter: Michael Gummelt Repro by creating a framework that starts a task with CommandInfo.user set, and observe that the dockerized executor is still running as the default (e.g. root). cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3220) Offer ability to kill tasks from the API
[ https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276822#comment-15276822 ] Michael Gummelt commented on MESOS-3220: +1. I'm implementing this behavior in Spark. It would be more efficient if mesos offered it, so we wouldn't have to reimplement at the framework level. > Offer ability to kill tasks from the API > > > Key: MESOS-3220 > URL: https://issues.apache.org/jira/browse/MESOS-3220 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Sunil Shah > Labels: mesosphere > > We are investigating adding a {{dcos task kill}} command to our DCOS (and > Mesos) command line interface. Currently the ability to kill tasks is only > offered via the scheduler API so it would be useful to have some ability to > kill tasks directly. > This would complement the Maintenance Primitives, in that it would enable the > operator to terminate those tasks which, for whatever reasons, do not respond > to Inverse Offers events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5197) Log executor commands w/o verbose logs enabled
[ https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273184#comment-15273184 ] Michael Gummelt commented on MESOS-5197: [~kaysoky] How can we solve this problem? I'm working with yet another customer where this info would be invaluable. Generally "My command is failing. What was the command?" Is a very common scenario. > Log executor commands w/o verbose logs enabled > -- > > Key: MESOS-5197 > URL: https://issues.apache.org/jira/browse/MESOS-5197 > Project: Mesos > Issue Type: Task >Reporter: Michael Gummelt >Assignee: Yong Tang > Labels: mesosphere > > To debug executors, it's often necessary to know the command that ran the > executor. For example, when Spark executors fail, I'd like to know the > command used to invoke the executor (Spark uses the command executor in a > docker container). Currently, it's only output if GLOG_v is enabled, but I > don't think this should be a "verbose" output. It's a common debugging need. > https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677 > cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5197) Log executor commands w/o verbose logs enabled
[ https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245949#comment-15245949 ] Michael Gummelt commented on MESOS-5197: How can we make it so the commands are printed? > Log executor commands w/o verbose logs enabled > -- > > Key: MESOS-5197 > URL: https://issues.apache.org/jira/browse/MESOS-5197 > Project: Mesos > Issue Type: Task >Reporter: Michael Gummelt >Assignee: Yong Tang > Labels: mesosphere > > To debug executors, it's often necessary to know the command that ran the > executor. For example, when Spark executors fail, I'd like to know the > command used to invoke the executor (Spark uses the command executor in a > docker container). Currently, it's only output if GLOG_v is enabled, but I > don't think this should be a "verbose" output. It's a common debugging need. > https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677 > cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-5197) Log executor commands w/o verbose logs enabled
[ https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-5197: --- Comment: was deleted (was: How can we make it so the commands are printed?) > Log executor commands w/o verbose logs enabled > -- > > Key: MESOS-5197 > URL: https://issues.apache.org/jira/browse/MESOS-5197 > Project: Mesos > Issue Type: Task >Reporter: Michael Gummelt >Assignee: Yong Tang > Labels: mesosphere > > To debug executors, it's often necessary to know the command that ran the > executor. For example, when Spark executors fail, I'd like to know the > command used to invoke the executor (Spark uses the command executor in a > docker container). Currently, it's only output if GLOG_v is enabled, but I > don't think this should be a "verbose" output. It's a common debugging need. > https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677 > cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5198) state.json incorrectly serves an empty {{executors}} field
Michael Gummelt created MESOS-5198: -- Summary: state.json incorrectly serves an empty {{executors}} field Key: MESOS-5198 URL: https://issues.apache.org/jira/browse/MESOS-5198 Project: Mesos Issue Type: Bug Affects Versions: 0.28.1 Reporter: Michael Gummelt The {{frameworks.executors}} array in {{state.json}} is empty, despite the framework having running tasks. I believe this is incorrect, since you can't have tasks w/o an executor. Perhaps the intended meaning is "custom executors", but I think we should serve info for all executors run by the framework, including command executors. I often need to look up, for example, which command is run by the command executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5197) Log executor commands w/o verbose logs enabled
Michael Gummelt created MESOS-5197: -- Summary: Log executor commands w/o verbose logs enabled Key: MESOS-5197 URL: https://issues.apache.org/jira/browse/MESOS-5197 Project: Mesos Issue Type: Task Reporter: Michael Gummelt To debug executors, it's often necessary to know the command that ran the executor. For example, when Spark executors fail, I'd like to know the command used to invoke the executor (Spark uses the command executor in a docker container). Currently, it's only output if GLOG_v is enabled, but I don't think this should be a "verbose" output. It's a common debugging need. https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677 cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5197) Log executor commands w/o verbose logs enabled
[ https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-5197: --- Labels: mesosphere (was: ) > Log executor commands w/o verbose logs enabled > -- > > Key: MESOS-5197 > URL: https://issues.apache.org/jira/browse/MESOS-5197 > Project: Mesos > Issue Type: Task >Reporter: Michael Gummelt > Labels: mesosphere > > To debug executors, it's often necessary to know the command that ran the > executor. For example, when Spark executors fail, I'd like to know the > command used to invoke the executor (Spark uses the command executor in a > docker container). Currently, it's only output if GLOG_v is enabled, but I > don't think this should be a "verbose" output. It's a common debugging need. > https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677 > cc [~kaysoky] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4769) Update state endpoints to allow clients to determine how many resources for a given role have been used
[ https://issues.apache.org/jira/browse/MESOS-4769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4769: --- Labels: mesosphere (was: ) > Update state endpoints to allow clients to determine how many resources for a > given role have been used > --- > > Key: MESOS-4769 > URL: https://issues.apache.org/jira/browse/MESOS-4769 > Project: Mesos > Issue Type: Task >Affects Versions: 0.27.1 >Reporter: Michael Gummelt > Labels: mesosphere > > AFAICT, this is currently impossible. Say I have a cluster with 4CPUs > reserved for {{spark}} and 4CPUs unreserved, I have a framework registered as > {{spark}}, and I would like to determine how many CPUs reserved for {{Spark}} > have been used. AFAIK, there are two endpoints with interesting information: > {{/master/state}} and {{/master/roles}}. Both endpoints tell me how many > resources are used by the framework registered as {{spark}}, but it doesn't > tell me which role those resources belong to (i.e. are they reserved or > unreserved). > A simple fix would be to update {{/master/roles}} to split out resources into > "reserved" and "unreserved". However, this will fail to solve the problem if > (and hopefully when) Mesos supports multi-role frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4769) Update state endpoints to allow clients to determine how many resources for a given role have been used
Michael Gummelt created MESOS-4769: -- Summary: Update state endpoints to allow clients to determine how many resources for a given role have been used Key: MESOS-4769 URL: https://issues.apache.org/jira/browse/MESOS-4769 Project: Mesos Issue Type: Task Affects Versions: 0.27.1 Reporter: Michael Gummelt AFAICT, this is currently impossible. Say I have a cluster with 4CPUs reserved for {{spark}} and 4CPUs unreserved, I have a framework registered as {{spark}}, and I would like to determine how many CPUs reserved for {{Spark}} have been used. AFAIK, there are two endpoints with interesting information: {{/master/state}} and {{/master/roles}}. Both endpoints tell me how many resources are used by the framework registered as {{spark}}, but it doesn't tell me which role those resources belong to (i.e. are they reserved or unreserved). A simple fix would be to update {{/master/roles}} to split out resources into "reserved" and "unreserved". However, this will fail to solve the problem if (and hopefully when) Mesos supports multi-role frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
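For concreteness, here is the computation the report wants to make possible, written over a hypothetical response shape in which usage is split by reservation. This shape does not exist today; the actual /master/state and /master/roles endpoints report only combined usage per framework:

```python
# Hypothetical shape for a /master/roles entry in which "used" is split
# by reservation -- the split this report asks for. Today the endpoints
# report only combined usage, so this cannot currently be computed.

role = {
    "name": "spark",
    "reserved": {"cpus": 4.0},         # statically reserved for the role
    "used_reserved": {"cpus": 3.0},    # hypothetical field
    "used_unreserved": {"cpus": 1.0},  # hypothetical field
}

def reserved_cpus_remaining(role):
    """Reserved CPUs that the role's frameworks are not currently using."""
    return role["reserved"]["cpus"] - role["used_reserved"]["cpus"]

print(reserved_cpus_remaining(role))  # 1.0
```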
[jira] [Updated] (MESOS-4751) Convenient API for getting free resources by role
[ https://issues.apache.org/jira/browse/MESOS-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4751: --- Priority: Minor (was: Major) > Convenient API for getting free resources by role > - > > Key: MESOS-4751 > URL: https://issues.apache.org/jira/browse/MESOS-4751 > Project: Mesos > Issue Type: Task > Components: json api >Reporter: Michael Gummelt >Priority: Minor > > /master/roles provides allocation by role, but it doesn't provide the total > resources assigned to each role, so I can't compute the remaining resources. > It seems natural that this endpoint should also include the total assigned to > each role. > Also, please consider normalizing the data in `state.json`. e.g.: > {code:javascript} > "resources": [ > { > "cpus" > "disk" > "mem" > "role" > "used" > } > ] > {code} > It would make it easier to support arbitrary queries if the data were > normalized as such. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4751) Convenient API for getting free resources by role
[ https://issues.apache.org/jira/browse/MESOS-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4751: --- Description: /master/roles provides allocation by role, but it doesn't provide the total resources assigned to each role, so I can't compute the remaining resources. It seems natural that this endpoint should also include the total assigned to each role. Also, please consider normalizing the data in `state.json`. e.g.: {code:javascript} "resources": [ { "cpus" "disk" "mem" "role" "used" } ] {code} It would make it easier to support arbitrary queries if the data were normalized as such. was: /master/roles provides allocation by role, but it doesn't provide the total resources assigned to each role, so I can't compute the remaining resources. It seems natural that this endpoint should also include the total assigned to each role. Also, please consider normalizing the data in `state.json`. e.g.: {{ "resources": [ { "cpus" "disk" "mem" "role" "used" } ] }} It would make it easier to support arbitrary queries if the data were normalized as such. > Convenient API for getting free resources by role > - > > Key: MESOS-4751 > URL: https://issues.apache.org/jira/browse/MESOS-4751 > Project: Mesos > Issue Type: Task > Components: json api >Reporter: Michael Gummelt > > /master/roles provides allocation by role, but it doesn't provide the total > resources assigned to each role, so I can't compute the remaining resources. > It seems natural that this endpoint should also include the total assigned to > each role. > Also, please consider normalizing the data in `state.json`. e.g.: > {code:javascript} > "resources": [ > { > "cpus" > "disk" > "mem" > "role" > "used" > } > ] > {code} > It would make it easier to support arbitrary queries if the data were > normalized as such. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
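To show why the normalized layout helps with arbitrary queries, a short sketch over the proposed "resources" shape (hypothetical: state.json does not serve this layout today, and the field names and numbers are made up for illustration):

```python
# Sketch over the normalized "resources" layout proposed in the
# description (hypothetical -- state.json does not serve this today).
# With one flat entry per role, per-role queries become one-liners.

resources = [
    {"role": "spark", "cpus": 4, "mem": 8192, "disk": 100,
     "used": {"cpus": 3, "mem": 4096, "disk": 10}},
    {"role": "*", "cpus": 4, "mem": 8192, "disk": 100,
     "used": {"cpus": 1, "mem": 1024, "disk": 0}},
]

def free(entry, kind):
    """Remaining amount of one resource kind for one role entry."""
    return entry[kind] - entry["used"][kind]

free_cpus = {e["role"]: free(e, "cpus") for e in resources}
print(free_cpus)  # {'spark': 1, '*': 3}
```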
[jira] [Commented] (MESOS-4698) "Composing" containerizer docs are confusing
[ https://issues.apache.org/jira/browse/MESOS-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152885#comment-15152885 ] Michael Gummelt commented on MESOS-4698: To "compose" means to combine two things to form something else. This containerizer isn't doing that. It's using EITHER the mesos or the docker containerizer. That's not composition. Even if we can't agree on the definition of the word, just as evidence that it's confusing, both my peer at Typesafe and I independently interpreted "composition" to mean something like nesting. Also, it's inconsistent to list it in the docs as a containerizer type, then not include it in the list of `--containerizer` options. > "Composing" containerizer docs are confusing > > > Key: MESOS-4698 > URL: https://issues.apache.org/jira/browse/MESOS-4698 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Michael Gummelt > Labels: mesosphere > > Both my peer at Typesafe and I have found the containerizer docs confusing > (the 'Composing Containerizer' part): > https://github.com/apache/mesos/blob/master/docs/containerizer.md > "composing" suggests that I can launch tasks in nested containers. > Also, the structure of the docs suggests that there's a third container type > called "composing", which is not true, or it's at least not exposed in the > UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4698) "Composing" containerizer docs are confusing
Michael Gummelt created MESOS-4698: -- Summary: "Composing" containerizer docs are confusing Key: MESOS-4698 URL: https://issues.apache.org/jira/browse/MESOS-4698 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Michael Gummelt Both my peer at Typesafe and I have found the docs confusing. "composing" suggests that I can launch tasks in nested containers. Also, the structure of the docs suggests that there's a third container type called "composing", which is not true, or it's at least not exposed in the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4582) state.json serving duplicate "active" fields
[ https://issues.apache.org/jira/browse/MESOS-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4582: --- Attachment: error.json > state.json serving duplicate "active" fields > > > Key: MESOS-4582 > URL: https://issues.apache.org/jira/browse/MESOS-4582 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27 >Reporter: Michael Gummelt > Attachments: error.json > > > state.json is serving duplicate "active" fields in frameworks. See the > framework "47df96c2-3f85-4bc5-b781-709b2c30c752-" in the attached file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4582) state.json serving duplicate "active" fields
Michael Gummelt created MESOS-4582: -- Summary: state.json serving duplicate "active" fields Key: MESOS-4582 URL: https://issues.apache.org/jira/browse/MESOS-4582 Project: Mesos Issue Type: Bug Affects Versions: 0.27 Reporter: Michael Gummelt Attachments: error.json state.json is serving duplicate "active" fields in frameworks. See the framework "47df96c2-3f85-4bc5-b781-709b2c30c752-" in the attached file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
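A quick illustration of why duplicate keys in served JSON are dangerous: most parsers, including Python's json module, silently keep the last value, so one of the two "active" values is lost unless the caller parses with a pairs hook. The fragment below is made up to mimic the report; it is not taken from the attached error.json.

```python
import json
from collections import Counter

# Made-up fragment mimicking the reported bug: "active" appears twice.
doc = '{"id": "47df96c2-3f85-4bc5-b781-709b2c30c752-", "active": true, "active": false}'

# Default parsing silently keeps the LAST duplicate.
print(json.loads(doc)["active"])  # False

def find_duplicate_keys(pairs):
    """object_pairs_hook that rejects objects with repeated keys."""
    dupes = [k for k, n in Counter(k for k, _ in pairs).items() if n > 1]
    if dupes:
        raise ValueError("duplicate keys: %s" % dupes)
    return dict(pairs)

# Parsing with the hook surfaces the duplication instead of hiding it.
try:
    json.loads(doc, object_pairs_hook=find_duplicate_keys)
except ValueError as e:
    print(e)  # duplicate keys: ['active']
```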
[jira] [Updated] (MESOS-4585) mesos-fetcher LIBPROCESS_PORT set to 5051 URI fetch failure
[ https://issues.apache.org/jira/browse/MESOS-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4585: --- Attachment: hdfs-stderr.log HDFS links are also failing. See attached log. > mesos-fetcher LIBPROCESS_PORT set to 5051 URI fetch failure > --- > > Key: MESOS-4585 > URL: https://issues.apache.org/jira/browse/MESOS-4585 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Drew Robb > Attachments: hdfs-stderr.log > > > When starting a task with a {{s3a://}} URI, the fetcher fails to download the > URI, failing when trying to bind to the slave's port 5051. The URI gets > successfully downloaded, but the error is fatal; the failure does not occur if the URI is changed to > {{http://}}. The root cause of this is that apparently the mesos-fetcher > process has {{LIBPROCESS_PORT=5051}} in its environment, as I was able to find > from {{cat "/proc/`pgrep mesos-fetcher`/environ"}}. > stderr from a failing task: > {quote} > I0203 00:11:55.815500 4964 fetcher.cpp:424] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ede0e5bc-d7ac-4b9a-8d35-b210fa785db0-S0","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":true,"value":"s3a:\/\/strava.mesos\/foo"}}],"sandbox_directory":"\/mnt\/mesos\/slaves\/ede0e5bc-d7ac-4b9a-8d35-b210fa785db0-S0\/frameworks\/fe927665-1516-46cf-94dd-6d2ca84007f1-\/executors\/uris-test.bc047306-ca0a-11e5-b742-e2162bf6108e\/runs\/24ebd807-b065-4776-a0bf-84bda4a82f01"} > I0203 00:11:55.816830 4964 fetcher.cpp:379] Fetching URI > 's3a://strava.mesos/foo' > I0203 00:11:55.816846 4964 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0203 00:11:55.816864 4964 fetcher.cpp:187] Fetching URI > 's3a://strava.mesos/foo' > I0203 00:11:56.191640 4964 fetcher.cpp:109] Downloading resource with Hadoop > client from 's3a://strava.mesos/foo' to > 
'/mnt/mesos/slaves/ede0e5bc-d7ac-4b9a-8d35-b210fa785db0-S0/frameworks/fe927665-1516-46cf-94dd-6d2ca84007f1-/executors/uris-test.bc047306-ca0a-11e5-b742-e2162bf6108e/runs/24ebd807-b065-4776-a0bf-84bda4a82f01/foo' > F0203 00:11:56.192503 4964 process.cpp:892] Failed to initialize: Failed to > bind on 0.0.0.0:5051: Address already in use: Address already in use [98] > *** Check failure stack trace: *** > @ 0x7f229ce50e7d google::LogMessage::Fail() > @ 0x7f229ce52c10 google::LogMessage::SendToLog() > @ 0x7f229ce50a42 google::LogMessage::Flush() > @ 0x7f229ce50c89 google::LogMessage::~LogMessage() > @ 0x7f229ce51c32 google::ErrnoLogMessage::~ErrnoLogMessage() > @ 0x7f229cdf16b9 process::initialize() > @ 0x7f229cdf2f36 process::ProcessBase::ProcessBase() > @ 0x7f229ce22875 process::reap() > @ 0x7f229ce2ced7 process::subprocess() > @ 0x7f229c50ab7b HDFS::copyToLocal() > @ 0x40f03e download() > @ 0x40b69f main > @ 0x7f229adc8a40 (unknown) > @ 0x40cf59 _start > Aborted (core dumped) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
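The crash above is the generic "two listeners, one port" failure: libprocess inside mesos-fetcher reads the inherited LIBPROCESS_PORT=5051 and tries to bind the port the agent already owns. A minimal sketch of the same failure mode with plain sockets (not libprocess):

```python
import socket
import errno

# First socket stands in for the agent, which already owns the port.
agent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
agent.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = agent.getsockname()[1]
agent.listen(1)

# Second socket stands in for mesos-fetcher: with LIBPROCESS_PORT
# inherited from the agent, libprocess attempts the same bind and gets
# "Failed to bind ... Address already in use", as in the log above.
fetcher = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    fetcher.bind(("127.0.0.1", port))
    bind_errno = None
except OSError as e:
    bind_errno = e.errno
finally:
    fetcher.close()
    agent.close()

print(bind_errno == errno.EADDRINUSE)  # True
```

The direction the ticket points at is for the agent to scrub or override LIBPROCESS_PORT in the environment it hands to mesos-fetcher, so libprocess falls back to an ephemeral port.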
[jira] [Commented] (MESOS-3866) The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors
[ https://issues.apache.org/jira/browse/MESOS-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068375#comment-15068375 ] Michael Gummelt commented on MESOS-3866: It's not a dupe. MESOS-3751 regards MESOS_NATIVE_JAVA_LIBRARY not being set when it should (in mesos). This regards it being set when it shouldn't (in docker). > The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors > --- > > Key: MESOS-3866 > URL: https://issues.apache.org/jira/browse/MESOS-3866 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.25.0 >Reporter: Michael Gummelt > > It's set here: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281 > And passed to the docker executor here: > https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844 > This leaks the host path of the library into the docker image, which of > course can't see it. This is breaking DCOS Spark, which runs in a docker > image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3866) The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors
Michael Gummelt created MESOS-3866: -- Summary: The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors Key: MESOS-3866 URL: https://issues.apache.org/jira/browse/MESOS-3866 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.25.0 Reporter: Michael Gummelt It's set here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281 And passed to the docker executor here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844 This leaks the host path of the library into the docker image, which of course can't see it. This is breaking Spark, which runs in a docker image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3866) The docker containerizer sets MESOS_NATIVE_JAVA_LIBRARY in docker executors
[ https://issues.apache.org/jira/browse/MESOS-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-3866: --- Description: It's set here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281 And passed to the docker executor here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844 This leaks the host path of the library into the docker image, which of course can't see it. This is breaking DCOS Spark, which runs in a docker image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY. was: It's set here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L281 And passed to the docker executor here: https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L844 This leaks the host path of the library into the docker image, which of course can't see it. This is breaking Spark, which runs in a docker image that has set its own value for MESOS_NATIVE_JAVA_LIBRARY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
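One way to avoid leaking host-only variables such as MESOS_NATIVE_JAVA_LIBRARY into a docker executor is to strip them from the child environment before launch. This is an illustrative sketch, not Mesos code: the variable list and the `container_env` helper are invented for the example.

```python
import os

# Host-side variables that are meaningless (or harmful) inside the
# container's isolated filesystem. Illustrative list, not Mesos's.
HOST_ONLY_VARS = {"MESOS_NATIVE_JAVA_LIBRARY", "LIBPROCESS_PORT"}

def container_env(extra=None):
    """Copy of the current environment minus host-only variables,
    with image-provided overrides applied last."""
    env = {k: v for k, v in os.environ.items() if k not in HOST_ONLY_VARS}
    env.update(extra or {})
    return env

os.environ["MESOS_NATIVE_JAVA_LIBRARY"] = "/usr/lib/libmesos.so"  # host path
env = container_env({"SPARK_HOME": "/opt/spark"})
print("MESOS_NATIVE_JAVA_LIBRARY" in env)  # False
```

Passing such a sanitized dict as `env=` to the docker launch keeps the host path from shadowing the value the image sets for itself, which is exactly what breaks DCOS Spark in the report.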
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996136#comment-14996136 ] Michael Gummelt commented on MESOS-3836: bq. Every marathon app task got every environment variable that mesos-slave had unless the marathon app definition explicitly overrode it. That's because marathon tasks run under the command executor. As I said, this is the only scenario where you can say with certainty that tasks inherit env vars from the host. bq. Executors in many ways are like Tasks and should be fully containerized like them I'm not sure what you mean by "fully" containerized, but tasks aren't fully isolated. In fact, you can't really say anything about tasks. It doesn't really even make sense to talk about env vars set on tasks, because tasks aren't necessarily even processes. All of this env var talk only applies to executors. We should be clear with terms. Definitional nitpicks aside, I do agree that we should head toward total host isolation, but let's focus on solving the immediate problem. > `--executor-environment-variables` may not apply to docker containers > - > > Key: MESOS-3836 > URL: https://issues.apache.org/jira/browse/MESOS-3836 > Project: Mesos > Issue Type: Bug > Components: containerization, slave >Affects Versions: 0.25.0 > Environment: Mesos 0.25.0 configured with > --executor-environment-variables >Reporter: Cody Maloney >Assignee: Marco Massenzio >Priority: Minor > Labels: mesosphere > > In our use case we set {{PATH}} as part of the > {{\-\-executor_environment_variables}} in order to limit what binaries all > tasks which are launched via Mesos have readily available to them, making it > much harder for people launching tasks on mesos to accidentally depend on > something which isn't part of the "guaranteed" environment / platform. > Docker containers can be used as executors, and have a fully isolated > filesystem. 
For executors which run in docker containers, setting {{PATH}} to > our path on the host filesystem may potentially break the docker container. > The previous code of only copying across environment variables when > {{includeOsEnvironment}} is set dealt with this > (https://github.com/apache/mesos/blob/56510afe149758a69a5a714dfaab16111dd0d9c3/src/slave/containerizer/containerizer.cpp#L267). > If {{includeOsEnvironment}} is set, then we should copy across the current > {{\-\-executor_environment_variables}}. If it isn't, then > {{\-\-executor_environment_variables}} shouldn't be used at all. > Another option which could be useful is to make it so that there are two sets > of "Executor Environment Variables": one for when {{includeOsEnvironment}} is > set, and one for when it is not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996101#comment-14996101 ] Michael Gummelt commented on MESOS-3836: It looks like the original goal of MESOS-2832, where {{--executor-environment-variables}} was introduced, was to replace the inherited host environment with a different environment, which would only apply to non-docker containers, since they're the only ones that inherit the host environment. However, as implemented, it's set on all executors. So the central question is whether we want to keep the functionality of setting env vars on all executors, or whether we want to revert to the original goal of replacing the inherited host environment, which would only apply to non-docker containers (mesos and external). [~tnachen]: I don't see how your proposal for a {{--docker-task-environment-variables}} flag solves the {{PATH}} problem. Adding more docker env vars doesn't prevent us from setting the existing {{--executor-environment-variables}} on docker executors. [~cmaloney]: bq. The --executor-environment-variables is given directly to executors, and then gets inherited from the executor by all tasks the executors launch currently. Not really. Custom executors can launch tasks however they want. It's up to them whether or not they pass their env vars. And the docker command executor (mesos-docker-executor) doesn't pass env vars through. So this is really only true for the mesos command executor. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996144#comment-14996144 ] Michael Gummelt commented on MESOS-3836: bq. I mean every executor should adhere to the same isolators that tasks do Isolators are set on containers. Thus executors and tasks, which run in containers, adhere to the same isolators. There are no isolators that tasks adhere to that executors don't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996105#comment-14996105 ] Michael Gummelt commented on MESOS-3836: If we decide to keep the existing functionality, my proposal is to have both {{--executor-environment-variables}} and something like {{--inherited-environment-variables}} or {{--host-environment-variables}}. The former would set env vars on all executors. The latter would set the inherited environment for containers, which would only apply to those containerizers that inherit the host environment (mesos and external). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
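The two-flag proposal in this thread amounts to keying the environment on whether the containerizer inherits the host environment. A hypothetical sketch of that merge rule; the flag names and this function are from the discussion above, not existing Mesos behavior:

```python
def executor_environment(containerizer, executor_env, inherited_env):
    """executor_env applies to every executor; inherited_env only to
    containerizers that inherit the host environment (mesos, external).
    Docker executors never see the inherited set."""
    env = {}
    if containerizer in ("mesos", "external"):
        env.update(inherited_env)
    env.update(executor_env)  # applies to all executors
    return env

host_env = {"PATH": "/opt/mesosphere/bin"}     # would break a docker image
common_env = {"JAVA_OPTS": "-Xmx256m"}
print("PATH" in executor_environment("docker", common_env, host_env))  # False
```

Under this split, the {{PATH}} problem from the issue description disappears: the host {{PATH}} lands only in containers that actually share the host filesystem.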
[jira] [Comment Edited] (MESOS-3836) `--executor-environment-variables` may not apply to docker containers
[ https://issues.apache.org/jira/browse/MESOS-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996144#comment-14996144 ] Michael Gummelt edited comment on MESOS-3836 at 11/9/15 7:15 AM: - bq. I mean every executor should adhere to the same isolators that tasks do Isolators are set on containers (or rather, they define containers). Thus executors and tasks, which run in containers, adhere to the same isolators. There are no isolators that tasks adhere to that executors don't. was (Author: mgummelt): bq. I mean every executor should adhere to the same isolators that tasks do Isolators are set on containers. Thus executors and tasks, which run in containers, adhere to the same isolators. There are no isolators that tasks adhere to that executors don't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2797) mesos-slave dies when it hits open file descriptor limit
Michael Gummelt created MESOS-2797: -- Summary: mesos-slave dies when it hits open file descriptor limit Key: MESOS-2797 URL: https://issues.apache.org/jira/browse/MESOS-2797 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.22.1 Reporter: Michael Gummelt I'm running mesos-slave under systemd as part of Mesosphere's DCOS. The slave process is repeatedly dying as it hits the system's open file descriptor limit of 1024. See the below master-slave.log file. I stop mesos-slave, remove the directory specified in the slave logs, and still get the same error. lsof shows that mesos-slave is opening several hundred pipes. See the below lsof.log file. mesos-slave.log Jun 01 23:49:19 dcos-01 systemd[1]: mesos-slave.service holdoff time over, scheduling restart. Jun 01 23:49:19 dcos-01 systemd[1]: Stopping Mesos Slave... Jun 01 23:49:19 dcos-01 systemd[1]: Starting Mesos Slave... Jun 01 23:49:19 dcos-01 ping[14896]: PING leader.mesos (172.17.8.101) 56(84) bytes of data. Jun 01 23:49:19 dcos-01 ping[14896]: 64 bytes from dcos-01 (172.17.8.101): icmp_seq=1 ttl=64 time=0.023 ms Jun 01 23:49:19 dcos-01 ping[14896]: --- leader.mesos ping statistics --- Jun 01 23:49:19 dcos-01 ping[14896]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms Jun 01 23:49:19 dcos-01 ping[14896]: rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms Jun 01 23:49:19 dcos-01 systemd[1]: Started Mesos Slave. Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.713110 14899 logging.cpp:172] INFO level logging started! 
Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715564 14899 main.cpp:156] Build: 2015-05-19 18:43:41 by Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715600 14899 main.cpp:158] Version: 0.22.1 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715618 14899 main.cpp:165] Git SHA: dd082c8656eb6e93e091a12fc5cfee3700a61bb1 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.830142 14899 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845340 14899 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845696 14899 main.cpp:200] Starting Mesos slave Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,845:14899(0x7f111ff43700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@716: Client environment:host.name=dcos-01 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@723: Client environment:os.name=Linux Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Thu Mar 26 10:44:46 UTC 2015 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@733: Client environment:user.name=(null) Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@741: Client environment:user.home=/root Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@753: Client environment:user.dir=/ 
Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=1 watcher=0x7f11246c0140 sessionId=0 sessionPasswd=null context=0x7f1114000b40 flags=0 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846161 14899 slave.cpp:174] Slave started on 1)@172.17.8.101:5051 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846206 14899 slave.cpp:194] Moving slave process into its own cgroup for subsystem: cpu Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.8.101:2181] Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.8.101:2181], sessionId=0x14d77b31175030e, negotiated timeout=1 Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.856979 14900 group.cpp:313] Group process (group(1)@172.17.8.101:5051) connected to ZooKeeper Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.857028 14900 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0) Jun
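The 1024 ceiling in this report is the default soft RLIMIT_NOFILE; a daemon that multiplexes hundreds of pipes (as the lsof output shows) needs it raised, e.g. via the real systemd directive LimitNOFILE= in the mesos-slave unit. Checking and raising the limit from code:

```python
import resource

# Inspect the current open-file-descriptor limits. The report's agent
# died at the common default soft limit of 1024.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-fd limits: soft=%d hard=%d" % (soft, hard))

# A process may raise its own soft limit up to the hard limit without
# privileges; raising the hard limit itself needs root (or, for a
# systemd service, LimitNOFILE= in the unit file).
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```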